This course builds on the Level 1 overview with a deeper look at Large Language Models and practical GenAI workflows.
Learn how to work with RAG techniques, fine-tune models, prepare datasets, and evaluate performance using tools like Enterprise h2oGPTe, LLM DataStudio, EvalGPT, and the GenAI AppStore.
Led by Kaggle Grandmaster Sanyam Bhutani, the course includes Python labs, research-based materials, and guided practice using H2O.ai tools across the GenAI ecosystem.
What you'll learn
Large Language Model Fundamentals
Build understanding of how LLMs work and their role in enterprise AI applications.
Fine-Tuning with LLM DataStudio
Configure and train language models using H2O's specialized fine-tuning platform.
Dataset Preparation Best Practices
Structure and prepare data effectively for training and evaluating language models.
Model Evaluation Methodologies
Use H2O.ai EvalGPT and assessment frameworks to measure model performance and quality.
H2O GenAI Platform Navigation
Work with GenAI AppStore, H2O.ai Wave, and integrated ecosystem tools for end-to-end workflows.
Course Playlist on YouTube
Welcome to our hands-on GenAI LLM training! Dive into the entire life cycle of Large Language Models (LLMs) with practical exercises.
From foundational concepts to advanced topics like RAG with LLMs, dataset prep, model fine-tuning, and app creation, we've got you covered.
Explore each step with Python notebooks and interactive exercises.
Whether you're new or experienced, this course equips you with valuable insights and skills for mastering GenAI tech. Join us!
In this video, we will dive into the basics of the large language model (LLM) pipeline. We'll explore how these models can do more than just predict the next word in a sentence – they can think, reason, and even philosophize.
We'll try to understand how LLMs are trained and why they're so powerful. Then, we'll take a look at the inner workings of LLMs, focusing on their architecture and how data preparation plays a crucial role in their performance. We'll also discuss evaluation methods and techniques for enhancing model performance.
One exciting aspect we'll cover is retrieval-augmented generation (RAG), where LLMs use stored documents to generate responses.
We'll also touch on prompt engineering, which can further improve LLM performance, and the importance of guardrails in keeping these models on track and unbiased.
Finally, we'll briefly mention GenAI Apps – applications that leverage LLMs – and how you can explore them further.
This video aims to set the stage for practical exercises in upcoming labs. Ready to explore the world of large language models? Let's dive in!
Learn to summarize documents, create LinkedIn posts, and uncover insights with Enterprise GPTe and Python Notebooks.
Follow our hands-on exercises for practical skills and valuable insights, whether you're a beginner or advanced user.
Instructions to access Enterprise H2O GPTe: You can gain access to Enterprise H2O GPTe by logging in with your H2O.ai Managed Cloud account or your Gmail or GitHub credentials via the following link: h2ogpte.genai.h2o.ai
Access to the h2oGPT research paper pdf file: h2oGPT: Democratizing Large Language Models
The Link for the Python LAB 1 can be found here: LAB 1 - RAG.ipynb
Please use the following link for the rag_url variable: rag_url = 'https://h2ogpte.genai.h2o.ai/'
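To give you a feel for what the notebook covers before you open it, here is a minimal sketch of a RAG round trip with the h2ogpte Python client. The API key, collection name, and file name are placeholders, and the exact method names may differ slightly from the version used in LAB 1, so treat this as an outline rather than the notebook itself.

```python
# Rough sketch of a RAG workflow with the h2ogpte Python client (pip install h2ogpte).
# The API key and file name are placeholders; method names may vary between client versions.
from h2ogpte import H2OGPTE

rag_url = 'https://h2ogpte.genai.h2o.ai/'
client = H2OGPTE(address=rag_url, api_key='<your-api-key>')  # placeholder key

# Create a collection and ingest a document (e.g. the h2oGPT research paper PDF).
collection_id = client.create_collection(
    name='LLM Course Lab 1',
    description='RAG over the h2oGPT research paper',
)
with open('h2ogpt_paper.pdf', 'rb') as f:            # placeholder file name
    upload_id = client.upload('h2ogpt_paper.pdf', f)
client.ingest_uploads(collection_id, [upload_id])

# Ask a question grounded in the ingested document.
chat_session_id = client.create_chat_session(collection_id)
with client.connect(chat_session_id) as session:
    reply = session.query('Summarize this paper in three bullet points.', timeout=90)
    print(reply.content)
```

The same pattern (ingest a document, open a chat session, query it) underlies the summarization and LinkedIn-post exercises in the lab.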
Discover the potential of Gen AI Apps in this brief video.
Explore our Gen Cloud for free access to a myriad of innovative ideas and inspirations. From Call Center GPT to Enterprise H2O GPTe, our offerings provide a glimpse into the possibilities. Dive in, browse, and experiment at your leisure—all at no cost.
This class does not have any Python Notebook Lab.
Instructions to access GenAI AppStore: You can gain access to GenAI AppStore via the following link: genai.h2o.ai/appstore
Instructions to access the H2O.ai Wave Documentation App: You can gain access to Wave App Documentation via the following link: wave.h2o.ai
Lab Three introduces fine-tuning language models. You'll learn to fine-tune both small and large models, beginning with a simple model using Hugging Face's library and progressing to larger models like GPT-3 using LLM Studio.
The process involves dataset preparation, tokenization, model setup, and training. Finally, you'll compare fine-tuning traditional models with large language models and explore dataset preparation techniques as homework.
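As a rough preview of the small-model part of the lab, the sketch below fine-tunes a tiny causal language model with Hugging Face's Trainer. The model name, toy dataset, and hyperparameters are illustrative assumptions chosen to run quickly, not the notebook's actual configuration.

```python
# Illustrative small-model fine-tuning with Hugging Face (not the exact LAB 3 code).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilgpt2"                      # small model chosen for this example
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 style models have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy instruction-style dataset; in the lab this comes from your prepared data.
texts = ["Instruction: greet the user.\nResponse: Hello, how can I help?",
         "Instruction: say goodbye.\nResponse: Goodbye, have a great day!"]
dataset = Dataset.from_dict({"text": texts})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="finetune-demo", num_train_epochs=1,
                         per_device_train_batch_size=2, logging_steps=1)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```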
Here's how to access LLM DataStudio for training purposes:
1. Visit our Aquarium platform at aquarium.h2o.ai.
2. Watch the following video to learn how to create an account on Aquarium: Accessing h2o.ai Aquarium Labs.
3. After you've gained access to Aquarium, navigate to the LLM Data Studio Lab.
4. Start an instance to access the user interface through the LLM Data Studio URL link at the page's bottom.
The instance will be available for you to use for 120 minutes, at the end of which all its data will be erased. Enjoy your training session with LLM Data Studio!
💡Watch our Aquarium walkthrough here: https://youtu.be/FSBlJeSadgw
The Link for the Python LAB 3 can be found here: LAB 3 - Fine Tuning.ipynb
In this fourth lab, we'll focus on dataset preparation for downstream NLP tasks. We'll explore various techniques programmatically in Python, using libraries like PyTorch Transformers, pandas, NumPy, and Matplotlib.
The dataset we'll work with consists of LinkedIn influencer posts collected in 2021, containing metadata such as the influencer's name, number of followers, timespan, content, media type, and more. After loading the dataset from the S3 bucket, we'll examine its contents, including the number of examples and influencers.
Next, we'll sample a subset of the dataset and begin cleaning it. We'll remove profanity using a threshold approach and conduct quality checks based on the Flesch-Kincaid Grade Level. Additionally, we'll write custom functions to handle whitespace, maximum length, and column selection.
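As a rough illustration of that cleaning pass, the sketch below combines pandas, the textstat package for the Flesch-Kincaid grade, and a simple profanity scorer. The column names, library choices, subset size, and thresholds are assumptions for demonstration and will differ from the notebook.

```python
# Illustrative cleaning pass (pip install pandas textstat alt-profanity-check).
# Column names and thresholds are assumptions, not the LAB 4 notebook's exact code.
import pandas as pd
import textstat                                  # Flesch-Kincaid grade level
from profanity_check import predict_prob         # simple profanity scorer (assumed choice)

df = pd.read_csv("influencers_data.csv")         # LinkedIn influencer posts
df = df.dropna(subset=["content"])
df = df.sample(n=2000, random_state=42)          # illustrative subset size

# Drop posts the profanity model scores above a threshold.
df = df[predict_prob(df["content"].tolist()) < 0.5]

# Keep posts whose readability falls in a sensible Flesch-Kincaid range.
df["fk_grade"] = df["content"].apply(textstat.flesch_kincaid_grade)
df = df[(df["fk_grade"] >= 4) & (df["fk_grade"] <= 14)]

# Custom helpers for whitespace, maximum length, and column selection.
df["content"] = df["content"].str.replace(r"\s+", " ", regex=True).str.strip()
df = df[df["content"].str.len() <= 3000]
df = df[["name", "content", "reactions"]]        # assumed column names
```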
After cleaning the dataset, we'll further refine it by selecting the top-performing posts based on reactions. With the cleaned dataset in hand, we'll utilize H2O GPT to generate titles for the influencer content, employing zero-shot prompting.
For fine-tuning, we'll create instructions for H2O GPT and run it over the entire dataset. Alternatively, we'll explore LLM Data Studio, a tool specifically designed for LLM-based tasks. This tool streamlines the data preparation process by automatically converting files into question-answer pairs and providing options for cleaning, augmenting, and quality checking.
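Here is one way the instruction-building step could look in code: each cleaned post is turned into a prompt asking for a title, and the prompt/response records are written out as JSONL. The prompt wording and file format are assumptions for illustration, not the lab's exact approach.

```python
# Illustrative conversion of cleaned posts into zero-shot title prompts /
# instruction records (prompt wording and JSONL format are assumptions).
import json

instruction = "Write a short, engaging title for the following LinkedIn post."

records = []
for _, row in df.iterrows():                      # `df` is the cleaned dataset from above
    records.append({
        "prompt": f"{instruction}\n\nPost:\n{row['content']}\n\nTitle:",
        "response": "",                           # to be filled in by h2oGPT generations
    })

with open("title_prompts.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```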
Your homework for this lab is to upload your own documents to Data Studio, experiment with different settings, and observe the outputs. Understanding the nuances of data preparation for LLMs is essential for effectively utilizing these models. Once you've completed this task, we'll move on to the final lab, where we'll learn how to evaluate LLMs.
Here's how to access LLM DataStudio for training purposes:
1. Visit our Aquarium platform at aquarium.h2o.ai.
2. Watch the following video to learn how to create an account on Aquarium: Accessing h2o.ai Aquarium Labs.
3. After you've gained access to Aquarium, navigate to the LLM Data Studio Lab.
4. Start an instance to access the user interface through the LLM Data Studio URL link at the page's bottom.
The instance will be available for you to use for 120 minutes, at the end of which all its data will be erased. Enjoy your training session with LLM Data Studio!
Please be aware that the h2oGPT exercise featured in the current video (found in the One Step Further section of LAB 4 accompanying this notebook) is solely for demonstration purposes. The endpoint used in the demonstration will not function for you.
You can access the influencers_data.csv file at the following link: LinkedIn Influencers' Data
The Link for the Python LAB 4 can be found here: LAB 4 - Data Preparation.ipynb
To access h2oGPT for learning purposes, visit our h2oGPT platform using the link provided: gpt.h2o.ai.
You'll have open access using the credentials:
username: guest
password: guest
In this final lab, you will focus on evaluating large language models (LLMs) programmatically.
You will learn to compare LLMs using methods like the BLEU score and ROUGE score, but these methods have limitations. The lab introduces a more effective approach: using a third language model as a judge to compare LLMs.
Scores are assigned by comparing responses from the different models; GPT-3.5 is used as the judge in this case, but any capable model could serve in that role. The lab concludes by encouraging you to explore model evaluation further, watch additional lectures on H2O LLM evaluation, and consider taking a quiz for certification.
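For a feel of the metric-based comparison, here is a small sketch using Hugging Face's evaluate package; the candidate outputs and reference text are made up for illustration. In the judge-based approach, you would instead send both candidates and the original question to a third model (GPT-3.5 in the lab) and ask it to score or rank them.

```python
# Metric-based comparison sketch (pip install evaluate sacrebleu rouge_score).
# The reference and model outputs below are made-up examples.
import evaluate

reference = "Large language models generate text by predicting the next token."
candidates = {
    "model_a": "LLMs generate text by predicting the next token in a sequence.",
    "model_b": "Language models are trained on lots of text data.",
}

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

for name, output in candidates.items():
    b = bleu.compute(predictions=[output], references=[[reference]])
    r = rouge.compute(predictions=[output], references=[reference])
    print(f"{name}: BLEU={b['bleu']:.3f}  ROUGE-L={r['rougeL']:.3f}")
```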
Feel free to take a look at a more detailed presentation of our LLM EvalGPT app made by Andreea Turcu at the following link: Introducing H2O LLM EvalGPT
Instructions to access H2O.ai EvalGPT: You can gain access publicly to H2O.ai EvalGPT via the following link: evalgpt.ai
Please be aware that the h2oGPT exercise featured in the current video (found in the One Step Further section of LAB 5 accompanying this notebook) is solely for demonstration purposes. The endpoint used in the demonstration will not function for you.
You can access the influencers_data.csv file at the following link: LinkedIn Influencers' Data
The Link for the Python LAB 5 can be found here: LAB 5 - Evaluation.ipynb
To access h2oGPT for learning purposes, visit our h2oGPT platform using the link provided: gpt.h2o.ai.
You'll have open access using the credentials:
username: guest
password: guest
1. Mastering GenAI LLMs: Hands-On Training Guide (3:33)
2. Understanding the Foundations of Large Language Models (9:05)
3. Practical RAG Techniques: Interacting with Enterprise H2O GPTe (7:41)
4. GenAI AppStore: Your Gateway to Innovative Solutions (2:25)
5. A Comprehensive Guide to Fine-Tuning Language Models (12:55)
6. Mastering Dataset Preparation: Techniques and Best Practices (12:19)
7. Mastering LLM Evaluation: Metrics and Methodologies (8:41)
Sanyam Bhutani is a Machine Learning Engineer and AI Content Creator at H2O.ai. He is a Machine Learning Practitioner recognized by inc42 and Economic Times (links to the interviews: inc42, Economic Times). Sanyam is an active Kaggler, where he is a Triple Tier Expert ranked in the global top 1% across all categories, and an active AI blogger on Medium and Hackernoon (Medium blog link) with over 1 million views overall. He is also the host of the Chai Time Data Science Podcast, where he interviews top practitioners, researchers, and Kagglers. You can follow him on Twitter or subscribe to his podcast.