Learn essential data preparation techniques for tabular and time series data using H2O Driverless AI. Understand data quality, build datasets, and customize preprocessing with Python.
Ideal for those looking to enhance their machine learning projects with Driverless AI.
What you'll learn
Data Quality for Machine Learning
Understand why clean and well-structured data is critical for model success.
Tabular Data Preparation
Learn the basics of supervised and unsupervised learning, tabular formats, and unit of analysis.
Custom Preprocessing in Driverless AI
See how to prepare datasets, automate tasks, and use Python code for custom preprocessing.
Time Series Data Basics
Get familiar with time series concepts like date columns, autoregressive models, and multiple series handling.
Dataset Splitting Techniques
Learn practical strategies to split and prepare time series datasets for model training.
Best Practices in Data Prep
Understand common challenges and apply solutions to improve model performance.
Course Playlist on YouTube
Join us for an insightful course on data preparation for machine learning with Driverless AI!
Explore fundamental principles, key concepts, and essential steps for effective data prep.
This course is divided into two chapters for structured learning:
- Chapter One: Why data preparation matters, and the role of the classical tabular format in traditional machine learning.
- Chapter Two: Advanced concepts tailored to time series data.
Enjoy!
Join Jonathan Farinela, Solutions Engineer at H2O.ai, as he delves into the crucial process of data preparation for machine learning.
In this informative session, Jonathan will cover essential principles, concepts, and steps required to ensure your data is ready for effective machine learning models.
From the significance of data quality to the classical tabular format and its relevance in machine learning, this session will equip you with the foundational knowledge needed for data preparation in ML projects.
Don't miss this opportunity to enhance your skills and optimize your data for successful machine learning outcomes.
Take a look at the latest updates in H2O.ai Aquarium, featuring a refreshed user interface and seamless integration with H2O.ai University.
These enhancements make it easier than ever to access hands-on labs and learning resources, all in one place.
▶ Start exploring Aquarium here: https://aquarium.h2o.ai/
▶ Learn more at H2O.ai University: https://h2o.ai/university/
In this video, Jonathan Farinela presents an in-depth guide to preparing and exploring data while utilizing Driverless AI for machine learning tasks. Here's what you'll find:
1. Introduction to Dataset Perspective: Get an overview of the dataset used, sourced from a Kaggle classification problem.
2. Exploring Data in Driverless AI: Discover how to import data into Driverless AI and analyze its characteristics using descriptive statistics.
3. Addressing Data Quality Issues: Delve into common issues like missing values, duplicates, and outliers, with Driverless AI offering automated solutions or customizable data recipes.
4. Data Preparation Actions: Explore a range of data preparation tasks within Driverless AI, including format adjustments, column management, and data filtering.
5. Best Practices for Data Preparation: Learn essential strategies for machine learning data prep, focusing on dataset size, usage frequency, and the significance of comprehensive information during model training.
6. Target Leakage and Avoidance: Understand the risks of target leakage, where information that would not be available at prediction time seeps into the training set, and how Driverless AI helps identify and prevent it.
7. Conclusion: Wrap up with a summary of data prep principles covered in the video, with a nod to future topics like time series data preparation.
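The data quality issues named in item 3 (missing values, duplicates, and outliers) can also be checked before the data is imported. The following is a minimal, dependency-free Python sketch, not the Driverless AI implementation; the sample rows, the column names, and the 1.5 x IQR outlier rule are assumptions chosen for illustration.

```python
from statistics import quantiles

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": 51000},
    {"id": 3, "age": 29, "income": 51000},   # exact duplicate of the previous row
    {"id": 4, "age": 41, "income": 50000},
    {"id": 5, "age": 38, "income": 950000},  # suspiciously large income
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (value is None)
    return counts


def duplicate_rows(rows):
    """Return rows that repeat an earlier row exactly."""
    seen, dupes = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            dupes.append(row)
        seen.add(key)
    return dupes


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = k * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]


incomes = [r["income"] for r in rows if r["income"] is not None]
print("missing per column:", missing_counts(rows))
print("duplicate rows:", duplicate_rows(rows))
print("income outliers:", iqr_outliers(incomes))
```

Whether a flagged value is a genuine error or a legitimate extreme is a judgment that depends on the business context, so a check like this is best treated as a prompt for review rather than an automatic filter.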
As a reminder, here is how to access and use H2O.ai Driverless AI via our Aquarium platform, in case you have not yet watched the video: https://youtu.be/2V9XCT7dDqk
1. Create an Account:
- Go to aquarium.h2o.ai in your web browser.
- If you're not registered, select "Create a new account."
- Fill in the required fields and click "Create account and receive temporary password."
- Check your email for the temporary password.
2. Login:
- Return to the login page, enter your credentials, and click "Login."
- Confirm you're not a robot by checking the corresponding box.
3. Explore Labs:
- If no labs are active upon login, click "Browse Labs" to see available options.
4. Choose a Lab:
- Explore different labs, including Driverless AI GUI and others.
- Specifically, opt for "Lab 4: Driverless AI Training (1.10.5 LTS)" to access Driverless AI.
5. Start the Lab:
- Review the lab details and duration.
- Click "Start lab" to initiate your session.
6. Wait for Lab Launch:
- Wait briefly for the lab to start; this may take a few minutes.
- Keep an eye on the lab's instance details, including its ID and status.
7. Access the Lab:
- Once the lab instance is ready, use the provided hyperlink to access it.
- Note the remaining time for your session.
Wishing you success on your learning journey!
Welcome back to our exploration of data preparation with H2O.ai Driverless AI. In this video, we delve into the essential principles and best practices necessary for effective time series analysis.
Time series problems involve a target variable whose values are tied to dates. Both the training and test datasets therefore need a date column, which is the cornerstone of time series analysis.
A minimal dataset for time series analysis with Driverless AI comprises just two columns: a date column and a target variable, aligning with the autoregressive nature of time series problems.
Handling multiple series within a dataset is common, requiring strategic approaches for effective model fitting. Grouped series, divided by factors such as stores or regions, demand careful consideration to ensure optimal model performance.
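To make the dataset shapes described above concrete, here is a small Python sketch of a long-format table holding several series. The column names (`date`, `store`, `sales`) and the values are hypothetical; the helper only illustrates how a grouping column separates one series from another and is not how Driverless AI handles groups internally.

```python
from collections import defaultdict
from datetime import date

# Hypothetical long-format data: one row per (date, store) pair.
# Dropping the "store" column would leave the minimal two-column layout.
rows = [
    {"date": date(2024, 1, 2), "store": "A", "sales": 135.0},
    {"date": date(2024, 1, 1), "store": "A", "sales": 120.0},
    {"date": date(2024, 1, 1), "store": "B", "sales": 80.0},
    {"date": date(2024, 1, 3), "store": "B", "sales": 90.0},
    {"date": date(2024, 1, 2), "store": "B", "sales": 95.0},
]


def split_into_series(rows, group_col, date_col="date"):
    """Group rows by series key and sort each series chronologically."""
    series = defaultdict(list)
    for row in rows:
        series[row[group_col]].append(row)
    return {
        key: sorted(group, key=lambda r: r[date_col])
        for key, group in series.items()
    }


series = split_into_series(rows, group_col="store")
for store, group in series.items():
    print(store, len(group), group[0]["date"], group[-1]["date"])
```

Inspecting each series separately like this makes it easier to spot groups with short histories or gaps in their dates before training.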
Best practices include enriching datasets with additional information beyond date and target, such as holiday indicators or marketing campaign data.
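As an illustration of this kind of enrichment, the sketch below adds a holiday flag and a campaign flag to each row. The holiday calendar, the campaign window, and the column names are all hypothetical placeholders; real values would come from the business.

```python
from datetime import date

# Hypothetical calendar and campaign windows; replace with real business data.
HOLIDAYS = {date(2024, 1, 1)}
CAMPAIGNS = [(date(2024, 1, 2), date(2024, 1, 3))]  # inclusive (start, end) pairs

rows = [
    {"date": date(2024, 1, 1), "sales": 60.0},
    {"date": date(2024, 1, 2), "sales": 140.0},
    {"date": date(2024, 1, 3), "sales": 150.0},
    {"date": date(2024, 1, 4), "sales": 110.0},
]


def enrich(rows, holidays, campaigns, date_col="date"):
    """Return copies of rows with is_holiday and on_campaign flags (0 or 1)."""
    enriched = []
    for row in rows:
        d = row[date_col]
        on_campaign = any(start <= d <= end for start, end in campaigns)
        enriched.append(
            {**row, "is_holiday": int(d in holidays), "on_campaign": int(on_campaign)}
        )
    return enriched


enriched = enrich(rows, HOLIDAYS, CAMPAIGNS)
for row in enriched:
    print(row)
```

Features of this kind are only useful for forecasting if their future values are known in advance, which is usually the case for holidays and planned campaigns.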
Dataset size is crucial: the training data should ideally be about three times the size of the test dataset, and the test dataset should match the forecast horizon. The age of the data also matters, as older data may introduce noise and reduce model accuracy.
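The sizing guideline above can be sketched as a time-based split: hold out the most recent dates as the test set, sized to the forecast horizon, and check that the remaining history is at least three times as long. The function name, the `min_ratio` parameter, and the sample data are assumptions made for this example.

```python
from datetime import date, timedelta


def time_based_split(rows, horizon, date_col="date", min_ratio=3):
    """Hold out the last `horizon` distinct dates as test; earlier rows are train.

    Returns (train, test, meets_ratio), where meets_ratio reports whether the
    training window covers at least `min_ratio` times as many dates as the test.
    """
    dates = sorted({row[date_col] for row in rows})
    if len(dates) <= horizon:
        raise ValueError("not enough history for the requested horizon")
    cutoff = dates[-horizon]  # first date that belongs to the test window
    train = [r for r in rows if r[date_col] < cutoff]
    test = [r for r in rows if r[date_col] >= cutoff]
    meets_ratio = (len(dates) - horizon) >= min_ratio * horizon
    return train, test, meets_ratio


# Hypothetical daily series: 40 days of history, 7-day forecast horizon.
start = date(2024, 1, 1)
rows = [{"date": start + timedelta(days=i), "sales": 100.0 + i} for i in range(40)]

train, test, meets_ratio = time_based_split(rows, horizon=7)
print(len(train), len(test), meets_ratio)
```

Because the cutoff is a single date, a dataset with multiple series is split at the same point in time for every series, and no training row is ever later than a test row.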
Reframing the problem to align with business objectives and determining the appropriate granularity of the data are critical steps in optimizing model performance.
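Changing granularity usually means aggregating the target to a coarser time unit before modeling. The sketch below rolls hypothetical daily sales up to weekly totals; the choice of summing (rather than averaging) and of weeks that start on Monday are assumptions that depend on the business question.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_weekly(rows, date_col="date", target_col="sales"):
    """Sum the target per week, keyed by the Monday that starts each week."""
    totals = defaultdict(float)
    for row in rows:
        d = row[date_col]
        week_start = d - timedelta(days=d.weekday())  # weekday() is 0 on Monday
        totals[week_start] += row[target_col]
    return [
        {date_col: week, target_col: total}
        for week, total in sorted(totals.items())
    ]


# Hypothetical daily data: 14 days starting on Monday, 2024-01-01.
start = date(2024, 1, 1)
daily = [{"date": start + timedelta(days=i), "sales": 10.0} for i in range(14)]

weekly = aggregate_weekly(daily)
for row in weekly:
    print(row["date"], row["sales"])
```

A weekly series has fewer rows but is typically less noisy than the daily one, so the right level is the one at which the business actually needs the forecast.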
1. Introduction to the DataPrep for DriverlessAI Course (0:45)
2. Machine Learning Data Prep Basics (6:46)
3. Using H2O.ai Aquarium Labs | Latest Update (5:57)
4. Data Exploration Simplified: Guide to Driverless AI Prep (21:00)
5. Quick Overview: Time Series Data Prep with H2O.ai Driverless AI (20:07)
Statistician with over a decade of experience in analytics and data science, primarily in research and development, with additional experience in demand forecasting for retail and CRM for financial services. For the last 5+ years, focused on helping customers through pre- and post-sales: leading and conducting proofs of value (POV) for AI and ML projects, and educating, enabling, and driving AI solutions for business problems, from data to value, always aiming for ROI and financial impact for the business.