August 15th, 2018

The different flavors of AutoML

Category: AutoML, Data Science, H2O, H2O Driverless AI

In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software (e.g. H2O, scikit-learn, Keras). Although these tools have made it easy to train and evaluate machine learning models, a good amount of data science knowledge is still required to create the highest-quality model for a given dataset. Writing the code to perform a hyperparameter search over many different types of algorithms can also be time-consuming and repetitive work.

What is AutoML?

The term “AutoML” (Automatic Machine Learning) refers to automated methods for model selection and/or hyperparameter optimization. AutoML is also a subfield of machine learning that has a rich academic history, an annual workshop at the International Conference on Machine Learning (ICML), and academic research labs devoted to this topic (e.g. University of Freiburg Machine Learning Lab in Germany).

The AutoML field began by developing methods for automating hyperparameter optimization in single models, and now includes such techniques as automated stacking (ensembles), neural architecture search, pipeline optimization and feature engineering.
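As a rough illustration of the starting point described above (hyperparameter optimization for a single model), here is a minimal random-search sketch in plain Python. The toy dataset, the one-dimensional ridge regression model, the `lam` hyperparameter name and the `random_search` helper are all assumptions made for this example; they are not taken from any particular AutoML tool.

```python
import random

rng = random.Random(0)

# Toy data (assumed for illustration): y = 2x + noise.
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2 * x + rng.gauss(0, 0.1) for x in xs]
x_train, y_train = xs[:40], ys[:40]
x_valid, y_valid = xs[40:], ys[40:]


def validation_loss(config):
    """Fit 1-D ridge regression (no intercept) and return the validation MSE."""
    lam = config["lam"]  # regularization strength: the hyperparameter being tuned
    w = sum(x * y for x, y in zip(x_train, y_train)) / (
        sum(x * x for x in x_train) + lam
    )
    return sum((w * x - y) ** 2 for x, y in zip(x_valid, y_valid)) / len(x_valid)


# Search space: each hyperparameter maps to a sampler. lam is drawn
# log-uniformly from [1e-4, 1e2].
space = {"lam": lambda r: 10 ** r.uniform(-4, 2)}


def random_search(objective, space, n_trials=50, seed=1):
    """Sample configurations at random and keep the one with the lowest loss."""
    sampler_rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {name: sample(sampler_rng) for name, sample in space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(validation_loss, space)
print(best_config, best_loss)
```

Random search is only one of several possible strategies; grid search or Bayesian optimization could be plugged into the same loop by changing how `config` is proposed.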

AutoML Tools

The goal of AutoML software is two-fold:

  1. To enable non-experts to train high quality machine learning models.
  2. To improve the efficiency of finding optimal solutions to machine learning problems.

There are a handful of different AutoML platforms (open source, closed source and SaaS), aiming to solve different types of supervised machine learning problems. AutoML tools can largely be categorized by use-case or, more simply, by the format of the training data:

  • IID tabular data (numeric and/or categorical data)
  • Time-series tabular data (numeric and/or categorical data with a time-dependency)
  • Raw text data (text classification)
  • Raw image data (image classification)

Multiple Algorithms

Although some tools handle multiple domains, the majority of AutoML tools are designed for the most common use-case, which is IID tabular data (a table with rows and columns). In open source, that would include tools such as H2O AutoML, auto-sklearn (along with its predecessor, Auto-WEKA) and TPOT. H2O.ai’s Driverless AI is a platform that’s geared towards IID tabular data, but also supports time-series data and raw text. In the case of Driverless AI, automatic feature generation is also part of the AutoML process (and one of the key differentiators between open source H2O AutoML and Driverless AI).

Since there is no single algorithm that consistently performs the best across all datasets (a consequence of the “No Free Lunch Theorem”), these AutoML tools explore a variety of algorithms such as Gradient Boosting Machines (GBMs), Random Forests and GLMs, and in some cases also consider Deep Neural Networks. Another approach shared by most of these tools is ensembling several models together to get a stronger final model, a technique that features in many winning Kaggle solutions.
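The two ideas above (trying several algorithms, then ensembling them) can be sketched in a few lines of plain Python. This is a simplified illustration, not how any of the tools named here are implemented: the three toy regressors, the dataset and the simple averaging ensemble are assumptions for the example, and real tools typically use stronger learners and stacking with a trained meta-learner instead of a plain average.

```python
import random


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return lambda x: a * x + b


def fit_knn(xs, ys, k=5):
    """k-nearest-neighbours regression on a single feature."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy nonlinear data (assumed for illustration): y = x^2 + noise.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]
x_train, y_train = xs[:150], ys[:150]
x_valid, y_valid = xs[150:], ys[150:]

# Train every candidate algorithm and rank by validation error.
candidates = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}
models = {name: fit(x_train, y_train) for name, fit in candidates.items()}
leaderboard = sorted((mse(m, x_valid, y_valid), name) for name, m in models.items())

# Ensemble: average the predictions of the two best models.
top_two = [models[name] for _, name in leaderboard[:2]]


def ensemble(x):
    return sum(m(x) for m in top_two) / len(top_two)


ensemble_mse = mse(ensemble, x_valid, y_valid)
print(leaderboard, ensemble_mse)
```

Under squared error, an averaged ensemble is never worse than the average error of its members, though it is not guaranteed to beat the single best model; that is one reason tools often learn the combination weights instead of fixing them.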

Deep Learning: Neural Architecture Search

At the other end of the spectrum is “neural architecture search” (NAS), an AutoML method most commonly used for image classification problems. When training a Deep Neural Network, there are many hyperparameters to tune (e.g. learning rate, batch size, dropout rate); however, one of the biggest contributors to model performance is the network architecture. Two areas where Deep Neural Networks have improved performance over traditional machine learning methods are image classification and natural language processing (NLP).

There are a handful of open source tools for neural architecture search, including TensorFlow and PyTorch implementations of Efficient Neural Architecture Search (ENAS), and Auto-Keras, which performs an efficient NAS using Bayesian hyperparameter optimization. On the SaaS side, Google Cloud offers Cloud AutoML Vision and Cloud Natural Language, which perform a neural architecture search for image and text classification, respectively. These tools require raw image or text data directly, and are therefore not appropriate for typical numeric or categorical tabular data.

How can AutoML Help You?

If you’re part of the majority of data scientists who work with tabular or “relational” data (tables with numeric and/or categorical columns), then H2O AutoML or Driverless AI are great tools to use.

2017 Kaggle.com Data Science Survey Results: What type of data is used at work?

If you have image data, Google Cloud AutoML Vision is an option. However, there are several open source tools (listed above) that will do the same thing at no cost and allow you to keep your data off the cloud. Another option is to skip the neural architecture search altogether, and use a pre-trained image classification model combined with transfer learning, as explained in this article by fast.ai co-founder, Rachel Thomas.

In the case of text data, you could use a general purpose tool such as Driverless AI, or a tool specifically designed for text such as Google Cloud Natural Language. If you prefer an open source solution, you can apply H2O’s Word2Vec algorithm (or any other open source text processing tool) to convert the text into a numeric format which can be used by H2O AutoML (direct support for text data is on the H2O AutoML roadmap).
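To make the "convert the text into a numeric format" step concrete, here is a minimal sketch of one common approach: averaging the word vectors in each document to get one fixed-length numeric row per document. The tiny hand-written `EMBEDDINGS` table and the helper names are hypothetical stand-ins; in practice the vectors would be learned by a Word2Vec model, and this sketch does not use the H2O API.

```python
# Toy 3-dimensional embeddings (hypothetical values for illustration only);
# real Word2Vec vectors are learned from a corpus and have many more dimensions.
EMBEDDINGS = {
    "great": [0.9, 0.1, 0.0],
    "terrible": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.7, 0.3],
    "plot": [0.1, 0.6, 0.4],
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def document_vector(text, embeddings=EMBEDDINGS, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [embeddings[t] for t in tokenize(text) if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


docs = ["Great movie", "Terrible plot", "unknown words only"]
# Each document becomes one numeric row, i.e. ordinary tabular data.
table = [document_vector(d) for d in docs]
print(table)
```

Once every document is a row of numbers, the result can be joined with any other columns and handed to a tabular AutoML tool like any other dataset.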

Conclusion

There are many tools out there, and each is typically designed with a specific use-case or set of use-cases in mind. There are many factors to consider when choosing one, including data preparation, data privacy, cost, modeling options (types of algorithms inside), deployment options and ease-of-use. We hope that you’ll give H2O AutoML and Driverless AI a try!

About the Author

Erin LeDell

Erin is the Chief Machine Learning Scientist at H2O.ai. Erin has a Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on automatic machine learning, ensemble machine learning and statistical computing. She also holds a B.S. and M.A. in Mathematics. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io (acquired by GE Digital in 2016) and Marvin Mobile Security (acquired by Veracode in 2012), and the founder of DataScientific, Inc.
