October 16th, 2018

How This AI Tool Breathes New Life Into Data Science

Category: Beginners, Data Journalism, Data Science, Deep Learning, Driverless, Explainable AI, GPU, H2O Driverless AI, Machine Learning, NLP, Python, R, Technical

Ask any data scientist in your workplace: any supervised learning ML/AI project will go through many steps and iterations before it can be put into production, starting with the question, “Are we solving a regression or a classification problem?”

  1. Data Collection & Curation
  2. Are there outliers? What is the distribution? What do we do with what we have?
  3. Feature Engineering & class-imbalance adjustments – encoding/munging, etc. How do we handle text, numeric, categorical, and binary data types?
  4. Careful separation of training, validation, and test data sets
  5. Algorithm and model selection, tuning, and cross-validation (a hand-rolled sketch of steps 4–7 follows this list)
  6. Run it on a cloud or on-prem platform where algorithms (not just TensorFlow) run in parallel on GPUs (up to 100x faster). How about using the latest gradient boosting and XGBoost algorithms, parallelized on GPUs?
  7. Are we avoiding overfitting?
  8. Bonus points – run multiple ensembles from the best winning algorithms and their parameters and choose the winning ensemble.
  9. Build a scoring pipeline that can go straight into production with just some basic hooks
  10. Rinse and repeat the above every time you get new data or when models decay.
  11. Explain to the business how every model is doing and what it is doing – ALL THE TIME.
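For a sense of how much manual work just steps 4 through 7 involve, here is a minimal hand-rolled sketch using scikit-learn and XGBoost. This is not the author's code – the file path, target column, and parameter grid are placeholders:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Load the ground-truth data (placeholder file and target column)
df = pd.read_csv("ground_truth.csv")
X, y = df.drop(columns=["target"]), df["target"]

# Step 4: careful separation of training and held-out test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Steps 5-6: model selection and tuning with 5-fold cross-validation
# (swap tree_method="hist" for "gpu_hist" to run the boosting on a GPU)
search = GridSearchCV(
    XGBClassifier(n_estimators=300, tree_method="hist"),
    param_grid={"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# Step 7: compare cross-validated AUC vs. held-out AUC to spot overfitting
test_auc = roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1])
print("CV AUC:  ", search.best_score_)
print("Test AUC:", test_auc)
```

And this covers only four of the eleven steps – feature engineering, ensembling, the scoring pipeline, and the retraining loop each need their own code on top of it.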

Every step above has a myriad of challenges. Today, all of the above are very painful to do, even for the best data scientists on a team. If your data has 200 feature columns that are a mix of numeric, categorical, and text data, the problem becomes exponentially harder. Even savvy data scientists avoid running exhaustive tests, procedures, and scientific methods; they stick to the basics or, at best, automate one or two of the steps above. The most automation available today comes from a few cloud/SaaS vendors – they let you choose an algorithm with your base features and run some repetitive hyper-parameter tuning to get the best model you can use in production.
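To see why mixed column types hurt so much, here is a minimal sketch of what just the encoding step can look like when wired up by hand with scikit-learn. The column names are hypothetical, and a real project would also need imputation, rare-level handling, target encoding, and more:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hand-wired encoding for a table that mixes numeric, categorical, and text
# columns. Column names are hypothetical; 200 real columns would need far more.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),                            # numeric features
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["state", "plan"]),  # categorical features
    ("text", TfidfVectorizer(max_features=5000), "comments"),                    # one free-text column
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(X_train, y_train)   # every new column or type change means rewiring this
```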

In reality, doing the above right takes a lot more than repetitive model tuning or a single algorithm. We are not even talking about a big data problem – even with just 1 million rows, you could spend days, weeks, or a month getting to an AUC of 0.94 or a really low log loss with no overfitting, only to find your next batch of data has you chasing the problem all over again.

What if you had a tool that solves the above end-to-end problem using AI?

Driverless AI from H2O (AI to do AI) 

Load your ground-truth CSV file or point to your data source and push a button. The tool handles all the steps under the hood – feature engineering and evolution, multiple algorithms running on GPUs, and more – and then outputs the code to put into production, along with an in-depth explainability report! How about getting onto the leaderboard in a few minutes to an hour with a tool that has Kaggle Grandmaster smarts built in?
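Besides the UI's single button, experiments can also be driven from Python. The sketch below assumes the driverlessai Python client; exact class and parameter names depend on your client and server versions, and the address, credentials, file, and target column are placeholders:

```python
# Illustrative sketch only: driving a Driverless AI experiment from Python.
import driverlessai

client = driverlessai.Client(
    address="http://localhost:12345", username="user", password="pass"
)

# Upload the ground-truth CSV and launch an experiment
ds = client.datasets.create(data="wine.csv", data_source="upload")
ex = client.experiments.create(
    train_dataset=ds,
    target_column="class",          # placeholder target column
    task="classification",
    scorer="AUC",                   # use "LOGLOSS" to optimize log loss instead
    accuracy=7, time=2, interpretability=7,
)

print(ex.metrics())                 # validation metrics for the final model
```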

The screenshot above shows a classification model I was building with the “Wine Data Set” from the UCI ML repository. Within 15 seconds of starting the classification, the first XGBoost model already gave me an AUC of 0.9267 and showed variable importance on the screen. As Driverless AI evolves the features and runs multiple algorithms (on 8 GPUs in this instance), you can watch the AUC continuously improve as it tunes LightGBM, XGBoost, TensorFlow, GLM, etc., with hyper-parameter optimization and feature evolution.

The next time I ran it, I used LOGLOSS as my scorer, and it finished with 0.7876 based on my initial settings. The evolution of both models and features is shown in this graph below:

Driverless AI can run on Docker on your PC/Mac or on EC2, Azure, or GCP instances (with multiple GPUs!) in the cloud. After you load your data and the experiment finishes, you can deploy a Java or Python scoring pipeline package that has the entire workflow inside. All you have to do is call the hooks from your production application with your new data stream to get results – whether it's a binary outcome, a numeric estimate/forecast, or a multi-class decision.
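As a rough illustration of that calling pattern, here is a hedged sketch of using a downloaded Python scoring pipeline. The generated module name, the Scorer API, and the column names below are assumptions and should be checked against the README that ships inside the pipeline package:

```python
# Hedged sketch of calling a downloaded Python scoring pipeline from a
# production application. The module name is generated per experiment
# (scoring_h2oai_experiment_*), so the import below is a placeholder.
import pandas as pd
from scoring_h2oai_experiment_abc123 import Scorer   # placeholder module name

scorer = Scorer()

# A new batch of rows arriving from the production data stream (placeholder columns)
new_rows = pd.DataFrame([{"alcohol": 13.2, "malic_acid": 1.78, "color_intensity": 5.1}])

predictions = scorer.score_batch(new_rows)            # per-row probabilities / estimates
print(predictions)
```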

Does Driverless AI replace Data Scientists?

Driverless AI makes data scientists super productive and helps them automate the end-to-end process – just like any other tool that makes mundane, repetitive tasks efficient and even exciting. Data scientists still have options to set up, monitor, and influence model building, and to override default decisions to accelerate model building or increase accuracy further. Driverless AI can also help data scientists build models across multiple applications simultaneously and simplify the model deployment/lifecycle chores – think about how much time a business can save solving various complex ML/AI problems and running them in production, all without losing explainability.

Some links:

H2O’s Driverless AI website

Download a 21-day trial from here

Documentation


About the Author

Karthik Guruswamy

Karthik is a Principal Pre-sales Solutions Architect with H2O. In his role, Karthik works with customers to define, architect and deploy H2O’s AI solutions in production to bring AI/ML initiatives to fruition.

Karthik is a “business first” data scientist. His expertise and passion have always been around building game-changing solutions by using an eclectic combination of algorithms drawn from different domains. He has published 50+ blogs on “all things data science” on LinkedIn, Forbes, and Medium over the years for a business audience, and speaks at vendor data science conferences. He also holds multiple patents around desktop virtualization and ad networks, and was a co-founding member of two startups in Silicon Valley.
