AI/ML Projects — Don’t get stymied in the last mile

Published: May 03, 2019


Written by: Karthik Guruswamy

Data scientists build AI/ML models from data and then deploy them to production, in addition to a plethora of tasks around data insights, data cleansing, and so on. Part of the data scientist's job is making models transparent, auditable, and explainable, both for regulators and for internal business use.

While model monitoring, deployment, and data engineering fall on the infrastructure side and have challenges of their own, creating auditable, transparent, and explainable AI/ML models that also perform well has long been elusive for AI/ML projects.

There is also no easy way to build quality models consistently, because doing so requires a lot of talent, which in turn requires a lot of tools, tool integration, a suite of algorithms, iterations, and so on. Each business problem and data set creates new challenges for data scientists. Data science experiments are great, but cranking out industrial-strength models day after day is another story. A high rate of conversion from business problems to quality models, with a short turnaround time, is what makes AI/ML initiatives successful.

Common Challenges in the Last Mile of AI/ML Model Creation:

Algorithms to use: It is hard to determine ahead of time which algorithm, combination of algorithms, or parameter settings will be the best fit. A list of top leaderboard algorithms is always a good starting point, but finding the right fit is a challenge in itself, as is building an ensemble of the top N algorithms by score.
Feature Engineering: Having data engineers do a whole lot of complex, combinatorial feature engineering ahead of time creates a ‘feature zoo’ that slows down AI/ML model building. Feature engineering includes converting categorical columns to numeric and vice versa, combining multiple columns, encoding, and so on. Data scientists rely heavily on feature engineering to create highly accurate models and often push that task to data engineers. Unfortunately, it is not easy to determine which features are important ahead of time; that only comes out of iterating and testing well. If the data changes over time, new features have to be discovered again, while the model in production lags in quality.
Model Documentation: Creating documentation on the deployed models and the winning features for auditability.
Model Explainability: Explaining how the current production model decides what it decides. Questions about a production model such as “What is the marginal effect of this column on the final outcome?”, “What is the numeric cutoff point of this column after which churn drops?”, and “I need the reason codes for the model prediction for customer X” have to be answered.
Scoring Pipeline: Packaging the ‘scoring pipeline’ in a consistent way that is fully portable across different environments. What is the use of data science experiments if the output cannot be used by downstream applications? Also, when the data changes, the features and the model change, so scoring pipelines need to be regenerated, which can be impossible to keep up with when done manually.
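To make the "ensemble of the top N algorithms by score" idea from the list above concrete, here is a minimal sketch in plain Python. The model names, validation scores, and predicted probabilities are all made up for illustration, and simple averaging is just one way to blend; it is not a description of how any particular tool builds its ensembles.

```python
from statistics import mean


def top_n_ensemble(candidates, n):
    """Average the predictions of the n best-scoring candidates.

    candidates maps a model name to (validation_score, predictions).
    Higher scores are better; all prediction lists have the same length.
    """
    ranked = sorted(candidates.items(), key=lambda item: item[1][0], reverse=True)
    chosen = ranked[:n]
    names = [name for name, _ in chosen]
    # zip(*...) walks the chosen models' predictions row by row.
    rows = zip(*(preds for _, (_, preds) in chosen))
    blended = [mean(row) for row in rows]
    return names, blended


# Hypothetical validation scores and predicted probabilities for three rows.
candidates = {
    "gbm": (0.84, [0.9, 0.2, 0.6]),
    "glm": (0.80, [0.7, 0.4, 0.5]),
    "random_forest": (0.82, [0.8, 0.1, 0.7]),
}
names, blended = top_n_ensemble(candidates, n=2)
print(names, blended)
```

The hard part in practice is not the averaging but producing trustworthy validation scores for every candidate, which is why doing this by hand for each new data set gets expensive.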

Even though 80% of enterprise data is tabular, bringing AI/ML projects to fruition remains a challenge for most enterprises. In fact, AI/ML projects stall because of the issues mentioned above, leaving businesses behind on the AI maturity curve.

Industrial strength AI/ML model creation on your tabular data

To show how some of the above issues can be tackled, I use the Pima Indians Diabetes Data Set from the National Institute of Diabetes and Digestive and Kidney Diseases (also available on Kaggle at https://www.kaggle.com/uciml/pima-indians-diabetes-database ) in this blog post.

You can read the description of the data at the link above. In essence, we are trying to build a model for the outcome “Diabetes: Yes or No” based on historical data. When new data arrives, the model can predict whether a patient has diabetes, based on what it learned from prior history. It is not a big data set, but the process works exactly the same way when applied to millions of rows. I chose this example because it is a public data set and the outcome is easy to understand without being a domain expert.

Algorithms and Feature Engineering

I’m going to use H2O.ai’s Driverless AI. I upload the data set on the Data Sets page and then split it 80/20 into diabetes_train and diabetes_test by right-clicking next to diabetes.csv and choosing “Split”.
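For readers who want to see what an 80/20 split amounts to outside the UI, here is a minimal sketch using only the Python standard library. It is a plain seeded random split over stand-in row ids; the split performed by the tool may differ (for example, it may stratify on the target), and the function name and defaults here are my own.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows with a fixed seed and cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the rows of diabetes.csv: 100 made-up row ids.
rows = list(range(100))
train, test = split_rows(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so a later experiment can be compared against the same held-out rows.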

I click on “Predict” next to the diabetes_train data set.

I then choose “Outcome” as my target variable and set the Scorer to AUC (Area Under the Curve: the higher the better, with fewer false positives and false negatives). I also set the test data set to diabetes_test and then click Launch Experiment.
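As a quick refresher on what the AUC scorer measures: it is the probability that a randomly chosen positive case gets a higher predicted score than a randomly chosen negative case, with ties counting as half. The sketch below computes it directly from that definition on made-up labels and scores; production libraries use faster rank-based implementations, but the result is the same quantity.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up outcomes (1 = diabetes) and predicted probabilities.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a higher score is better.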

The algorithms, cross-validation scheme, and feature engineering are all decided by Driverless AI (using our Kaggle Grandmasters’ recipes). Models then start getting built, and the experiment finishes in 13 minutes.

A note on automatic feature engineering in Driverless AI: Unlike the common practice of building features ahead of time (before model tuning), Driverless AI creates features on the fly using an evolutionary technique, avoiding the exhaustive feature generation step. This results in better performance, better features, and lower resource consumption.
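To give a feel for what "creating features on the fly with an evolutionary technique" can mean, here is a deliberately tiny toy in plain Python. It is not Driverless AI's algorithm: the candidate features (one arithmetic operation on two columns), the fitness measure (absolute Pearson correlation with the target), the mutation rule, and the data are all assumptions chosen to keep the sketch short.

```python
import math
import random

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant feature carries no signal
    return cov / math.sqrt(vx * vy)


def build(feature, data):
    op, left, right = feature
    return [OPS[op](a, b) for a, b in zip(data[left], data[right])]


def fitness(feature, data, target):
    return abs(pearson(build(feature, data), target))


def mutate(feature, columns, rng):
    """Randomly change the operation or one of the two input columns."""
    op, left, right = feature
    slot = rng.randrange(3)
    if slot == 0:
        op = rng.choice(sorted(OPS))
    elif slot == 1:
        left = rng.choice(columns)
    else:
        right = rng.choice(columns)
    return (op, left, right)


def evolve(data, target, generations=20, population_size=6, seed=0):
    rng = random.Random(seed)
    columns = sorted(data)
    population = [
        (rng.choice(sorted(OPS)), rng.choice(columns), rng.choice(columns))
        for _ in range(population_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=lambda f: fitness(f, data, target), reverse=True)
        survivors = population[: population_size // 2]  # keep the best half
        history.append(fitness(survivors[0], data, target))
        children = [mutate(rng.choice(survivors), columns, rng) for _ in survivors]
        population = survivors + children
    best = max(population, key=lambda f: fitness(f, data, target))
    return best, history


# Made-up rows, for illustration only.
data = {
    "Glucose": [85, 168, 122, 140, 99, 155],
    "BMI": [26.6, 33.6, 28.1, 43.1, 23.3, 35.0],
    "Age": [31, 50, 33, 33, 21, 45],
}
target = [0, 1, 0, 1, 0, 1]
best, history = evolve(data, target)
print(best, round(history[-1], 3))
```

Only a handful of candidates exist at any time, and each generation keeps the best ones and perturbs them. That is the contrast with generating every possible feature up front and sorting through the zoo afterwards.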

The AUC on the test set is 0.85148, which is higher than the validation/ensemble score. This suggests that Driverless AI generalized well to data it has not seen.

Model Documentation

Can I get the documentation on the winning model and features, please? Click on Experiment Summary and find report.docx, which is written for a data scientist.

Some more screenshots of what you can find inside.

Model Explainability

The screenshot below shows the Machine Learning Interpretability dashboard, which is derived from the final predictions. The explainability tool is model-agnostic and uses K-LIME and LIME-SUP to build surrogate models and explain predictions with reason codes. It also shows the final global variable importance on the top right: Glucose and BMI (Body Mass Index) come out on top :). Shapley values are tucked inside as well. For individual predictions, the reason codes are local and may sometimes differ from the global importance, which helps explain to the business what the model found.
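To illustrate how per-prediction reason codes can be derived, here is a small sketch that computes exact Shapley values for one row by averaging each feature's marginal contribution over every feature ordering. The scoring function, its coefficients, the patient row, and the baseline row are all hypothetical; this is not the model from the experiment, and real tools use approximations because the exact calculation grows factorially with the number of features.

```python
from itertools import permutations


def shapley_values(predict, row, baseline):
    """Exact Shapley values for one row, relative to a baseline row.

    Features not yet "revealed" in an ordering keep their baseline value.
    """
    features = list(row)
    totals = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = row[name]
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    # Hypothetical risk score with one interaction term.
    return (
        0.004 * x["Glucose"]
        + 0.01 * x["BMI"]
        + 0.002 * x["Age"]
        + 0.0001 * x["Glucose"] * x["BMI"]
    )


row = {"Glucose": 150.0, "BMI": 35.0, "Age": 50.0}       # made-up patient
baseline = {"Glucose": 120.0, "BMI": 30.0, "Age": 35.0}  # made-up reference
contributions = shapley_values(toy_model, row, baseline)

# Reason codes: features ordered by the size of their contribution.
reason_codes = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(reason_codes)
```

The contributions add up to the difference between the prediction for this row and the prediction for the baseline. That property is what makes them usable as reason codes for a single customer, even when they differ from the global variable importance.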

Scoring Pipeline

OK, we have a great model that I can explain to the business. How do we deploy it in production? You can simply download the Python or Java scoring pipeline from the finished experiment page. The scoring for the individual algorithms, the final ensemble scoring, and the feature engineering are all part of the code, which can be loaded into a CI/CD pipeline or an existing package distribution framework.
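The key property of a scoring pipeline is that the feature engineering and the model travel together as one artifact. The sketch below shows that idea in miniature with a hypothetical JSON spec and a tiny logistic scorer. The spec layout, the `load_scorer` helper, and every number in it are assumptions for illustration; the scoring pipeline you download from Driverless AI has its own format and API, which this does not attempt to reproduce.

```python
import json
import math

# Hypothetical serialized pipeline: preprocessing and model in one artifact.
PIPELINE_JSON = json.dumps({
    "features": ["Glucose", "BMI"],
    "means": {"Glucose": 120.0, "BMI": 32.0},
    "scales": {"Glucose": 30.0, "BMI": 8.0},
    "weights": {"Glucose": 1.1, "BMI": 0.6},
    "intercept": -0.8,
})


def load_scorer(spec_text):
    """Return a score(row) function built only from the serialized spec."""
    spec = json.loads(spec_text)

    def score(row):
        z = spec["intercept"]
        for name in spec["features"]:
            # Feature engineering step: standardize the raw column.
            standardized = (float(row[name]) - spec["means"][name]) / spec["scales"][name]
            z += spec["weights"][name] * standardized
        # Model step: logistic link turns the linear score into a probability.
        return 1.0 / (1.0 + math.exp(-z))

    return score


score = load_scorer(PIPELINE_JSON)
probability = score({"Glucose": 150, "BMI": 40})
print(round(probability, 4))
```

Because the downstream application only ever calls `score` on raw rows, regenerating the artifact after a retrain does not require touching the calling code.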

Conclusion

We saw how to build a world-class, highly accurate model without writing a single line of code. The model should be equal to or better than what you can find in blogs, and it was created in about 13 minutes. The tool did all the algorithm selection and feature engineering with just the default settings! We were able to create a document with the steps, generate code to go to production, and explain the model with a dashboard. This is what industrial-strength automatic machine learning looks like today. You can convert business requirements into production-ready models from the data you have already collected, and avoid time- and resource-consuming feature engineering and model tuning.

Driverless AI is available on-prem, on all the cloud providers and on partner hardware.

Karthik Guruswamy

Karthik is a Principal Pre-sales Solutions Architect with H2O. In his role, Karthik works with customers to define, architect, and deploy H2O’s AI solutions in production to bring AI/ML initiatives to fruition. Karthik is a “business first” data scientist. His expertise and passion have always been around building game-changing solutions by using an eclectic combination of algorithms drawn from different domains. He has published 50+ blogs on “all things data science” on the LinkedIn, Forbes, and Medium publishing platforms over the years for the business audience, and he speaks at vendor data science conferences. He also holds multiple patents around desktop virtualization and ad networks, and was a co-founding member of two startups in Silicon Valley.
