AI and AutoML: Debunking Myths
Our world is changing rapidly, and that implies many organizations will need to adapt quickly. AI is unlocking new potential for every enterprise. Organizations are using AI and machine learning technology to inform business decisions, predict potential issues, and provide more efficient, customized customer experiences. The results can enable a competitive edge for the business. AI empowers data teams to scale and deliver trusted, production-ready models in an easier, faster, more cost-effective way than traditional machine learning approaches.
AI and AutoML are not magic, but they can be transformative. Find out how at this virtual meetup. Get practical tips and see AutoML in action with a real-world example. We’ll demonstrate how AutoML can augment your Data Scientists, supercharging your team and giving your organization the AI edge in record time.
Speakers’ Bio:
James Orton: He has over a decade of experience in analytics and data science across a number of industries. He has managed data science teams and large scale projects, before more recently launching his own startup. His vision for AI and that of H2O.ai were so closely aligned, it was a fortuitous opportunity for James to join H2O.ai in the Australia and New Zealand region.
Read the Full Transcript
SK: Wonderful. Looks like the audio is working fine. Again, a very warm welcome. Thank you for joining us today for our webinar titled AI and AutoML: Debunking Myths. I’m on the marketing team here at H2O.ai. I’d love to start off by introducing our speaker, James Orton. He has over a decade of experience in analytics and data science across a number of industries. He has managed data science teams and large scale projects, before more recently launching his own startup. His vision for AI and that of H2O.ai were so closely aligned, it was such an opportunity for James to join H2O.ai in the Australia and New Zealand region.
Before I hand it over to James, I’d like to go over the following housekeeping items. Please feel free to send us your questions throughout the session via the questions tab in your console. We’ll be happy to answer them towards the end of the session. Number two, this presentation is being recorded, a copy of the recording and slide deck will be available after the presentation is over. Without further ado, I’d now like to hand it over to James.
James Orton: Thank you, SK. I’ll get started. So first of all, I just wanted to say thanks to all the attendees for making the time and giving your attention to this virtual meetup. Today, we’re going to talk about AI and AutoML, which in some contexts can be quite divisive terms. On the one hand, people will say AI, AutoML is the panacea for all our problems. Whereas on the other hand, you’ll have people suggest it’s just hype or snake oil, or something like that. But what’s the reality? Let’s try and unpack that a bit in this meetup. Let’s unpack the terminology and some of the acronyms here. We can think about some of the cautions, understand why this is more important now than ever, and actually see it in action. So we’ll have a look at our product, Driverless AI, in action and hopefully, at the end of all of that, we’ll provide you with some insights and debunk some myths for you, whether you’re an AI practitioner, a leader or just interested in the field.
So when I was putting this slide deck together I wasn’t sure if we’d have the video on and I’m very glad that we do so that we get some degree of physical connection. So this is me, I’m James, I am based in Melbourne and covering the Australia and New Zealand region and it’s a pleasure to connect with you digitally now and hopefully, physically again sometime in the future with a real face-to-face meetup.
And who are H2O AI? So this is just a snapshot of who we are. So we are very established. We have been around for eight years and you can see some of the heavyweight investors that we have there, on the top right hand side. We started with open source solutions, which we continue to invest in the development of, but we now also have some enterprise products like Driverless AI and Q. And we’ll talk more about those in some detail later on. We’re relentlessly customer focused. So we’ve got customers around the world who we work closely with to ensure the successful delivery of their use cases. We’re global, as I mentioned, myself in Melbourne, but we have offices and teams across the Americas, Europe, Asia, we’re truly a global organization.
Well, that’s who we think we are but what’s the broader opinion? So there’s a slide here, which shows you the Gartner Magic Quadrant for 2020. We’re proud to be the only AI company in both the data science and machine learning quadrant and the cloud AI quadrant. And we’re recognized as a visionary in both. For the data science and machine learning quadrant, Gartner says we have the strongest completeness of vision. And we’re very proud of that. And you can see some of our other strengths on those slides around automation, explainability, high performance and our components. And as we mentioned before, that excellent customer support. Today we’re going to focus on the automation components. But we’ll touch on some of these other things too.
So that’s a bit about who I am and who H2O AI are. And hopefully that’s given you a bit of an intro and you feel welcome at this meetup. But now let’s get into the concepts themselves. What is AI? So it’s big, it’s evolving. I’m aware you’ve heard this term mentioned in many environments. And there are many definitions and debates about what AI has been, is and will be. We’re going to move away from that debate and focus more on some of the practical applications that we see today. So straight to it, here are some use cases that H2O AI works on with its customers now.
So AI, and in particular machine learning, being the most prominent component of AI for business applications, is transforming every industry, as you can see from some of the use cases across some of the industries that we work on, on that slide there, on the screen. It’s changing the dynamics of things like credit risk today through to marketing optimization problems. And really the limit of how this can be applied is our imagination. For example, in Australia, we work with customers on things like document classification, using natural language processing techniques. I was also talking to a prospect recently about predicting water quality for swimming. Will it be safe to swim at your favorite beach on the weekend? This could be a really cool problem to solve, and there are many others.
Sometimes a useful way to think about AI for your business is about going from being reactive to being proactive with your business needs. And there’s some examples on the screen there. But for your context and your business they may be different particular use cases.
And on the slide previously, we didn’t talk about health care, but it is a big focus for us, especially now. Our work around COVID, sorry, I skipped ahead by mistake there. [inaudible]. Our work around COVID-19 covers some AI use cases you can see here on the screen, and we’re making a real difference in the fight against the virus with things like predicting staff resource requirements, looking at population risk, and supply chain optimization for healthcare products.
So hopefully that’s given you a bit of a practical flavor of how AI can help. But what is AutoML? So we’ve got the definition from Wikipedia, there on the screen. And to give it its full name, what we’re talking about here is automatic machine learning. And Wikipedia talks about it as being AI for automating the ML process. Internally at H2O AI, we like to call this AI to do AI.
But what is the ML process? So we’ve talked about automating that process, but what really is it? This is how we like to think about it. So we’ve broken it down here. There are various versions of this I’m sure you’ve seen around the internet. It’s fairly complex, and it’s often highly iterative. We go from one step which provides some insight or some new discoveries, which means we go back, we revisit a previous step or one prior to that. Really AutoML can be the automation of any part of this process. And it’s important to understand, what is being automated and how it’s being automated, when you think about AutoML solutions. Because the reality is, not all AutoML solutions are created equal. There are many players in the space, some relatively new. There are enterprise solutions, and they’re open source solutions. But they don’t all do the same thing. And they don’t all do it in the same way. In some cases, there are vast differences in terms of their capability. And we’ll unpack that a bit further as we proceed.
So hopefully, we’ve given you some ideas and context about what AI is, ML and AutoML. But the question remains, why would you want to automate components of the ML workflow? So I think, the first thing to consider here is that the world has changed. Most of you are tuning in from your homes, and I’m here broadcasting from my little study. It feels like I’ve been here for a very long time. So apologies, also, if you can hear the kids in the background. We’re all used to these endless Zoom meetings, endless Skype meetings, it’s the new reality that we live in.
And you can see there on the screen, there’s a picture that I took when I went to the supermarket, I think, a couple of months ago now, and sent to my wife, that we couldn’t get the pasta that she wanted. But that was strange at the time, it seemed odd. But now it’s commonplace. It’s part of the reality that we live in. And there are many images like this, which would have seemed strange only a couple of months ago, and now it’s very common to see.
But more broadly, our habits have changed, the economy has changed. And in the media, they’re often talking about how we’re entering a new normal. So as a business, how do you navigate this environment? Well, there’s a role for AI and AutoML to help you. I really like this quote, as we go through this unprecedented time, many of us reflect on how we can make this a better world, some of the advantages potentially of lockdown as well as the disadvantages. There’s an opportunity for AI here, we believe to affect real change in the world.
So that’s maybe the environment that we’re sitting in. But what’s the role here for AI and AutoML within this? Well, I guess, the first thing that we should note is that existing AI and ML models in production may no longer be working well. The data that they were trained on will be very different from the world that we’re living in now. And as we look to deploy models now, they will need to be closely monitored and retrained as this volatility continues. As we exit this lockdown, there’ll be many changes, I’m sure, in the coming months. Fundamentally, we’ll need to rapidly develop AI with a high accuracy, whilst maintaining responsible AI through good documentation and best practice MLI, or machine learning interpretability, techniques.
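Editor's illustration: one common way to notice that production data no longer resembles the training data is the Population Stability Index (PSI). The sketch below is a minimal, standard-library example and is not something shown in the talk. The function name, the bin count, the sample values, and the 0.25 alert level (a widely quoted rule of thumb) are all assumptions chosen for illustration.

```python
import math

def psi(expected, actual, n_bins=5):
    """Population Stability Index between a training sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the training range; fall back to 1.0 if all values are equal.
    width = (hi - lo) / n_bins or 1.0

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the training range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical bill amounts: the training sample, an identical sample, a shifted one.
train_bills = [100, 120, 130, 150, 170, 200, 220, 250, 300, 320]
same_bills = list(train_bills)
shifted_bills = [v + 150 for v in train_bills]

psi_same = psi(train_bills, same_bills)        # identical distributions
psi_shifted = psi(train_bills, shifted_bills)  # clearly shifted distribution
```

A monitoring job could compute this per feature on a schedule and flag the model for retraining when the value crosses the chosen alert level.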
Critically, AutoML, when applied correctly, can help us succeed here. And we’ll go into this in some more depth as we go on. Before we do that, I think it’s important to think about some of the cautions here too. When we’re talking about AutoML, we’re not talking about replacing data scientists and we should be suspicious of tools that suggest they do. Obviously, the data scientist’s role has evolved rapidly over the last few years and it will continue to evolve. But fundamentally, we believe data scientists are essential to ensuring successful AI projects and transformation.
Data is messy, right? We’ve all seen that. And it’s often context and problem specific. So automating this workflow from the start of the project is probably unlikely. We’ve all, I think, heard the 80/20 statistic, that 80% of the effort is in data cleaning. I’m not sure if that’s true across all projects. In some of the projects that we’ve seen, it’s less; in some it’s a lot, obviously. But we should be aware there will be some manual effort here. Once we reach production, yes, it’s likely that we can automate that pipeline. But from the beginning of the project, it’s good to be mindful of some effort in data cleaning.
However, we should also consider there are elements of this workflow that can be automated. And talk to us about some of the exciting work that we are currently doing around things like data augmentation, some really interesting and exciting products that we’re working on now and hopefully we can talk to you about those in some upcoming meetups too. And domain knowledge, right? It can be very specific to your business or use case and is often critical in the success of an AI project. Domain knowledge can be what gives your organization the competitive edge. So always look for an AutoML solution that allows you to combine your domain knowledge with the AutoML tool.
And finally, the last caution that I want to mention on this slide, it’s culture. So culture can be as big or potentially bigger than the technology challenge. So we need to be very mindful of this. I guess the positive here, and hopefully we’ll highlight this as we go through, by automating some of the ML components we can focus or you can focus as a business on creating that AI culture, embedding data literacy within your organization.
So with that in mind, it’s good to consider how AutoML sits within the broader framework for successful AI. And this is how we at H2O AI like to break this down. Firstly, we need to create a culture. Think about data literacy. Data is a team sport. Ask the right questions. Not every question will lend itself to an AI or machine learning answer. So be mindful of this. A favorite quote of mine from John Tukey, the father of exploratory data analysis says, “An approximate answer to the right question is worth a great deal more than the correct answer to an approximate problem.” So think about that as being very important for your project.
Also, connect to the larger community on AI. Personally, I could not have been successful in data science without engaging in the Sydney and Melbourne data science meetup groups and being mentored by a number of individuals within those. That’s been fundamental to my development. As a company, being embedded in the open source community around the globe has been fundamental to our success too. And we believe in your projects as well, that will be important, engage with the community.
So finally, point four is when we get to data and technology. And there is a lot to consider here. Where’s the data? Should we run in the cloud? Where are we going to deploy? What other data sources? Really think through what’s optimal for your business. And it’s often different in different business contexts. We encounter this as we talk to prospects and customers in the region, some of which are public cloud, some of which still do a lot of work on-prem and combinations or hybrids of these, is very much the norm. So specific to your own business.
Also, point five, very important and the key area for us, is trust in AI. Think about explainable AI and responsible AI, understanding that models need to be trusted by the business for them to be successfully deployed. And there may also be a responsible element in terms of your model, if it’s treating sensitive groups in certain ways. But always remember that trust is a critical component to any successful AI transformation. And overarching, I guess all of this, is that AI still requires humans to be in the loop, to be successful.
So with that in mind, and taking the above and condensing it somewhat into a slightly simpler slide, perhaps we can think about the AI project in this context. So we start with the framing, the culture and community that we’ve talked about. The asking the right questions, then we’ve got the technical execution, that’s the do ML part. And then we move on to building trust and moving to deployment. If we think about this in a manual ML context, it may look something like this.
So why do so many projects not reach deployment? I’ve seen stats that vary anywhere between 50 and 90% of projects never reaching deployment. Who knows what the true number is, but it’s definitely a challenge. And I think any of us working in this field have seen that. But why is that? Well, this is something that I’ve experienced personally as well, as we go through projects. A lot of effort is devoted to doing the technical execution in the right way. And that’s a valid thing to do. However, it can leave very little space for having real impact, and your project is unlikely to be successful if you don’t devote enough time to that. So that’s the deployment component, the trust component, the thorough documentation component of your project.
So hopefully with the right AutoML, what we will see is a rebalancing of those components, and allowing more space more time to have real impact in terms of our AI projects and transformation. So hopefully now we can see how AutoML might be useful for us. But let’s actually talk about it. So let’s have a look at some of the H2O’s platforms and how they can help us in this area.
So here we are, here are our platforms. As we mentioned before, we started with open source. Our intent was to create a thriving community for AI and ML. We built interfaces in R and Python and in interactive notebook environments. We optimized for big data, so Hadoop and Spark, which you can see through Sparkling Water there on the screen, and we learned a lot from engaging in that community and we saw a lot of the challenges that data scientists were having. They were spending a lot of time on feature engineering, an intuitive, time-consuming task. And we knew that this could be automated to make data scientists more productive. And that’s where Driverless AI came in for us. We architected Driverless AI from the ground up and worked to remove those obstacles we talked about for data scientists. And much like open source, it builds on those capabilities, it works with Hadoop, it interfaces into Python and R, and this is the product we’re going to demonstrate today when we get to that point shortly.
And our latest platforms too, to mention. H2O Q, you can see there on the right hand side of the slide. This is a new and innovative platform for business users. Ask your questions and get answers. And ModelOps as well: how do we manage our models in deployment? Stay tuned for more on these products in upcoming webinars from us.
So focusing in on Driverless AI, this is how we think Driverless AI can help. So as we’ve mentioned before, it can help speed up the process, right? We’re talking about getting into a production ready state in hours versus this previously taking months. It augments your team, from your junior data scientists through to your expert data scientists, they can utilize Driverless AI to build a good working ML model. And it’s enterprise ready and it’s ready to scale with you.
So let’s dig a bit deeper on Driverless AI. So here, again, is a slightly more simplified version of the ML workflow and we can see where Driverless AI fits in. So we do our ETL and then into Driverless AI, we bring some data in on which we can do data quality and transformation, modeling, model building, and go through that very highly iterative process. And we’ll show this to you in the actual tool as we go forward. Just unpacking and digging a bit deeper into what it’s actually doing, what are the automatic components of Driverless AI?
So what it’s doing, it’s doing feature engineering. So we’ve talked a little bit about the challenges here, this is a time consuming task. It requires advanced talent in data science. And it’s more than just simply dealing with missing values, but really going to generate new combinations of features that expose the best signal and ultimately improve model accuracy. It’s about model building, so which algorithm or which combinations of algorithms, how to tune parameters based upon these algorithms, and then as we move through to a deployment state, what ensembles do we produce that will make good predictions in the world?
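Editor's illustration: to make "generating new combinations of features" concrete, here is a minimal standard-library sketch that derives pairwise difference and ratio features from numeric columns. It is not Driverless AI's feature engineering, which the talk does not detail. The function name and the column names are hypothetical.

```python
from itertools import combinations

def add_interactions(row, numeric_cols):
    """Return a copy of row with pairwise difference and ratio features added."""
    out = dict(row)
    for a, b in combinations(numeric_cols, 2):
        out[f"{a}_minus_{b}"] = row[a] - row[b]
        # Guard against division by zero; None marks an undefined ratio.
        out[f"{a}_over_{b}"] = row[a] / row[b] if row[b] else None
    return out

# Hypothetical record with three numeric columns.
row = {"bill_amount": 500.0, "payment_amount": 100.0, "credit_limit": 2000.0}
engineered = add_interactions(row, ["bill_amount", "payment_amount", "credit_limit"])
```

Here the ratio of bill amount to credit limit acts as a utilisation-style feature. An automated tool would generate many such candidates and keep only those that improve the validation score.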
Once you’ve done that, once we’ve been through that technical process, we need to work with our development and operations teams to push this into deployment. So having pipeline generation artifacts is highly valuable. But again, how do we do this quickly with the all-important trust and responsible AI that we’ve talked about? Well, through Driverless AI we have a process of automatic documentation and thorough MLI (machine learning interpretability) techniques that we can look at, which we can use with business users to build trust in the models that we have delivered.
So as we’ve mentioned, it’s important to know that for H2O AI, trust and responsible AI are front and center for us. We believe that, but this is what Gartner says about us, we’re setting an example in this area. Also to the right, you can see there’s a booklet that you can download from our website to get some great insights into machine learning interpretability if this is a new field for you.
So we talked about bringing your own domain knowledge, and this is how we do it within Driverless AI: through custom recipes. So you can see on the screen there, that’s some of the 140 plus open source recipes that we currently have that we can bring into Driverless AI. But remember, as well as these open source recipes, with our customers, we can develop recipes for specific use cases. So that might be a particular type of model that you guys like to use, or a particular scorer which is relevant for your domain, or particular types of data transformations which make sense within your use case. Or you can develop them yourselves and develop your own IP on top of Driverless AI. Our culture and our way of working is to work with our customers for them to develop their own AI and truly democratize AI.
So hopefully, what we’ve laid out here and becomes more clear is that Driverless AI is highly flexible and customizable to your problem. We’re now going to go and jump into a demo, which will hopefully bring this to life a bit more for you guys.
So before we do jump into the demo, just a bit of background in terms of the data and the use case that we’re going to use. So we’re going to look at a problem of credit card defaults. So very timely right now. I’m sure you’ll agree, and within the data set that we’re going to use, we’ve got some demographic features about the credit card holder, as well as details of the history of bills, payments, account statuses, and that type of stuff should be a good data set to demonstrate the capabilities of Driverless AI.
So if you just bear with me one second, we can just jump into the tool now. So when we come into the tool, this is the screen that we typically see, this is the dataset screen. So I’ve loaded this data that we just talked about as a CSV, but we have many connectors into our data environment. Here’s a few of the ones that we have set up in this environment. But we also connect to many other data sets. So if you want to connect to snowflake or to hive or, a number of different ways you can connect to data is available. In this case, we can have a look at this one here. So you’ve got this credit card data that we’ve just talked about.
And from the data sets page, the first thing we can do is actually, let’s have a look at this data and try and make a bit of sense of what we’ve got here. So as I mentioned, we’ve got some demographic features here within the data set. We’ve got information about the limit balance on the credit card, and an ID column there at the beginning. These payment statuses we can see here, PAY_0 through to PAY_6. Bill amounts, so what was the credit card bill, and the payment amounts here too.
And then finally, towards the end, we have whether there was a default in the next month. So you can see, predominantly, they don’t default, that’s what you would expect, and in some cases they do. So you can see from this screen, we’ve got a good summary of what the data is that we’ve got here. We’ve got an idea of the distributions and the number of records and some of the metrics around those continuous variables. We can also actually look at the data itself so you can see some of those rows within the data set that we have to really get an understanding of the data that’s loaded into the Driverless AI here.
You can also see, I’ll just click back to them. As I’ve mentioned before about this recipes approach that we have within Driverless AI, straight from here, we can modify this data by recipes. So we can put Python code straight into here to make changes. Let’s say we wanted to do some transformations or drop some data or whatever alterations we want to make to the data, we can do it straight from here.
So the next functionality that I’m going to show you here on Driverless AI is our ability to quickly auto visualize the data. So typically, if you think about the EDA phase within a machine learning project, this can take a long time. So in this case, we’ve got 25 columns. You think of the different combinations of visualizations that are possible, the list is very, very long, and if you’ve used open source before, maybe in Python you’ve done a pairs plot, which ends up being a massive diagram that is very hard to interpret. What we see here is a summary of those visualizations which are relevant to the machine learning problem that we’re trying to deal with.
So rather than looking at every combination of every plot, it’s surfacing, intelligently surfacing, those that might be relevant in this use case. So things like correlated scatter plots, as we can see here, some correlation between bill amount one and bill amount two. We can download these if we want to also. For a more junior data scientist, we can click on the help here. You can get some ideas about what these visualizations are telling us or what they’re showing. Things like skewed histograms, we can see those too, so we’ve got a number of examples there in terms of bill amounts. Generally they’re low, with a few high bill amounts within that. Through to the correlation graphs, and you can see the correlation between each of the variables here. We can filter by just the high correlations if we want to, or click on a particular variable and look at how that interacts with other ones. So quickly, we can get insights out of here as a data scientist and understand what might be relevant for our particular machine learning problem.
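Editor's illustration: the idea of surfacing only the interesting plots, rather than every pair, can be sketched as a filter on pairwise Pearson correlation. This is a minimal standard-library example, not the tool's actual selection logic. The column names, sample values, and the 0.8 threshold are assumptions.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def strongly_correlated(columns, threshold=0.8):
    """Return (col_a, col_b, r) for pairs whose |r| meets the threshold, strongest first."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Hypothetical sample of three columns from a credit card data set.
columns = {
    "BILL_AMT1": [100, 200, 300, 400, 500, 600],
    "BILL_AMT2": [110, 190, 310, 390, 520, 590],
    "AGE": [41, 23, 35, 52, 29, 30],
}
flagged = strongly_correlated(columns)
```

With 25 columns there are 300 possible pairs, so a filter of this kind is what keeps the list of scatter plots short enough to review.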
We can also split the data if we needed to here. I already have a train test split. So in this case, we’re going to move straight to prediction, the heart of the engine of Driverless AI if you like. So as we come into this screen, you can see initially we’ll get a little tip which we can take a tour. So for your more junior data scientists we can quickly get to know the tool. If you haven’t used the tool before it can be very handy. We won’t do that now in this case. So we’ve loaded up our train data set here, we’ll give this a name, let’s call it demo for AutoML and AI. We can select our target column here, default payment next month. So Driverless AI is already intelligently looking through our data and making suggestions about the AutoML strategy that we might want to take. So it’s told us a few things, a number of rows and number of columns for instance. We can add in a test data set that we had on the previous stage. So let’s do that for best practice.
So you can see down here at the bottom, this is some of the core of the engine that I’d like to take you through and show you today. We have these three dials in terms of accuracy, time, and interpretability. So the way we like to think about this, is we turn accuracy up, we’ll get some more complex components within the execution strategy over here on the left hand side and I can show you that just so you can see. So you see it’s currently set at five. But the final pipeline, we’re doing an ensemble of eight models. But if we turn that up, we’ll see some things change within this execution strategy over here. You can see now an ensemble of 15 models for instance. So we’ll just take that five for now.
Time, so this is really, we’re using a genetic algorithm approach to our AutoML Engine. And just by turning up time means that we will run for longer in terms of searching for the optimal model. And in the interest of this demo, and making sure this runs in the next half an hour for you guys, we might just turn that down a little bit and see.
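Editor's illustration: the talk says the engine uses a genetic algorithm and that the time dial lets the search run longer. The toy below shows only the general shape of such a search (selection, crossover, mutation, with parents kept so the best score never drops). It is not Driverless AI's algorithm; the fitness function, the "useful" feature indices, and all parameters are invented for illustration.

```python
import random

N_FEATURES = 8
USEFUL = {0, 3, 5}  # hypothetical: indices of features that carry signal

def fitness(mask):
    """Toy score: reward useful features, lightly penalise every selected feature."""
    hits = sum(1 for i in USEFUL if mask[i])
    return hits - 0.1 * sum(mask)

def mutate(mask, rng, rate=0.2):
    """Flip each bit independently with probability `rate`."""
    return tuple(bit ^ 1 if rng.random() < rate else bit for bit in mask)

def crossover(a, b, rng):
    """Single-point crossover of two parent masks."""
    cut = rng.randrange(1, N_FEATURES)
    return a[:cut] + b[cut:]

def evolve(generations, pop_size=12, seed=0):
    """Run the search; return the best mask and the best score seen at each step."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(N_FEATURES)) for _ in range(pop_size)]
    best_per_gen = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_per_gen.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children  # parents survive, so the best score never drops
    pop.sort(key=fitness, reverse=True)
    best_per_gen.append(fitness(pop[0]))
    return pop[0], best_per_gen

best_mask, history = evolve(generations=10)
longer_mask, longer_history = evolve(generations=20)  # same seed, more "time"
```

Because the parents are carried over unchanged, a longer run with the same seed can only match or improve on a shorter one, which is the trade-off the time dial exposes.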
And then interpretability. So the big focus for us. So as we turn interpretability up, what we’ll notice is that less complex feature engineering will be done, and perhaps we might see some constraints added as well in terms of the modeling processes that we go through. So we have those three dials there, but don’t panic if you’re an advanced data scientist, we also have much more control that we can allow into this execution strategy that we’re looking at over here on the left.
Now, the other things to note here, you’ll see that Driverless AI has identified this as a classification problem. So binary, one or zero, on whether we’re going to default on the next payment. We can make this reproducible if you like, and in this case, we’ve also got GPUs enabled. So Driverless AI runs very well on big data and many of the algorithms that we have within Driverless AI are optimized to run on GPUs. So in this case, we’re going to use GPUs. And our scorer over here. So Driverless AI suggested we use AUC but out of the box, we’ve got a number of other scorers here. And as I mentioned before, with our custom recipes, we can add in other scorers as well and we’ll have a look at that in a minute as we go through.
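Editor's illustration: AUC, the scorer suggested in the demo, can be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The minimal standard-library sketch below computes it directly from that definition; the labels and scores are made-up values, and a production scorer would use a faster rank-based method.

```python
def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = defaulted next month) and model scores.
defaulted = [0, 0, 1, 0, 1, 0]
predicted = [0.10, 0.40, 0.35, 0.20, 0.80, 0.05]
auc_value = auc(defaulted, predicted)  # 7 of the 8 positive/negative pairs are ordered correctly
```

Because AUC depends only on how examples are ranked, it is a common choice for imbalanced binary targets such as credit default.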
So here, we have our expert settings, this is where your advanced data scientists can come in and really edit these as to their particular problem and really get to really supercharge the modeling process. So you can look at things around the experiment, the models that we’re using, the feature engineering that we’re doing. Let’s say we were doing a time series problem like the one I talked about earlier, in terms of water quality, we have a whole set of time series capability here, and natural language processing too. So an example of the document classification problem that I told you about earlier, perhaps there’d be settings here that you’d want to adjust. And then here are recipes. So again when we talk about putting your own IP into Driverless, this is the heart of how we do this, and hopefully I’ll be able to show you an example of this to make this come to life. So it might be specific transformers, specific models or specific scorers.
So in this case, let’s look at models. Out of the box, Driverless AI is using a number of models here. And there’s a custom one down there that I’ve loaded previously, but in this case, let’s actually go to the official recipes that we talked about previously. So there it is on our GitHub, the 140 plus recipes that we have for Driverless AI. Let’s see what the number is currently. So we’re currently at 142.
And you can see a number here, lots of ones around data, we’ve got examples about how to write a recipe, if that’s what you want to do in your case. Then we’ve got specific ones for specific models, natural language processing recipes, Time Series, Scorers, as I mentioned before, so the number of different ones for classification, Binary and Multiclass, or Regression Scorers, Transformers, Augmentation, Datetime, and the list goes on. So you can see very many custom recipes that you can use of ours. These are all open source, or you can develop your own, or we can develop ones with you that are specific to your business problems.
So let’s go back up to the top. And perhaps we’ll take one of these simple model ones just for demonstration purposes. So lots of people like Random Forest as their model, so why don’t we take that one and bring it into our Driverless AI? So we’ll just pull the raw GitHub link from here, we’ll go back to Driverless AI and load that up. So Driverless AI will just run through some checks here to make sure that it’s coded in the right way to work within Driverless AI. Once that’s run… We’ll just give it a minute. I’ll just have a drink.
Now what we can see, if we go back to here, to recipes: we now have Random Forest in our execution strategy along with the other models that we’re using here. And you can see, over here on the left hand side, as well as these other models, we now have Random Forest within your execution strategy. So imagine that, you can turn things off, turn things on, you can have full control and flexibility of Driverless AI in terms of what it uses for the particular AutoML problem you want to solve.
So at that point, we’ll launch this experiment and we can see Driverless AI in action. So now the experiment is starting, we’ll see a number of notifications come up. So this can be useful for your novice data scientists but also for your more advanced data scientists. Let’s say you forgot to check something in particular or for some reason you missed something, Driverless AI is going to let us know what’s going on. So in this case, it’s telling us that the data is slightly imbalanced. We looked at that when we looked at the details page, so we knew that already, but that might be useful information. We left the ID column in, so it’s telling us it has dropped that; if that’s not expected, then we can go back and change some of the settings to avoid that. It’s telling us about some shifts in the data between train and test. It might tell us if we have target leakage, or a number of other things within notifications.
So some good checks and balances there. What we’re going to do in this case, in the style of a cooking show, is move over to an experiment we’ve previously run. We’ll come back to this one in a minute, it shouldn’t take too long to run. So we’ll leave it running and come back. And let’s have a look at this one that we ran previously and we can explain some of the outputs you can see here.
So fairly simple strategy. Again, you can see in this one, we did an accuracy of five, time of two, and interpretability six. And here you can see the iterations that it went through, each of these dots being a model that it’s built. So as you can see we started over there with a LightGBM with 23 features, and a number of other models were built here. You can see a GLM down there, which didn’t score quite as well. And then we go forward through our AutoML strategy to reach our final model, which is over here.
Other things are worth noting from this screen in terms of what we’ve done through this strategy, which was, again, only a couple of minutes; you can have this time turned right up and it will take a lot longer. But even in that couple of minutes, we did 100 and 210 features from our 25 columns and ended up with 23 selected. So imagine trying to do that kind of feature engineering manually or in notebooks, that alone would take a significant amount of time. Imagine on top of that, the models that we’ve been through: we reviewed them, took forward the best ones over here and tuned the parameters, even within those components. We’re talking about a significant amount of time. And when we look at the experiments and how long that one took to run, let’s go back, that was just under five minutes of runtime in that case.
So there are other things that we’ve talked about that I’d like to show you from this screen, things that we can do. So think about this as a very quick model. You probably wouldn’t take it to deployment, but let’s say you did, or you’ve done some more thorough work and you wanted to move straight to deployment, we can do that. We can deploy locally or in the cloud with this quick button here, or we can download a scoring pipeline. So Driverless AI is very flexible in terms of how you can deploy your models. So with a C++ or a Java back-end, we can download and embed this in whatever process makes sense for you and your business.
We can also interpret this model. So as we talked about before, the trust components, we’ll set that off to interpret now as well. And I’ll just leave this to run for a little bit. This will take a little bit more time, as it’s computing the Shapley values, which can be computationally quite expensive. We’ll move back to the experiment and we’ll have a look here at some of the other capabilities that we’ve got. So the other one that I’d like to show you in this meetup is the ability to download the auto report. So we can click on this button here. That will bring it up on the screen.
So as we mentioned before, documentation for a data scientist, it can be a bit of a bane of their existence, to get to the end of a project, this takes a lot of time. You’ve been through a lot of technical effort. And then at the end, you need to document the final process that you went through. It can be very time consuming, right? So out of the box Driverless AI is documenting all of the technical components of that project. All we need to do, to make this a complete project documentation is perhaps add our business context around this.
And we’ve got a fully documented process. You can see we have things about the performance that we saw, the settings that we used, system specifications and versions, an overview of the data, the training data we used, any shifts within that data, the methodology, assumptions and limitations, and the pipeline that we went through. Carrying on, the experiment settings we used. So you can see here there’s the time, accuracy and interpretability ones we talked about, and the other ones here. More around the data sampling, the validation strategy, the models that we went through. So you can see some examples of those there. And the feature evolution itself, so we can see that and we’ve got some descriptions here of the types of transformers that were used in that strategy. And the list goes on.
And then our final model, we get an idea of what that pipeline would look like. So you’ve got a stacked ensemble here. There you go. So hopefully, what this will show you is a very thorough documentation from a technical point of view of the project that we’ve been through. Okay.
So we come back to Driverless AI here. Our interpretability has now run. So on this main screen we’ve got a good summary of what we’ve done, some of the important variables here and some of the surrogate models that we’ve built. Let’s dive in and have a look at those. Let’s hope they come to life a bit more. So we think Shapley values are a very powerful interpretability technique that we can use these days to give us a real clear indication of variable importance and also direction. So you can see these yellow ones here, this is the global importance of different variables. Just thinking back to the data that we had, you can think about whether this would make sense.
So PAY_0, the most recent payment status, seems to be an important variable in whether someone’s going to default on the next payment. Well, that seems to make sense, right? Also the limit balance, again, interesting, we might want to unpack that a bit further, but that also seems to be quite important. So those are the global ones, but we can also look locally. So let’s do that.
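To make the idea of Shapley attributions concrete, here is a minimal sketch, not Driverless AI’s implementation. It computes exact Shapley values for one record by averaging each feature’s marginal contribution over all feature orderings. The scoring function `toy_model`, the column names `PAY_0` and `LIMIT_BAL`, the weights and the baseline record are all assumptions made up for illustration.

```python
from itertools import permutations


def toy_model(row):
    """Hypothetical risk score for illustration only (not a real credit model)."""
    score = 0.1
    late = row["PAY_0"] >= 2
    low_limit = row["LIMIT_BAL"] < 50000
    if late:
        score += 0.5
    if low_limit:
        score += 0.2
    if late and low_limit:
        score += 0.1  # interaction between the two features
    return score


def shapley_values(model, row, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(row)
    totals = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)       # start from the reference record
        previous = model(current)
        for name in order:
            current[name] = row[name]  # switch this feature to the explained record's value
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: totals[name] / len(orderings) for name in features}


row = {"PAY_0": 2, "LIMIT_BAL": 30000}        # record being explained (assumed values)
baseline = {"PAY_0": 0, "LIMIT_BAL": 200000}  # reference record (assumed values)
phi = shapley_values(toy_model, row, baseline)
print(phi)
```

The attributions add up to the difference between the record’s score and the baseline score, which is what lets them be read as both importance and direction. Enumerating every ordering only works for a handful of features, which is one reason the computation is expensive on real models.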
So here we go. So we’ve got the global interpretation, but we’re now also looking locally. So for the 10th record in our data set, you can see the direction and importance of different variables for that particular record. And we can also look at partial dependence plots. So PAY_0 is one of the critical variables here. So let’s have a look at that in terms of the average prediction that we get. And you can see there seems to be some step change between one and two. So that is potentially very insightful for this particular problem.
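A partial dependence curve like the one described can be sketched in a few lines: for each candidate value of a feature, set that feature to the value on every row, rescore, and average the predictions. This is a minimal illustration under assumptions; `toy_model`, the column names and the four sample rows are hypothetical and were chosen so that the step between one and two is visible.

```python
def toy_model(row):
    """Hypothetical risk score for illustration only."""
    score = 0.1
    if row["PAY_0"] >= 2:
        score += 0.5
    if row["LIMIT_BAL"] < 50000:
        score += 0.2
    return score


def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value on every row."""
    curve = {}
    for value in grid:
        predictions = [model({**r, feature: value}) for r in rows]
        curve[value] = sum(predictions) / len(predictions)
    return curve


rows = [
    {"PAY_0": 0, "LIMIT_BAL": 30000},
    {"PAY_0": 1, "LIMIT_BAL": 200000},
    {"PAY_0": 3, "LIMIT_BAL": 100000},
    {"PAY_0": 2, "LIMIT_BAL": 20000},
]
pd_curve = partial_dependence(toy_model, rows, "PAY_0", [0, 1, 2, 3])
for value, avg in pd_curve.items():
    print(value, round(avg, 3))
```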
We also do disparate impact analysis within our machine learning interpretability. So think about sex as a protected variable. We want to have a good look at this and make sure that our model is not doing anything unfair in this case. So here you can see, for instance, whether the true positive rate or the false positive rate varies greatly between males and females, and really get under the hood of understanding how our model is treating protected groups, making sure that it’s fair and unbiased in those scenarios.
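The group comparison described here comes down to computing the same confusion-matrix rates separately for each protected group. The sketch below is a hedged illustration rather than the product’s method; the labels, predictions and group assignments are invented data.

```python
def rates_by_group(y_true, y_pred, groups):
    """True positive rate and false positive rate for each group."""
    counts = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        if truth == 1:
            c["tp" if pred == 1 else "fn"] += 1
        else:
            c["fp" if pred == 1 else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[group] = {
            "tpr": c["tp"] / positives if positives else float("nan"),
            "fpr": c["fp"] / negatives if negatives else float("nan"),
        }
    return rates


# Invented example data: 1 = default, 0 = no default.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
groups = ["male"] * 4 + ["female"] * 4

rates = rates_by_group(y_true, y_pred, groups)
print(rates)
```

A large gap in either rate between groups is the kind of signal that would prompt a closer look at how the model treats each group.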
We can also look at sensitivity analysis, so we can stress test our model and see how it will perform. So we can see here, we’re looking at the classification of positives and negatives. We could, say, adjust variables here. So we could say globally, we want to adjust PAY_0, our most important variable. What would the difference be here, if we say change that to zero? And then we rescore our data (apologies, Siri just thought I was talking to her, so I’ll just have to turn my iPhone back off) and see what the changes are here within the model. We can focus in on, say, the false positives or the false negatives, and really start to understand our model and how it’s performing and what it’s doing across our data.
We can also look at surrogate models for explainability. So we’ve got the ability to do a number of versions of LIME, or, we’ll say, decision trees, which in some scenarios are quite explainable to an end user. And assuming that the R squared is relatively high, which it is in this case, we’ve got a very complex model under the hood, but this very simple model over the top does a reasonable job of explaining what’s going on, right? So we can use this in some cases to try and explain what our complicated model is doing under the hood. So you can see again, PAY_0 at the top of the tree, that seems to make sense, these lower scores over here taking us down this branch or otherwise down this branch over here.
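The surrogate idea can be sketched as follows: fit a very simple model to the complex model’s predictions (not to the original labels) and use R squared to check how faithfully it mimics them. This is a minimal illustration assuming a one-split decision stump as the surrogate; the payment-status values and the "complex model" predictions are invented numbers.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the threshold that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, (lambda x: left_mean if x < threshold else right_mean)


def r_squared(actual, fitted):
    """Share of variance in `actual` that `fitted` reproduces."""
    centre = mean(actual)
    ss_total = sum((a - centre) ** 2 for a in actual)
    ss_residual = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    return 1 - ss_residual / ss_total


# Invented data: payment status per record and a complex model's predicted default probability.
pay_status = [-1, 0, 0, 1, 2, 2, 3, 4]
complex_preds = [0.10, 0.12, 0.15, 0.20, 0.65, 0.70, 0.72, 0.80]

threshold, surrogate = fit_stump(pay_status, complex_preds)
r2 = r_squared(complex_preds, [surrogate(x) for x in pay_status])
print(threshold, round(r2, 3))
```

A high R squared here only says the simple model tracks the complex one on this data; it says nothing about how accurate either model is against the true outcomes.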
And we can look at that all in dashboard form and start to build a story, both locally and globally, in terms of what our model is doing. We can also download reason codes. So we noted the Shapley values before, but we can actually download a CSV from here which will tell us, row by row, the weight of each variable within the decision that’s being made. So a very, very powerful machine learning capability within the tool that we can see here.
Oops. So the last thing that I wanted to show you whilst we’re looking through Driverless AI and the AutoML capabilities within it is the projects tab that we have over here. So we’ve created a project for this credit card data, and there’s a couple of models that I’ve run previously here. And we can start to compare these. So let’s take these two and have a look at how they perform. You might have a number of different iterations, we talked about that iterative process that you’re going through, and you can have a look at these models, the settings that were taken, the features that we went through. So you can see a fairly simple one over here on the left hand side, where we engineered just under 200 features, but in this case, almost 3000. So a much more complex, longer model to run in this case, but what does that mean in terms of performance? Well, we can scroll down and really get a sense of that. We can look through the variable importance, the confusion matrix for these, and all of the metrics that we see down here.
Let’s go back finally, and have a look at that experiment that we kicked off at the beginning of the demo. You can see that’s run now, just seven minutes for that one to run. So there you can see the evolutionary process that we went through. Over here on the left hand side, the summary of the strategy that we took, and down there on the right hand side, some of the variable importance. Also the various useful metrics for classification machine learning problems: you can see we’ve got our ROC curve and the area under the curve here. We can look at our precision recall, lift, gains and K-S. So there is full ability to view this in terms of the validation metrics, or we can change this to the test set; we also had a test set within that, so we can get a good sense of what’s going on even within the process right here.
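Two of the classification metrics mentioned, the area under the ROC curve and the K-S statistic, can be computed directly from labels and scores. The sketch below is an illustration with invented data, not the tool’s code: AUC is taken as the share of positive/negative pairs that the model ranks correctly, and K-S as the largest gap between the cumulative score distributions of the two classes.

```python
def split_scores(y_true, scores):
    """Separate scores into those of positive and negative records."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    return positives, negatives


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count half)."""
    positives, negatives = split_scores(y_true, scores)
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def ks_statistic(y_true, scores):
    """Largest gap between the cumulative score distributions of the two classes."""
    positives, negatives = split_scores(y_true, scores)
    largest_gap = 0.0
    for cutoff in sorted(set(scores)):
        cdf_pos = sum(s <= cutoff for s in positives) / len(positives)
        cdf_neg = sum(s <= cutoff for s in negatives) / len(negatives)
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap


# Invented validation labels and model scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

auc = roc_auc(y_true, scores)
ks = ks_statistic(y_true, scores)
print(round(auc, 3), round(ks, 3))
```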
So with that, I will take you back now to the deck. So hopefully what you’ve seen in terms of Driverless AI is automatic AI and machine learning in a single platform. And as I mentioned before, the concept of AI to do AI. It delivers insights and interpretability out of the box. So you saw that as we looked at the auto report capabilities, and as we looked at the very in-depth machine learning interpretability techniques; we just touched the surface with those really. And it’s customizable and extendable with your own domain knowledge and your own IP. And again, we just scratched the surface, but hopefully it’ll give you a taste or a flavor of how you could use Driverless AI with your own models, your own scorers or your own transformers, on top of our models and transformers, to really make your own AutoML.
Hopefully that’s given you some interest in terms of how AI and AutoML can help in a business context. But also what role Driverless AI might play as a product. And if you want to find out more, we have a lot of resources on our website, a wealth of materials that we have there. As I mentioned before, we’re engaged in the open source community.
So we continue to do meetups like this one online, we have webinars, you can see there, and we will continue to do physical meetups again, when that becomes possible. As part of our open culture, you can get a 21 day free trial of Driverless AI, so you can go to our website and actually get hands-on with the tool that we’ve just shown. Or you can take a two hour free test drive and look at our tutorials. There are some very detailed and in-depth tutorials which are free to use, just on our website. Or you can look at some of the solutions, some of the use cases as well that we have there. So many ways to engage and to get a deeper understanding of what H2O.ai do.
And then in terms of Driverless AI, we have thorough documentation too, here on our website, and we have a Stack Overflow community and a Slack community for Driverless AI as well as our open source products. So if you’re using our open source products, I encourage you to connect with the Slack community that we have to help with problems there. So many ways there to engage. We’ll get to questions in a minute. But before we do that, I just wanted to say thank you for your time and attention. We really appreciated having you attend this meetup. And if this had been a physical meetup, maybe we would have had pizza and beer, so I’m sorry that we couldn’t do those. Maybe we would have had stickers. If you are based in Australia or in New Zealand, and you want some stickers then you can reach out to me, and I’ll do my best to send them out, but I can’t send you pizza or beer through the mail, I do apologize. So with that, I will hand back to SK. And we might take some questions.
SK: Wonderful. Thank you so much, James, for a great presentation. We have a few questions in the chat window, let’s try to get through most of them in the time we have. The first one goes, how do you get real world experience in understanding the practicality of AutoML algorithms?
James Orton: How do you get real world experience? Okay, that’s a good question. So I would say that a good thing to do is to try and build an ML project for yourself. And one of the ways that you can do that is through Kaggle. So we didn’t really talk about this in any detail, but at H2O.ai, we have 10% of the Kaggle grandmasters working for us. So they input their IP into solutions like Driverless AI. So when we’re talking about that AutoML engine, that has their IP within it. So as a company, we believe that Kaggle is a great place to learn, and it truly is. So get a project, get on Kaggle, get some data, and just try and see what you can do. And that’s the best way to learn, I believe.
SK: Wonderful. The next question asks if we can use the platform that you just showed with R and Python? And how does it compare to your immediate competitors?
James Orton: Okay. So yes, you absolutely can. So we walked through the GUI today, but much like our open source, where we have a GUI called H2O Flow but also have integration with R and Python, the same goes for Driverless AI. So we have the GUI, which is very nice to use, and I do enjoy using it, as well as the ability to call Driverless AI from R and Python in the notebook environment that you’re comfortable with, embedded around the rest of the code that you perhaps write for data preparation or other things. Absolutely, you can use it within those environments too.
The second part of your question, how do we compare to our competitors. I mean, hopefully we’ve shown you that through the demo. I would say, take a look at Gartner, how it positions us and the things that it says are strong for us. And hopefully, we’ve shown some of those being things like our interpretability capabilities, our abilities to customize the tool, accuracy and high performance in terms of ML components. I would say try it and see for yourself. We have a 21 day free trial. Hopefully through that experience, you will see the capabilities that we have as a company within this area are very, very strong.
SK: Thank you. The question asks, is programming knowledge compulsory? And if there’s an alternative to that.
James Orton: Is programming knowledge compulsory to be a data scientist, I’m guessing the question is. Well, I think we touched on this a bit earlier. I think that the role of data scientist is constantly evolving, right? So if you think maybe three years ago, I would say that it definitely was and you should be a proficient coder in R or Python. And I still think it’s invaluable to have coding experience, there’s no doubt. But with tools like Driverless AI, you can build very, very good models without having that coding experience.
So let’s say within a team environment, perhaps it’s good to have a mix of people, some with good coding skills and some who perhaps focus on other things like understanding the domain, understanding the use case. But yeah, I would say it is possible to have a career without coding. But coding still plays an important role in many scenarios.
SK: Thank you, James. The next question asks, can explainable models be developed using AutoML?
James Orton: Absolutely. So hopefully we’ve shown through the tool that, for what would have been considered fairly complex tree-based methods in the past, there are now very powerful agile, sorry, post hoc, explainability techniques that can be used, like Shapley values. So for your complex model, there’s a lot of explanation that we can now build in post hoc, so that’s one route. But also within an engine like Driverless AI as an AutoML tool, you can say I only want to do GLM, right? So if that’s what you want to do, that’s the type of interpretable model that you want to build, then absolutely, use Driverless AI, use the engine, to develop features, to do tuning, but limit it to just some simple models like GLM. That’s absolutely possible.
SK: Great. The next question asks, how can machine learning… How is or can machine learning be leveraged for medical companies? Your quick take on that?
James Orton: All right, for medical companies. Well, I mean, it’s a good question. So a lot of our customers and prospects are in the healthcare and medical environment. So we look to some of the use cases there, particularly when we think about COVID-19. Really, with the use cases, the limit is the imagination and the domain. So yes, around COVID, some of the really big challenges are resourcing, supply chain optimization, looking at patient risk, and other things, but then again, we have other use cases that we see across the board, things like drug discovery. I would definitely say, when you think about AI, and particularly AI for good, health care and medicine are important fields and they’ll continue to develop and continue to be at the forefront of AI.
SK: Wonderful. Thank you. We have time for one small quick question. It asks, does H2O.ai have any partners in Australia?
James Orton: Yes, we do. So we work with MIP, in Australia. So you can reach out to those guys, they partner with us. There’s also me. So I am Australian based as well. So if you value the physical connection as we move into that world, hopefully, in the next few months, where we can all come out of our little boxes and re-emerge, you can reach out to me too. I’m here, I’m in Melbourne, but I travel around. I’d welcome the opportunity to go to New Zealand too, if that’s a possibility. So yeah, we’re here, we’re on the ground. We’ve got partners, MIP. They’re a great partner for us in Australia. Yeah. We’re available.
SK: Wonderful. Thank you. Thank you, James, for taking the time today and doing a great presentation. I’d like to say thank you to everybody who joined us today. The presentation slides and the recording will be sent in an email shortly. Have a great rest of your day. Thank you.
James Orton: Thanks SK. Thanks for your help.