AI in Snowflake with H2O AI Cloud
The Snowflake Data Cloud can be used to train H2O.ai models as well as score models at scale. Using these two technologies enable organizations to use models directly within the warehouse, so that both users and applications have access to real time predictions. By unlocking the predictive power of the H2O.ai models, many new and valuable business cases become achievable.
3 Main Learning Points
- Different ways to use the data within the Snowflake Data Cloud to build models.
- How to deploy and score data and when to use different approaches depending on the SLA requirements.
- How to use the H2O.ai Snowflake Integration Application in the H2O AI Cloud to deploy and monitor models.
What we're going to be looking at this morning is using H2O cloud with Snowflake. And we use that in a couple of different a couple of different ways. The goals for the session is to provide a high level overview really of what the integration is about. But how do you really use that integration to build models to put them into production, so they can be scored against data and in real time or in batch? And with both of those approaches, you know, there are different SLA requirements around how you, you know how you put them into production.
So if it's a real time, you might want very fast latency for the prediction. If it's a batch operation, you might have several very large models that can take a long time to process massive amounts of data and Snowflake. And so we'll give you some ideas on how to look at different ways to put those into that environment. And then also, we'll look at the integration application that is in H2O cloud, that helps us to deploy and to monitor these applications. So we'll be touching each of these points, as we go through this morning. When we look at the ecosystem of Snowflake, it really touches each of the points that we try to look at when we build some integration and deploy models.
So of course, it starts off with training, and you know how we take the data from Snowflake, generate features to create, you know, very accurate models. But that's only part of the, you know, part of that journey. It's not until the application can consume that in production, that really a business starts to get value out of the models that have been created. And so the deployment, part of this is a really strong story. What we want to be able to do is make it very simple to deploy the models, but also have a way to monitor them and ensure that they're doing the, you know, the correct type of predictions that the data and scientists team envisioned for that model. Integrating the models directly into Snowflake is, you know, one of the very powerful things that allow us to use both technologies from both Snowflake and from H2O, AISI as well.
And that integration allows us to move the model directly into Snowflake as a as an option. And that's the user defined function for UDF. But also put them into a mode where they can be called externally through the external function. And that can be useful for several different types of applications and things that Snowflake may might require. But all the way through, all the features that we talked about, are available for, you know, for getting explainability using API's or UIs, to generate the models and the end the outputs that we're going to see. Now, one of the things that's quite interesting is what the experience has been with customers that have used the integration. The key point for me is really on the last column, you know, things that used to take hours, and now being done, you know, in minutes. And that's quite key, right as, as the market is, you know, very competitive in most industries. Now, the idea of actually providing results just in time, at the right time in the right place, is key to differentiating yourself between other competitors. And the way the integration works allows us to do exactly that. We can call these models in real time because they literally look like a SQL stored procedure. That means that anything, can Snowflake, or any application that uses Snowflake, can get these predictions in real time. One of the things that one of our other customers told us was just how easier it was to inference or score, the data and the models in Snowflake. You know, previously, you know, it was a complicated process of extracting data, running it in maybe a different environment and then uploading the results back again. What this allowed us to do with our integration was removed that step, making it very simple to just put things directly into the environment to score without any other
integration or offloading of data. And that was key for foreign FUTA. Now, you know, we're going to talk a little bit about how we access the Snowflake data here. When we go to start training, the first thing that we will actually do is start with pulling data into Driverless AI to train. And there's a couple of different ways that we will do that. We'll look at using this from a H2O drive. And that's a common storage location that we have inside of manage cloud, it allows us to take data from Snowflake, put it into a location that can then be used by our auto ml engines, we're going to look at Driverless in particular today, we can also load data directly into Driverless, you know, perhaps you're doing some training on prem, or maybe you're doing some in the cloud. But you can also use Driverless AI to access the data using the Snowflake connector. And then, of course, as I've mentioned earlier, if you don't want to use a user interface, we can do all of this through a notebook as well. So, with Python API's, we're going to look at a couple of examples of how to actually do this directly as we go through. So, let's start with a short drive. It's an application that resides inside of managed cloud, and enables us to really easily share application data across the Driverless engines on the auto ml engines that reside in the cloud. The way we do this is by creating a configuration that can easily be set up inside of managed cloud, and then use that to import the data. This UI provides a freeform SQL panel that we can then take our SQL into. And that allows us to access those tables directly into Drive. So, a quick example of what this looks like, is with the with the panel here, I actually previously have created connector. And you can see on the on the darker panel here, I have a profile name of Snowflake, my user information, my account information to access the Snowflake, data cloud that we have, and then what warehouse I will pull this data from. Once I've created that set of credentials for this connector, I can actually go ahead and pull in data directly. And so, I do that by issuing this query, as I said, we have this freeform panel. So, you can sort of see there I'm doing a very simple select from a table with Lending Club. But the query could be much more complicated, you know, I could be doing a join from maybe two different tables I can be doing, you know, some very complicated things, including pulling data from, you know, an external table for example. The other field there is where I want this data to be stored. So, it's going to be stored inside of drive. And it's going to be called Snowflake H2O Lending Club. So, it's very easy just for me to find it in the demo.
Alright, so we're going to do a poll now. What is the key to using closely integrated platforms within your organization as the ability to easily use functionality unique to each platform, the ability to enforce security and organizational data policies, the reduction and time to move a model into production? If it's other and you're comfortable sharing what other features, you'd be looking for please write them in the chat as well.
All right, we're going to give everyone a few more seconds to get some answers then. All right. Let's see the results. Looks like an even split but the reduction in time to move a model into production is definitely the most popular.
Yeah, that's interesting stuff. There, you know, security is obviously very key with everyone. And what we've done with the integration is allow the, the credentials that are used by Snowflake to, you know, to govern what data you can use for training. And then, of course, when we go to train, we're pulling that data and building a model on it. On the deployment side, when that model goes into production, of course, you know, there are security policies around that still. And, you know, the production piece of that, that the whole reduction in time is actually a key part to the integration, which we'll look at later on in this in this session, because, you know, making it easy for a data engineer to now consume the model, you know, they might not necessarily be familiar with what columns you've used, or, you know, what features are actually being generated, or even how to go and use it directly inside of Snowflake. So, we spent a lot of time actually concentrating on on that. So it looks like we've got some, some pretty good, some pretty good alignment with those pieces. Let me carry on here. Okay, so that's a really interesting feedback. Thank you for everyone that voted there. When we use Driverless, you know, what we're able to do is pull that data into, you know, into us for, for training. And there's some things we can do in Driverless to make sure the data is being parsed correctly. And so there's a there's a, there's a field called fields options enclosed by and so what Snowflake will do is they'll put quotes around fields that have strings in them. And I put a little annotation here on the on the side, because it's a little hard to see, but it's a single quote, double quote, single. It looks a little funny when you put it into the, you know, when you write it out. But that's, that's how it looks. And then also, if it's No, how does that get handled? Now in most environments, and when you go to pull data, you will need to specify your, your account credentials for Snowflake so that we access the data that you're allowed to access your account. And one of the things to remember is to actually see these for next time. And so if we look at the Driverless UI interface, one of the things that we're doing here is actually pulling in some, some data to train on, you'll see here under the under the tab, we have a Snowflake option, which is the Snowflake connector. This is where we will type in our credentials. And also those formatting strings. And then just like in Drive, we have a panel where we can type in our SQL. Know, it can be much more complicated than the simple select, but it enables you to pull that data directly you can think of it as we're receiving the result set from the from the query. And then there's a checkmark to remember this, so you don't have to retype things for next time. So once we've got the data within sight of the inside of Driverless, what we want to be able to do, of course, is to train these models. And, you know, we quite often will see data scientists want to use other approaches and other than the UI. So for example, things like, you know, going through a notebook, maybe using snow Park as a way to do this. Because you, you might very well want to do some data prep step inside of snow Park and use some of the some of that functionality in snow Park to leverage the scalability of the data cloud to prepare the data, maybe you're going to, you know, average a set of columns or do some data, some data prep work for you actually give the data to drivers or to drive to train them.
The other thing that you can actually do is if you're using the Snowflake external function integration, is you can actually initiate training from a select statement. And so you sort of see a select statement there as an example, where we basically issue a select with a function called H2O train. And we're going to tell it we're going to build a model. We specify the table name In the target column, and we can specify some of the other options that drivers will you will use. If you're familiar with some of the Driverless options use, you can select things like time interpretability accuracy. And so you can define those in the definition as well if you wanted to. Clone is actually an option where we use Snowflake to basically snapshot the data we're going to use to build this model. And that's really useful for industries that are heavily governed or regulated, that require, you know, the actual data that was used when you know, when a model was trained. And then lastly, what will this model be called, once we've actually finished the training exercise. And so pipeline dot Mojo is actually one of the default name. But you could put something there that was more meaningful, and you know, in line with the business, maybe risk model or customer churn dot Mojo or something like that. Either way, whichever way you choose with the rest of the UI or the API, the model accuracy the same, the amount of time that we spend training is the same, just whichever interface you're more comfortable with. So let's look at this example of using snowpark. As I mentioned, you know, a lot of data scientists are super familiar with doing things in Python, sometimes they will actually prefer doing that. Just because of the level of control and familiarity they are with, with it over the different tools. But what we start off with is, you know, first of all, this, this is actually auto generated. And I'll show you how to do that actually, in the integration, our application, we can generate code for you for a particular model that you can then use to, you know, either score or build or use the model in some, you know, in some fashion within your environment. So this is auto agenda and see on line nine, we're actually pulling in the Snowflake snowpark, library snowparks, a really amazing API, it enables the data table that we're going to access the data basically look like a data frame. So it's super, super familiar to data scientists that uses Python. But it makes it very simple to consume the data we need, we're actually going to set a couple of things here on line 16. Basically, we're Driverless is sitting in we're using drivers on prem in this example. But we're going to deploy through into the cloud and pull data from Snowflakes data cloud into us. So we have to tell where as we have Driverless is living. We also need to specify exactly where we go and get the data from our Snowflake account. So lanes 21 through 28, basically just help to define the, you know the credentials and the warehouse that you want to use. We can sort of see that, where we where we start using the snowpark API's directly. Line 32 is where we actually start to now get a session to Snowflake that we use to grab the data. And we see on line 37, that this really is a data frame. And what we're going to do is take that data frame, we could do some more interesting work here. As I've mentioned, we might want to manage some data, maybe enrich it with some other data, I have maybe a file. But in my case, I'm just going to do a really quick to CSV. And then we're going to push it into drivers, which is what we see on line 42, where we actually connect to Driverless and then push the data up into a dataset that Driverless has access to see their online 43.
Now, if you're familiar with the way Driverless trucks, the things we call building, a model and experiment, and so we're going to preview an experiment, which will allow drivers to look at the data, maybe set some defaults, tell us what it's going to do. And that's what line 49 through 57 order, it's going to basically going to tell us what it thinks we should be doing with the state or the types of models that's going to include there's a contract override, you'll notice there on line 56, where data scientists that was familiar with the API's might want to specify certain options. I've skipped this for the moment and On line 59, I actually go straight to creating the experiment that will build the model. And on line 70, we see we're now going to measure these results. And these results allow us to see how accurate the model was. And at this point, the model is now ready to go ahead and be deployed, which we'll look at in a minute. And so deploying models, as I said, is actually a choice this this numerous different SLA requirements that you have for scoring. And so when I go to deploy models, some of the things I want to think about is, you know, where do I want this model to live? Do I want to live inside of, you know, the Snowflake Data Cloud, which will be a UDF? Do I want to just call it externally. And I might want to do that because of the SLA the type of language the models written. And so in my mind, I kind of, you know, have this little decision tree if you like. And I think well, what is the model written, if it's written in Python, it's going to be an external function. Because the Python library has a lot of native libraries that need to exist. In order for the models to be scored. An example of using an external function might be, you know, I've got a manufacturing plant that's using Snowflake. And they're taking pictures of images of components being built. And they might want to actually make predictions on whether those components appear faulty based on an on an image, that unstructured data and the image needs a number of native libraries that execute outside of Snowflake for, for their prediction. And that would be an external function. The external functions also great for real time. And it's kind of counterintuitive than that. The external function actually is highly concurrent with multiple connections, connecting to the endpoint to score and make inferences, it can do that very, very quickly, you will find several examples in other webinars where we show, you know, several seconds to score, you know, 10s, of 1000s, of, of rows. But the UDF is really very, very powerful. And so if the if the UDF is available, and I've got a mojo that's going to score some structured data, and a Mojo is a model object optimized, it's a Java artifact that we that we produce, then putting the model directly into the UDF, basically moves the model to where the data lives. So there's no data movement. And it will scale the same way that the warehouse scales. So this is a super powerful way to get the model into that environment as well. We'll look at both of these as we go through the example. So you know, here, what I've done is I've uploaded a model, we'll see this in a demo in a moment. But we can actually upload directly into Snowflake, this goes to a staging table, that is hosted in Snowflake, and I'm given an option, I've selected the UDF. Literally at that point, I'm able to use the model directly for scoring. And what this allows us to do is to basically have a sequel call, maybe I'm doing you know, an update function or a select
to select some data, I can have the model as part of that SQL function call. And you know, when you think about the power of the strike for any application or downstream process, it now means that those applications can get the data from Snowflake and a prediction that is current based on that data right then and there when they do the query in older ways that would have done this before, would they have would have extracted the data, done some processing, created some predictions on the data and then upload those predictions back into the environment, that Snowflake environment. The problem is, is that that's great on day one, but seven days from now, that prediction could be stale, right, because the data has changed. You know, there might be new account information, there might be a new number of SKUs are available in the warehouse new risk factors if it's a risk model. So being able to access the data in real time, and get the predictions then in there when you need them is really what the power of the scoring integral ration brings you we can score that through the SQL course, or we can score it and schedule it through the UI. And we also provide a API that allows us to score them directly and inside of the Manage cloud environment. So really quick look, before we look at the, at the demo, when we go to score a model that was previously uploaded, one of the things that we go through and do is we auto generate the query based on the model. And we do that because you know, a data scientist is quite often handed off the production steps to a production team, maybe an application programmer, a data engineer. And rather than having to go through and explain what columns that are needed, or how to call the model, we generate that for them. So it's very simple to then take those examples. And either implement them directly, or add any tweaks they need for their environment. So that it's simple to consume the model, and reduce that time to production, right, get that time from reduction, shorter, more concise. And, you know, here's that example of calling this through the Manage cloud, what we actually do with this model is we make a call, you'll see here on line 11, through 16, basically, all we do is you tell us the model you want to call the file, and that was a, that could actually be either a file or a data frame. And if you don't have a number of rows on line 40, which is the entire file, and then on line 20, we actually run this asynchronously to go off and call and score and get the predictions back, that enables us to scale, you know, very, very quickly, and handle massive amounts of calls that are being made to the score. Now, as I say, really, you know, once you go into production, getting into production is only half the problem, right? You know, we've now got it into production, we want to make sure that the model is actually operating the way that it was intended for this business problem. So we want to collect metrics that enable, you know, a operations team to understand, you know, the latency if there are any errors. But we also want to gather statistics that the data science teams can use, so that they know that if there's data that shifted, if the calls are using the wrong columns, there's things that are missing, maybe the data's piles differently. And then operationally, we also want to understand what the cost of that execution is. So we gather those metrics and make them available through the tooling as well. You can see here, I'm actually looking at the integration app. And I'm looking at a Java UDF. I'm looking at a risk model dot Mojo. And what we've seen is on this from between the start and end date, how many times the model was invoked,
how many rows were actually score processed by the bottom model, what the credits used within the Snowflake environment was, and then the average latency for each of those calls, this is a Nanos rates 3.4 seconds to do this. And that would have been across the entire table. So making that data easily accessible, especially over different date ranges. So maybe you want to compare, you know, data changes from this month to last month, or a different version of a model. All that data is there, and you can query it through the tool. So let's look at this. Take a quick look at what this what this looks like from the from the tools that we've that we've mentioned. So I started a copy of Driverless earlier as well as the drive and the and the Snowflake app. So let's start with Dr. We looked at this earlier on, and what we had, what we had done was we have looked at the internal connector itself. And so this was a way for us to look at the connector. Make those configurations changes, and then pull in the data. Now one of the things that we can do at this point too is we can also refresh the data as well. So if I'm pulling in some data, maybe for some monthly cadence of building a model, I'm able to actually do this through this tool. Now, if I've gone through and built the model, with Driverless gives me the option of either pulling the data in directly from drive and see here, I can pull it in from Drive, you'll see that the data that I pulled in on the Lending Club from my query, and drivers have been all here. Or I could actually go through and pull it in through the Snowflake connector as well. And sort of see here, because I clicked remember the savings, everything is here, all I need to do is re execute this as, as needed to pull the data. And once I have the data, I'm able to visualize the data. Look at the details of what's been pulled in, let's look at the data very quickly to look at the shape of the columns that have been brought in. These will be you know, some of the some of these will be turned into features for the model, some of them will be dropped. But I'm able to do that through here. We're not going to go through the building step today. And there's a number of presentations around that. But what I wanted to do was look at what happens once the model has actually been built. And so one of the things that we want to be able to do is to take this, take this model, once we once we're happy that it's been built and validated by the data scientist, I can download either the Python scoring pipeline, or the Mojo and go ahead and deploy that into the environment. Each of these have different characteristics. So for example, if it's a an image model, it's going to be a Python pipeline that will be generated. If it's something like some structured data, depending on the models and some of the settings that are used, it could be a mojo, in addition to a Python scoring pipeline. And that's one of those first things that are used to decide how I want to go ahead and deploy this depending on what artifacts are available. So in the Snowflake application, if I sign if I sign in to this environment, what I'm able to what I'm able to do is to pick the approach that I want to use for scoring. So the first thing that happens when I go to sign in, as I'll be asked to pick a particular warehouse, I'm going to put it into the Lending Club. And I can type these in or I can select them from the drop down, it's, it doesn't either work through that through the tool.
And what I can do now that I have connected is go ahead and deploy the model that we that we selected. If I go to, to browse a model quickly, just find one on my desktop. What I can do at this point is just pick up a model.
And I will find a really easy way just to take this model. And I'll just drag and drop it onto the environment.
Now, when I've uploaded that now what happens is it asks me how I want to deploy this model. And again, depending on my SLA and the you know, the type of model artifact that's available, you know, Python, or Mojo sort of helps direct me to which one I want to use. I'm going to use the UDF because it's very simple to go through and deploy. And I'm going to I'm just going to call this tests dot Mojo. And now I've got one already, and I'll go ahead and deploy this. Now what we actually do behind the scenes here is we upload it into the staging table. We define a function For this model, add in the integration code that's needed automatically. And deploy it once it's deployed. It's available now to anything inside that Snowflake environment. And that enables us to very quickly get into a production state, but also enable other applications to use it. And that's it, it's actually now deployed, and, and live. Now, one of the things I might do at this point is want to pass this to a data engineer. And so the data is new, you might want to do some data prep or a program, I want to maybe call the model and a variety of different ways. And this is what auto Gen is used for. So auto Gen enables me to pick my model. And you can see we've got a number of models here, let's just pick this risk model we started the demo with, by list the artifacts that are available here. What I might do, for example, is I might want to give it to a data engineer that uses DBT. So we can generate a DBT resource. And that will be available for them to download and use a materialized view of this model. We might, for example, want to generate a sequel worksheet or a Python notebook or snowpark. As we saw in the in the slides earlier, that's all available, all available through here, right number of different options to pick up different snowpark. It's no sequel to consumer model that will allow me to do everything through the API to predict or even schedule predictions. But you could also go through and make those predictions here as well. So if I pick up, if I click on predict, and I pick up the model that we looked at earlier, what it's going to go through and do is enable me to score on a particular model, or set when that should happen. Now, one of the things that we want to, of course, go through and do is once the prediction has happened is how do I go through and pick my table service, pick a Lending Club table we want. And what we want to ensure we do is get the monitoring data around how that prediction is occurring. And that's, that's key, because we want to ensure that the model is scoring directly the way that it was intended. And that's what we have with the monitor piece here. So I'm just going to go back a week or so a couple of weeks, pick up a model, I can pick the model that I want to score on, or get the details or save the scoring. And what we do now is get an understanding of if there were any errors in scoring time, or if there are any things that that happened, maybe data transformation problems with the data. And so you have a little explanation about what this monitoring does. Basically, this was invoked 35 times how many rows the credits, there were there were no errors that were that happened. But if there were errors, these are the things that the data science team would need. An example of an error would be no, the data was, you know, it was when after it was passed by Snowflake there were, you know, 20 columns, and only 19 were expected, right? So we know there's some data shift, some data formatting issues going on there.
And so that will, that will be here. And of course, all the errors between those dates are available through there as well. So that shows us how to go through and do that deployment piece, which is which is really key and monitor all the data. So let's just let's just recap of what we saw through that demo. So we can see how easy it is to pull data into the Manage cloud to train. And then of course, that deployment piece, and scoring at scale, being very simple to do, regardless of the way that we want to do any of that scoring It's so critical that we provide monitoring data so that everyone knows exactly how the model is performing. How you know how, how long it takes to score, if there are any particular errors that are occurring. And, and the cost. That's really quite key to the operationalization of those models. You know, the last thing I'd invite everyone to, you know, try this out, try this within your environment. There's a link to a free trial. It's very simple to get started to Snowflake AKMS. Also there as well for people to try. When you sign up into the into the free trial, just type in Snowflake and you'll get all the all the pieces that we've looked at through today that you can walk through and build your own models and see how it works. So I'd like to thank everyone for joining us this morning. Hopefully this has been useful for people to get an understanding of the level of integration that we have with the Snowflake Data Cloud.