H2O TensorFlow Deep Learning Demo
In this demo, we show how to train a distributed TensorFlow Deep Learning model on a multi-node H2O cluster. We use a Jupyter Notebook loaded with the TensorFlow, PySpark, and H2O (PySparkling) Python modules.
- Connecting Spark to TensorFlow and H2O
- Setting up a Node Cluster using Spark
- Training TensorFlow to Model Data from H2O
- Using TensorFlow to Model Data as a NumPy Array
- Converting a Spark Frame to a TensorFlow Module
- Connecting a Weight Matrix to the Input Pixels
- Downloading a TensorFlow Model as a Java File
- Adding Logic to Weight Matrices
- Reducing Error by Training TensorFlow
- Reducing Error by Predicting Test Frame
- Connecting TensorFlow and Python
Arno Candel, Chief Technology Officer, H2O.ai
Read the Full Transcript
Hi, I'm Arno Candel. I'm the Chief Architect of H2O, and I'm going to walk you through an H2O TensorFlow deep learning demo.
Connecting Spark to TensorFlow and H2O
So first, you will need to install TensorFlow. Usually, you will do that with a pip install, as a Python package. Then you will download Sparkling Water, which itself requires Spark. In this case, I have the latest version of Spark, 1.6.1, and the latest version of Sparkling Water, 1.6.5, installed. Well, "installed" for Sparkling Water just means that you have this folder somewhere on your machine, with its contents sitting here. All we need now is an IPython notebook, or Jupyter notebook, which is this notebook here on GitHub in our Sparkling Water repository. When we run it, it will connect H2O to Spark, Spark to TensorFlow, and then back to H2O. I'll show you more about this, but let's first get this file into our environment so that Jupyter can read this IPython notebook file. The link to the file is here, the same as the raw link here. We're going to download that. It's just a small script, and now we can start our Sparkling Water environment using the PySpark front end. In this case, it starts a Jupyter notebook for us, and we point it to the script that we just downloaded. And here we go. Now we have a Jupyter notebook that is the same thing you saw on the GitHub page, except now it's live. So I'm going to make this just a little bit bigger.
Setting up a Node Cluster using Spark
And I'm going to start with the actual connection to H2O. So let's connect to H2O. In this case, we actually don't have an H2O cluster running for this demo. I'm going to start a fresh one here on our laptop here. And what it does is it launches a three-node cluster because master was set up to have three executors. This was an environment variable that I showed you earlier on the command line there. And what it's also doing here, it's downloading the MNIST dataset, both the training and the test files from S3. And it just shows us the status of the cluster. So it's three nodes, of course, they're all virtual nodes on my laptop, and it shows you that it parses these two files.
Now we'll tell TensorFlow to use those three nodes and to run a Hello World program on them. That works. So now we have basically shown that we can run in a Spark context, we can parallelize over nodes, each node runs something in the map, and all of this from a Python environment. So this is the same as doing it from PySpark. But the difference now is that we have the data inside of H2O, and we can turn this data from H2O into a Spark frame, where it doesn't actually make a copy, it just exposes it as a Spark frame. Every time anybody asks for data, it's going to ask our H2O environment for the data. So you could have hundreds of gigabytes in H2O, possibly the output of other models, of ensembles, of GBMs, of GLMs, of munging that we did in H2O (joins, group-bys, anything), or data parsed from Hive or SQL databases.
Training TensorFlow to Model Data from H2O
And now suddenly you can train a TensorFlow model over it with the data coming from H2O. So this gives you access to H2O's data, and it also makes the parallelization of the model easier for you, because we can do it internally. And you can also see the metrics of the model once it's trained. I'll show you that a little later. So first we'll have to define the TensorFlow deep learning model. In this case, we set up a neural net with two hidden layers of 50 neurons each. This is just one of many ways to define this model, but basically, we're setting up randomly initialized variables for the weights and the biases. There are 784 input variables. These are the pixels of each of those handwritten digits. MNIST is a handwritten digit database with all 10 digits as potential classes that each instance can fall into.
So this is a 10-class classifier, and we have rectified linear units as activation functions and softmax for the output. And we use gradient descent to optimize this problem, using a constant learning rate. No momentum, no adaptive learning rate. So this is a very crude model, but it shows you that you can basically program TensorFlow any way you want and still give it H2O's data. So how does it really work? Well, it does it in batches, where each batch is given data using this next-batch method from the training data. And this will all be defined a little later down the road.
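To make the architecture concrete, here is a minimal NumPy sketch of the network described above: 784 inputs, two hidden layers of 50 rectified linear units, a 10-way softmax output, and a plain gradient descent update with a constant learning rate. It is not the notebook's TensorFlow code. The names `forward` and `sgd_step`, the initialization scale of 0.1, and the zero-initialized biases are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the talk: 784 pixels, two hidden layers of 50, 10 classes.
sizes = [784, 50, 50, 10]

# Randomly initialized weights and zero biases (the 0.1 scale is an assumption).
weights = [rng.normal(0.0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x, weights, biases):
    """Return class probabilities for a batch x of shape (rows, 784)."""
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(a @ w + b, 0.0)                    # rectified linear unit
    logits = a @ weights[-1] + biases[-1]
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)               # softmax

def sgd_step(params, grads, learning_rate):
    """Plain gradient descent: constant learning rate, no momentum."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

probs = forward(rng.random((4, 784)), weights, biases)
print(probs.shape)  # (4, 10)
```

Each row of `probs` sums to one, so the largest entry in a row can be read as the predicted digit.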
Using TensorFlow to Model Data as a NumPy Array
Here is the next-batch method. It says, "Give me random data from my frame, as many rows as this n says," and it also expands the response and normalizes the data. So this is a little bit of data preparation that could have been done by H2O in this demo. We just didn't do it, for simplicity. And as you can see, it turns everything into a NumPy array, just the way TensorFlow wants it. So we are getting data from H2O and turning it into a NumPy array. And there's actually more data here that's coming. So you might wonder, how did this actually happen? Well, you haven't seen the code yet. The actual code that does it comes here: row data, for an iterator. This is going to happen inside of a map, inside of Spark. So every partition itself will get this iterator, and it needs to somehow make a row out of each position of the iterator.
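A minimal sketch of what such a next-batch helper might look like, assuming the pixels arrive as integers in 0..255 and the response as an integer digit. The function name `next_batch`, its signature, sampling with replacement, and the division by 255 are illustrative assumptions. The talk only says that the data is sampled randomly, the response is expanded, and the values are normalized.

```python
import numpy as np

def next_batch(features, labels, n, num_classes=10, rng=None):
    """Sample n random rows, scale pixels to [0, 1], and one-hot encode the labels."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.integers(0, len(features), size=n)    # random rows, with replacement
    x = features[idx].astype(np.float32) / 255.0    # assumed normalization
    y = np.zeros((n, num_classes), dtype=np.float32)
    y[np.arange(n), labels[idx]] = 1.0              # "expand the response"
    return x, y

# Stand-in data with MNIST-like shapes (not the real dataset).
rng = np.random.default_rng(1)
pixels = rng.integers(0, 256, size=(100, 784))
digits = rng.integers(0, 10, size=100)

x, y = next_batch(pixels, digits, n=5, rng=rng)
print(x.shape, y.shape)  # (5, 784) (5, 10)
```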
And this row data is actually the method that turns the data into this NumPy array. But row data itself is given this iterator, and the iterator is coming from this mapPartitions call that says, "Hey, please run over my partitions in my Spark environment, and for each one of them, call this train NN method," which is this method, which then calls row data inside. And what does row data do? Well, it gives me back this array here that's been filled up by H2O. And actually, it's not H2O for you here. You see self, right? So this shows basically that this is coming from the Python environment, which itself got it from the Spark environment, and the Spark environment got it from H2O. So all these directed graphs are active at this point.
The data eventually comes from H2O. Here it looks like Python is giving it to TensorFlow, but Python itself got it from Spark, and Spark got it from H2O. The magic is that we are basically filling in data for Python here, and it's all serialized by the whole infrastructure of the PySpark environment. And PySparkling just exposes it to H2O. So this is not very H2O-specific, but it shows the power of the integration. There's very little code that has to do with H2O, and yet you can access all the data that's in H2O on a distributed cluster. So this is happening on every machine. It's going to train this model, and here is where it actually trains it, in create NN, and it builds a model on this data for this many iterations.
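The per-partition pattern can be sketched without a cluster. Below, `row_data` and `train_partition` are hypothetical stand-ins for the notebook's methods, and a list of lists stands in for Spark partitions. With a real RDD, the same function would be passed as `rdd.mapPartitions(train_partition).collect()`. The "model" returned here is only a placeholder (column means), not a neural net.

```python
import numpy as np

def row_data(iterator):
    """Materialize the rows of one partition as a 2-D NumPy array."""
    return np.array([list(row) for row in iterator], dtype=np.float64)

def train_partition(iterator):
    """Consume one partition's iterator and yield one result for it."""
    data = row_data(iterator)
    # Placeholder for training: the column means of this partition.
    yield data.mean(axis=0)

# Three partitions, as on the three-node cluster in the demo.
partitions = [
    [(1.0, 2.0), (3.0, 4.0)],
    [(5.0, 6.0)],
    [(7.0, 8.0), (9.0, 10.0)],
]

# Local equivalent of mapPartitions(...).collect(): one result per partition.
results = [out for part in partitions for out in train_partition(iter(part))]
print(len(results))  # 3
```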
Converting a Spark Frame to a TensorFlow Module
And using this batch size. Create NN was defined up here, where it actually builds the model, then it initializes the variables, and then it runs it for this many iterations. And each one here is getting a batch from this data using this next-batch method here. So if you look at it a few times, you'll make sense of it some more. But what really happens is, let me run this now, let me define this, and then we'll see what really happens. So here we are converting it into a Spark frame, importing TensorFlow as a module, and defining these methods. And now we are defining the training method, and we're actually running it too, because this is where it runs it. So now you should be able to see here that it's mapping over the data in the Spark context, and each one is actually calling Python inside.
You see here, Python, and already the result is back. And now we can see what the results look like. Well, this collect method here at the end is actually collecting all the output of each of these maps. And you see that we have three of those. So that means our three nodes each have collected their result, and each of those results has six things in it. And the six things that are in it are being returned by TensorFlow here. That's the model. It's the three weight matrices and the three bias vectors. These are just coefficients, and we want to see maybe how big those are.
And we are also going to average them, such that we can take the three models that we got from these three nodes and average those weights and biases. So that's what we are doing here. We are computing the average, we're dividing by the number of nodes, and we are then turning these weights and biases into H2O frames. Basically, at this point, we're uploading data into H2O from this Python console. So it only works for small data, but no neural net model is actually so big that you cannot copy it. These are only a few megabytes. So in this case, we upload a few kilobytes from this environment here in our Python console to the H2O cluster, which just happens to be on the laptop as well.
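The averaging step can be sketched as follows, assuming each node returns a list of six arrays (three weight matrices, then three bias vectors) with matching shapes. The function name `average_models` and the constant-valued stand-in models are illustrative only.

```python
import numpy as np

def average_models(models):
    """Average corresponding arrays across the per-node models."""
    n_nodes = len(models)
    # zip(*models) groups the first array of every node, then the second, and so on.
    return [sum(arrays) / n_nodes for arrays in zip(*models)]

# Shapes for a 784-50-50-10 net: three weight matrices, then three bias vectors.
shapes = [(784, 50), (50, 50), (50, 10), (50,), (50,), (10,)]

# Stand-in results from three nodes, filled with the constants 1, 2, and 3.
models = [[np.full(shape, float(k)) for shape in shapes] for k in (1, 2, 3)]

averaged = average_models(models)
print(len(averaged), averaged[0].shape)  # 6 (784, 50)
```

Averaging weights from independently trained nets is a crude way to combine them, which is consistent with the high initial error reported later in the demo.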
Connecting a Weight Matrix to the Input Pixels
But you see here, the first weight matrix has a dimensionality of 784 by 50, then 50 by 50, and 10 by 50. That's the output, 10 neurons. The fact that it's transposed is just due to some internal bookkeeping in H2O to make it more efficient. But you can see it as just a matrix connecting the input pixels to the first hidden layer, and then the first hidden layer to the second hidden layer. And then the third hidden layer, sorry, the second hidden layer connects to the third layer, which is the output layer with the 10 neurons. And for each of those hidden layers and the output, there's a one-dimensional bias vector. That's just how deep learning works in this case: a fully connected feed-forward neural net. So now that we have these weights back in H2O, we can create an H2O deep learning model that is created from these weights. So this is just filling up the model state from the TensorFlow output.
And we're not really training. We're passing epochs equals zero, which means it's not training at all. And we are also telling it that we want to compute the variable importances. So if you looked carefully at the very beginning, when we started H2O, it told you where you can connect to this cluster's user interface. When I click on it, you'll see Flow here. This is the familiar user interface that H2O offers. And now I can look at the models that we have. There is this model, and I can actually inspect it. So it hasn't trained at all, at least not as far as H2O is concerned, but I can still see the parameters. I can still see that there are two hidden layers of 50 neurons and that we were given initial weights and biases. And now we can actually look at the variable importances that come out of TensorFlow, and the confusion matrix.
Downloading a TensorFlow Model as a Java File
We can look at the status of these weights and bias values. So you can see that the mean weight is somewhere around zero, which makes sense. We also have two hidden layers of 50 neurons, 784 input neurons, and a total of 42,000 weights and biases. So about half a megabyte of model size. And a good thing here is that you can look at this model that came out of TensorFlow as a Java file. So this is now a Java version of this TensorFlow model, with the entire logic coded up in one flat file. And you can't see the whole thing here because it's a little big in this preview. So we are going to download this real quick so you can look at it better. The instructions are right here. So let's ask this H2O server to download it.
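The figure of roughly 42,000 parameters follows directly from the layer sizes. A short check, using only the sizes stated in the talk:

```python
# Layer sizes: 784 inputs, two hidden layers of 50 neurons, 10 outputs.
sizes = [784, 50, 50, 10]

# One weight per connection between consecutive layers.
n_weights = sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))

# One bias per neuron in every layer except the input layer.
n_biases = sum(sizes[1:])

total = n_weights + n_biases
print(n_weights, n_biases, total)  # 42200 110 42310
```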
Adding Logic to Weight Matrices
So now we can inspect the code, and when I go to score0, that's where the actual logic happens. So it'll say, fill up my predictions with zero, fill up my hidden activation values with zero. This is the input activation, I should say, sorry. And it's going to fill up the input. It's going to fill up the activations with zero here, and then it's going to do the forward propagation in these for loops. So it's going over all the hidden layers, and it's going to go over all rows, and rows in this case are the neurons that connect to each other. And then there's another one that is the columns, which is the other dimension of this weight matrix. So it's a two-dimensional weight matrix that's being walked over to add up contributions to the output. So this is a matrix-vector multiplication that's unrolled by hand to make it a little faster, with partial sums. And in the end, you see the activation here is exponential, for softmax, and this is the rectifier here. And at the end, you'll have a small external piece of logic that was included in the beginning.
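The forward propagation described above can be sketched in plain Python loops. This is an illustration of the idea, not the generated Java file. The function name `score`, the row-per-neuron weight layout, and the tiny two-layer example are assumptions, and the hand-unrolled partial sums are omitted.

```python
import math

def score(pixels, weights, biases):
    """Forward-propagate one input; weights[layer][row][col], one row per neuron."""
    activation = list(pixels)
    last = len(weights) - 1
    for layer, (w, b) in enumerate(zip(weights, biases)):
        out = [0.0] * len(w)
        for row in range(len(w)):                  # each neuron of this layer
            total = b[row]
            for col in range(len(activation)):     # each incoming connection
                total += w[row][col] * activation[col]
            # Rectifier on hidden layers; the output layer keeps raw values.
            out[row] = total if layer == last else max(total, 0.0)
        activation = out
    peak = max(activation)                         # subtract max for stability
    exps = [math.exp(v - peak) for v in activation]
    norm = sum(exps)
    return [e / norm for e in exps]                # softmax probabilities

# Tiny illustrative net: 2 inputs, 2 hidden neurons, 2 outputs (identity weights).
weights = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
biases = [[0.0, 0.0], [0.0, 0.0]]
probs = score([-1.0, 2.0], weights, biases)
print(probs[1] > probs[0])  # True
```

In the example, the rectifier clips the negative first input to zero, so the second class receives the higher probability.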
Let's go up here in this file. This external jar just has some logic that says, when you make a label, please look at the thresholds and decide whether the prediction is such and such. For multinomial, it'll just pick the biggest probability across all 10 classes and assign that as the predicted digit. All right, so let's go back to the demo. So now that we have this model created in H2O, we can also look at the performance here from this Jupyter notebook. It's the same confusion matrix we saw in Flow, just rendered slightly differently. Hit ratios as well: they are also available here in the Flow output. And you could also ask for the confusion matrix's entries. So in this case, I'm going to ask for the last error, which is the overall error. So the 71% overall error is pretty bad. It's not a great model, as we can tell from this confusion matrix here. Many off-diagonal values, but some of the classes are pretty good already, so this just hasn't converged yet. Now the beauty of it is that not only can I get this model here in Java form, I can also continue training it inside of H2O. I could of course get it back to TensorFlow and train some more in TensorFlow. But in this case, I'm going to train it some more in H2O.
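Both pieces of logic mentioned here are short enough to sketch: picking the label with the biggest probability, and reading the overall error off a confusion matrix as the share of off-diagonal counts. The function names and the small 2-by-2 matrix are illustrative, not taken from H2O.

```python
def predict_label(probabilities):
    """Multinomial labeling: index of the largest class probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

def overall_error(confusion):
    """Fraction of rows that fall off the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total

# Rows are actual classes, columns are predicted classes.
confusion = [
    [8, 2],
    [1, 9],
]

print(predict_label([0.1, 0.7, 0.2]))  # 1
print(overall_error(confusion))        # 0.15
```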
Reducing Error by Training TensorFlow
And as it's being trained, I could also inspect it from Flow. But first, we're going to look at the overall error here. So now we're down to 8% error after a few seconds. And of course, if I refresh here, you will see that the log loss went down. The variable importance has slightly changed, and now the diagonal is much more exposed here. 8% error after training with one more iteration inside of H2O. So we started here with TensorFlow, which trained the initial model, and then after only one epoch, so only 60,000 rows of training later, we are down to 8% error. And of course, you could keep training more and more. You can continue training this whole model here, and it will then get to a much better error. So the way you would continue training is you can just say here, deep learning, you can set the training frame, and now you have to point out which frame you're training on. In this case, it's this one, because I didn't give it a name. This is the one where the response was set to an enum. The original one that was loaded here had the response still as a numeric value. And now I can set the actual target to be the last column here. That's the response column. I can set the hidden layers to 50, 50.
And let's say we want to train for 10 more iterations. Let's say we want variable importances too. Now I set the checkpoint to be the name of the previous model, and everything else we can leave at the default. And I say build model. Now, this is continuing to train the model that we just had above. I can look at this model. It's still in the same state. It hasn't actually started yet. It's still preparing the data, but now it's training. You can see here it's training at several thousand samples a second. I can refresh it, and you'll see the accuracy here getting better. So now we are at a log loss of 0.2 instead of 0.3. And now you see it as it starts to converge. These are still training metrics, but we can do the same with a validation set. If you had provided one, you would see twice as many output plots. And let's look at the confusion matrix. So we're down to 4% training error, but we'll make sure we have a test set error as well at the end. Luckily, we have a test frame, and in order to know what the test frame was, we can also go up here to the previous model.
Reducing Error by Predicting Test Frame
So we make sure that we have the same training frame. This py_1 one, that's what I remembered from earlier. That's why I used it for the second model. You can find both here. I used the same frame. And remember, the test frame was called test. So now we are done here. It built 10 epochs, and the error is going down nicely. So now we can predict on the test frame, and it'll automatically do the conversion of the numerical response into a categorical. And now we can look at the predictions themselves, or we can look at the metrics that come out at the end. And you can see that the test error is now 4% instead of 8%. And the actual predictions you can look at like this: you can see the data frame, or you can see the per-column statistics. So you can download this prediction set, or you can export it, and so on. The data itself looks like this. These are the predictions. So, very confident predictions. And of course, you can do all of this from Python as well. You can ask to get the model if you only know the model's name. For example, in this case, the model's name was what? This is the name of the model.
Connecting TensorFlow and Python
You can go to Python and you can say m equals h2o.get_model. And I'll put the name here. And now we have m. Now I can make another cell, and this will show us the status of this model, with this new confusion matrix of two and a half percent training error. And we can now get the performance on the test set. I think the frame was called test frame. Okay, so now we should get the 4% back that we saw earlier in Flow. So you can see that Flow and Python are synonymous, because they both talk to the backend server, and you can do the same thing from R as well. But this Python interface was used to get the connection to TensorFlow going, because TensorFlow doesn't have an R API, at least at the time of this recording. So I hope you found this informative. We have many plans to make this even more intuitive and easy to use. Extensions to image models, speech models, and sequence models, with convolutional neural nets and LSTM recurrent neural nets, are definitely on the roadmap, and we're going to make it easy for people to use TensorFlow in a distributed environment to train models and to score on large data sets in production using H2O and TensorFlow, and possibly other deep learning packages. Thank you so much for your attention, and I hope to see you soon. Bye-bye.