Arno Candel - Anomaly Detection and Feature Engineering
Arno Candel takes us through an Anomaly Detection and Feature Engineering tutorial. Don’t just consume, contribute your code and join the movement: https://github.com/h2oai User conference slides on open source machine learning software from H2O.ai at: http://www.slideshare.net/0xdata
Arno Candel, Chief Technology Officer, H2O.ai
Next on the roster is a very popular algorithm. It's called anomaly detection, right? Everybody does that. Outlier detection, again quoting Tukey, has been the fundamental grain where people wanted to find out where you have the widest <inaudible> in your data, and how do you find that out and how do you account for that? We'll bring back Arno for that. He doesn't need much introduction, so you get Arno again. That's all I can say.
All right, welcome back. Hope the break was long enough for you to get your legs stretched or something, your fingers extended. Alright, enough of this yoga. So we'll kill all the H2Os and we start again.
Just briefly, if anybody wants to have their VirtualBox at a better resolution, you have to download the VirtualBox additions CD image from the Oracle website. And then you have to tell VirtualBox to mount that as a CD inside of the VirtualBox environment. And in there you have to run a shell script. So it's a little bit elaborate, but the experts among you might be able to follow those instructions. At that point, you reboot the Linux inside with the updated kernel patches, so to speak. And then it will be fine, but it's not trivial. So I'm not going to show everybody how to do this, but if you're interested, we can help you tonight so that for the future you can have it in a better resolution. We can probably work out something in the back corner over there later. So if anybody wants that, please approach me later.
So now we'll get back into RStudio and we'll look at anomaly detection using Deep Learning to find outliers. And that is part of unsupervised learning. So we'll have to open a script called unsupervised something. Let's go look: H2O training, tutorials, unsupervised. There's also clustering and dimensionality reduction, which we don't have time for right now, but there's also not that much code in there. It's basically just running K-means and finding the optimal K, the number of clusters, for the Iris dataset, which is three. And that example will show that it can find that number three. And then there's dimensionality reduction using PCA, which will show, for the MNIST data set, the distribution of spectral components, such that after about 50 or a hundred of those main components, principal components, you have the bulk of the dataset understood.
It explains most of the variance of the dataset. So it makes sense, right? If you have your 784 pixel digits that I mentioned earlier from MNIST, you don't need all of them to explain the structure of the data, which is just whether it's a zero or one or two or three or four or five and so on until nine. You might expect a little more than 10 though, because it's not just 10. There might be little other structures in there. It doesn't need to know that it's just 10 digits and nothing else. There might be little blobs of people writing a little ugly or whatever. That has to be explained as well. So basically, since all the sevens and threes are not all the same, you need more than 10 eigenvalues, or eigenvectors, to explain that space. But let's say 50 or a hundred should be a reasonable number.
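To make the "50 or a hundred components explain the bulk" idea concrete, here is a minimal sketch in plain Python. It assumes you already have the eigenvalues of the covariance matrix from a PCA; the geometrically decaying spectrum below is made up for illustration and is not the real MNIST spectrum, and `components_needed` is a hypothetical helper name.

```python
def components_needed(eigenvalues, target=0.9):
    """Smallest k such that the top-k eigenvalues explain `target` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)

# Hypothetical spectrum: 784 eigenvalues that decay geometrically (not real data).
spectrum = [0.95 ** i for i in range(784)]
k = components_needed(spectrum, target=0.9)
print(k, "of", len(spectrum), "components explain 90% of the variance")
```

With a fast-decaying spectrum like this one, a few dozen components are enough, which is the same kind of reasoning used above to pick a bottleneck of roughly 50.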
So assuming we know that from the dimensionality reduction code, we can now go into anomaly detection. And with that prior knowledge, so to speak, we can start. Even if you hadn't known that, it wouldn't be that bad because it's still a reasonable assumption to say that you can reconstruct MNIST from roughly 50 things if those 50 things are chosen right. Okay, that's the basic idea here. So anomaly detection with Deep Learning uses the feature of Deep Learning that is unsupervised, which means there's no need for a response column. You don't need to know whether it's a seven or a five. You just learn the structure of MNIST, for example, or any other dataset.
You just learn the structure, and how do you learn the structure? There's a little picture here, if you look at the preview HTML, basically showing a small little neural network. It's not even deep. It's just a single hidden layer here. But that hidden layer has fewer neurons than the input. And the output is the same as the input in terms of number of neurons, which means MNIST comes in with 784. I compress it down to, let's say, 50. And I say, alright, somehow, out of these 50 numbers, I want to reconstruct my old 784. I want an identity function being trained here through this auto encoder. And there's a lot of things that can happen here. You can add noise, you can add some kind of constraints for sparsity and so on. That's a big field, but let's just say the goal here is to reconstruct MNIST digits by going through this narrow waist where we have to live with 50 numbers. We only get 50 numbers that are stored in those 50 neurons and those somehow have to make up the original information back to the 784.
And if you have that concept in your head, that's what you need to know about what an auto encoder is, okay? So any questions at this point? The goal is to learn a neural net that does nothing else but make the identity function. But it doesn't get to just say A equals A; it has to say it's a function of those 50 things, and those 50 things are just numbers that are derived from the original 784. So somehow they have to be compressed together and then we can decompress again. So decode after the encode. Yes?
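The shape of that bottleneck can be sketched in a few lines of plain Python. This is only a forward pass with random, untrained weights, so the reconstruction is meaningless; the point is the 784 to 50 to 784 flow. The helper names `dense` and `random_layer` are assumptions for this sketch, not H2O functions.

```python
import math
import random

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_out, n_in, rng):
    """Small random weights; a real model would learn these by training."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
n_input, n_hidden = 784, 50
enc_w, enc_b = random_layer(n_hidden, n_input, rng)      # encoder: 784 -> 50
dec_w, dec_b = random_layer(n_input, n_hidden, rng)      # decoder: 50 -> 784

pixels = [rng.random() for _ in range(n_input)]          # stand-in for one digit
code = dense(pixels, enc_w, enc_b, math.tanh)            # the 50 numbers
reconstruction = dense(code, dec_w, dec_b, lambda z: z)  # back to 784 numbers

print(len(pixels), len(code), len(reconstruction))
```

Training would adjust the two weight matrices so that `reconstruction` comes out as close as possible to `pixels` for typical digits.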
I just have an H2O question. So it says our <inaudible> is already in use, so how do I reset H2O?
The easiest way to reset H2O is to kill the cluster and start a new one.
Yeah, how do I do that?
You go to the desktop. Sorry, the question was how do I restart H2O, because there's something saying this frame is already in use and there's some locking thing going on. So there's a couple of ways to handle that. So that's the wrong window here. I need to be inside. On the desktop here, there's a kill all H2Os in red, so I can kill that too just to make sure there's nothing running right now. It's just a kill all.
Okay, so what we are going to do here is we are going to load the MNIST dataset again because it's so intuitive and people can follow it. And then we are going to try to find the ugly digits, okay? Outliers. Those that don't really conform to the dataset because they're rare, they're not seen that often. You don't see an ugly seven as often as a nice seven. That's the idea. So what we are actually going to do is we're going to train an auto encoder on the training data set, on the 60,000, and we're going to say this is normal data. All those digits are nice. That's what you should learn. And once the system learns these 60,000 digits, the structure of those, through this narrow 50 neuron thin neural net, after that we can give it the test data and say, reconstruct these using this auto encoder.
And whatever comes out at the other side is hopefully again the same as the test data set, except that of course we never saw any of those digits before. So it means if you're not able to reconstruct it, it might be something different. Let's say I give you 784 random numbers, it's not going to be able to make a nice digit out of it, right? Well actually it might make a nice digit out of it because it's all it knows, but it's going to be ugly obviously. Because the input is a little different, the output is going to be all garbage. So it doesn't make itself up because the model doesn't know how to behave when I give it 700 random numbers. And so the idea is if the data that comes in is wrong or different, then it doesn't know how to reconstruct it. It only knows how to reconstruct the nice digits, so to speak. So we give it the test data, we see that the reconstruction is totally different from what we gave it, and we compute the reconstruction mean squared error. And if it's big, we conclude that it's an outlier. That's just a simple model of outlier detection. Obviously you can do way more with K-means, PCA, auto models, classifiers, this and that, but this is just to show that auto encoders can be useful for some things. Yes?
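The scoring rule described here (per-row reconstruction mean squared error, flag the row if it is big) can be sketched as follows. The rows, the reconstructions, and the threshold are all made up for illustration; in the tutorial the reconstructions come from the trained auto encoder.

```python
def reconstruction_mse(original, reconstructed):
    """Mean squared error between one row and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

def flag_outliers(rows, reconstructions, threshold):
    """Return (index, error) for every row whose error exceeds `threshold`."""
    errors = [reconstruction_mse(o, r) for o, r in zip(rows, reconstructions)]
    return [(i, e) for i, e in enumerate(errors) if e > threshold]

# Toy rows and hypothetical reconstructions; the third row is reconstructed badly.
rows = [[0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.9, 0.1, 0.8, 0.2]]
recs = [[0.0, 1.0, 0.9, 0.1], [1.0, 0.9, 0.0, 0.1], [0.1, 0.9, 0.1, 0.9]]

flagged = flag_outliers(rows, recs, threshold=0.1)
print(flagged)
```

How to pick the threshold is a judgment call; the talk sidesteps it by simply ranking the test rows by error and looking at the extremes.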
Does it ever make sense to run an auto encoder with more than one hidden layer?
Yes, that's a good question. The question is, does it make sense to run an auto encoder with more than one hidden layer. And yes, usually what you do is you build a stacked auto encoder: you build one layer and then you freeze it, and then you build the next layer, and then you keep adding a new layer that you modify, that you learned <inaudible> for, but you freeze the old ones. And in this case, when we say here's the three layer auto encoder, it actually doesn't freeze any of them. It trains them all from scratch as if it was a neural net with 784 output classes, so to speak. They're just output numbers. So it's a regression problem with 784 targets at once. And that's not easy to train. So that's why typically a one layer auto encoder here is a little better, or converges faster, than if you have a two layer, three hidden layer model. In theory it should work if you give it good parameters, but this hasn't been solidified yet. The next step will be to make it step by step, layer by layer, and then it will be much better.
The question was, is there a heuristic for how many neurons to pick? What's the dimensionality for those inner layers that are smaller than the original input feature size? Well, no, not really. I mean, you can use PCA to get an idea of how many components, like typically what's the eigenspectrum of the whole data set, and say, all right, let's say fifty would be reasonable. But I'm not an expert in auto encoders yet, so I'll just show you that it basically behaves like an auto encoder and it's programmed to be an auto encoder, and the applications are manifold and this is just one application. So let's just see how well we can learn MNIST. And for this, we have to close this. Let's see, how do I close this? All right, so let's load H2O.
I killed my cluster outside, so it should be a fresh one. If you look at the log, it'll say one second old. Yeah. And this is Maxwell version eight, just started from H2O: four cores, four allowed cores. If I hadn't specified this minus one here, it would've been just two allowed cores. That's so that CRAN accepts the package as something that is testable on a system with other stuff. They don't want too many cores running at the same time. So we have to basically pass the nthreads argument now to make sure we're using the full machine.
So let's launch. Double launching doesn't hurt by the way. You can launch it 10 times and it doesn't do anything new. If it already exists, it just connects to it. All right, so let's load those data sets. By the way, when it imports a file, it does it in parallel over all nodes, over all cores, right? They're all busy loading the data at different offsets, different byte offsets. It cuts the file into pieces, and then you just read the characters. And as soon as it finds a newline, it knows from now on that's my line. So it's pretty smart. This whole parser is really fast and it's hard to find a faster parser for big data.
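The byte-offset idea can be illustrated with a small sketch. This is not H2O's parser; it is an assumed, simplified scheme in which a line belongs to the chunk that contains its first byte, so each worker skips ahead to the first line start at or after its offset and keeps reading until it has finished the last line that starts inside its chunk. The function names are hypothetical.

```python
def lines_in_chunk(data, start, end):
    """Return the lines whose first byte lies in [start, end)."""
    pos = start
    if start > 0:
        # Skip the tail of a line that began in an earlier chunk.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        lines.append(data[pos:newline])   # may run past `end` to finish the line
        pos = newline + 1
    return lines

def parse_in_chunks(data, chunk_size):
    """Cut the bytes at fixed offsets and parse every chunk independently."""
    chunks = [(s, min(s + chunk_size, len(data))) for s in range(0, len(data), chunk_size)]
    # Each chunk could go to a different core or node; here we simply loop.
    return [line for start, end in chunks for line in lines_in_chunk(data, start, end)]

data = b"5,0,0\n0,12,255\n4,0,7\n9,1,1\n"
print(parse_in_chunks(data, chunk_size=8))
```

Because no chunk needs to know what any other chunk found, the work parallelizes without coordination, and every line is still parsed exactly once regardless of where the cuts fall.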
All right? The predictors, again, are the first 784 columns and the response is 785. And because it's unsupervised, we just get rid of the response, okay? There's no way we can by accident use the response somehow, but it's just a number to us anyway. The auto encoder will not do much with it.
So now to find the outliers, as I said, we have to train an auto encoder first to learn the structure of this data. And the auto encoder is nothing but a Deep Learning model where the response is ignored and where the auto encoder flag is set to true. That's the only difference. Everything else is the same, except that a few options probably will not work. So it wouldn't make sense to say stop at this classification error or this accuracy of classification, right? That doesn't make sense because it's a regression problem in this sense. It will train and tell you an MSE as it trains. So it's almost like a regression. If you say classification equals true, it'll probably say it automatically ignored that and set it to false because it doesn't make sense to do classification. So let's see what happens here.
I'll say one hidden layer of 50 neurons with a tanh activation function. And I don't ignore constant columns. That's another option here, because MNIST has a bunch of zero pixel values on the outer corners that are always empty. There's nobody writing on the leftmost upper pixel, right? So it would just ignore those and say there's only 717 active features, but in this case, we want to use all of the 784 going in and going out so that we can plot it. We want to plot the digit as a 28-by-28 square.
Okay, the model is built, we can look at it real quick. It says it has a 0.02 training mean squared error. In this case it says infinity for the validation error because that really is the worst mean squared error you can have, just like a hundred percent classification error is the worst you can have. But that just means it wasn't computed. Just to be fast, we can do five epochs, it doesn't matter. Actually you can play with that. You can also make it five neurons or make it 30 or a hundred or make it two layers. You will see that this script is pretty cool; later we'll visualize the outcome of the model. And if you start playing with it, you'll see different outcomes. So it should be very intuitive to get a feel for this whole thing.
So now we are doing what I said earlier: we are taking the test set and we are making a reconstruction. And to do the reconstruction, the anomaly app, which is called H2O anomaly, will do this under the hood. It will basically take the data set, and if it has categorical values in it, categorical features in it, it will expand them with one-hot encoding.
It'll horizontalize them. So let's say you have a data set with a thousand columns, but there's a lot of categoricals in it. It might become a hundred thousand in the end. So this will be a hundred thousand neurons compressed to, say, a thousand, and going back to a hundred thousand. So you don't know what these hundred thousand values mean. They're just a bunch of ones and zeros for those categoricals, and then standardized values for numericals. All these numbers are meaningless to you, but the system knows that the original input features that came into the neural net, the hundred thousand expanded numbers, and the ones on the outside, on the outgoing layer, can be compared. And there is a mean squared error for those. And that mean squared error is the reconstruction error. And that's the one that we're actually asking the system to tell us, like how close is input and output, but they don't necessarily mean anything.
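A minimal sketch of that expansion, assuming rows are given as dictionaries and the levels of each categorical column are known up front. The column names and levels are made up, and the standardization of numeric columns mentioned above is left out to keep the sketch short.

```python
def one_hot_expand(rows, categorical_levels):
    """Expand each categorical column into one 0/1 column per level.

    `rows` is a list of dicts; `categorical_levels` maps a column name to its
    list of levels. Columns not listed are passed through as numeric values.
    """
    expanded = []
    for row in rows:
        out = []
        for name, value in row.items():
            if name in categorical_levels:
                out.extend(1.0 if value == level else 0.0
                           for level in categorical_levels[name])
            else:
                out.append(float(value))
        expanded.append(out)
    return expanded

# Hypothetical data: one categorical column with three levels, one numeric column.
rows = [{"pet": "cat", "income": 1000000}, {"pet": "mouse", "income": 50000}]
levels = {"pet": ["cat", "dog", "mouse"]}

for vector in one_hot_expand(rows, levels):
    print(vector)
```

Two input columns become four network inputs here; with many high-cardinality categoricals, a thousand columns can grow to a hundred thousand in the same way, and the reconstruction error is computed in that expanded space.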
If you give me cat, dog, mouse and an income of a million, I'm not going to tell you the difference is a fish, right? Doesn't make sense. So it all has to be numbers. So keep in mind that this anomaly reconstruction is in the inner representation of the neural net. That's why we call it H2O anomaly. And we can just say H2O predict or something and then make this reconstruction. There's a way that H2O predict actually does the reconstruction; we'll show that later. But for now, all we care about is what's the reconstruction error. And the anomaly app gets you the error. That's what you care about, the reconstruction error. Alright, so H2O anomaly, I can show you that alone before I put it into this data frame.
Actually let's just run it because it's the same thing anyway, I'll just pull it into R at the end. So the test reconstruction error, we can now look at, and there's a reconstruction error for these first six digits.
And they all seem to be roughly average, but the third digit was already a little worse than the first two. So now you wonder what happened there, what is that digit, right? And in order to see those, we actually want to extract, well actually there's two things we want to do. First you want to visualize it, but we also want to know what are those 50 numbers that you made up for me? Maybe you're interested in those features. Maybe you want to compress your data into features that are non-linear, not PCA, but non-linear PCA, right? Something like that.
You want to do a feature reduction of your dataset using Deep Learning. So this is just to show you here how you get those features back. This deep features tool says give me layer one. That's the index of the hidden layer you want, given a data set and the model, an auto encoder model. It basically just makes this whole reconstruction, but it throws away the reconstruction. Now you only care about layer one, which is the middle layer. You could also ask for layer two, where you will get the output, but in this case you get layer one.
That's the 50 features. If you look at the summary of those, you see DF, C1, all the way to C50, and DF is just the deep feature output. So these are 50 features now made by the auto encoder model that was trained on MNIST, whatever that means to you. You can say I want to build my GLM on that feature space, or a random forest or something, if you don't trust that Deep Learning will get the job done. But I don't know what this is good for other than it's the feature space that you have in there. You have many more applications that I'm not aware of. I'm sure that you can use this, right? If I needed to do a dimensionality reduction, I would use PCA or this. And I haven't done extensive studies to see what's the difference, which ones are more useful for modeling or not. But this is basically how you would get there. I would love to hear from you how you can use this.
So now we want to plot the actual reconstruction. We still don't have the reconstruction. What I did earlier was the reconstruction errors, right? With the anomaly app that gave us the error per point. What's the mean squared error of those inner neural network layer MSEs? But in order to visualize it, we actually need the reconstruction, the full 784 numbers at the end. So before we go to this actual plotting logic, let's just go over this fast because you don't need to know this, it's just a helper function. Let's now make this reconstruction itself. So I want to get 784 numbers back, and this is the predict function of this auto encoder model. Its prediction is the output that happens at the output layer. After giving it the test set, what comes out is the prediction, and we just overload the same function predict, right? The anomaly model here had also a predict function. No, actually not, it wasn't predict, it was called anomaly. Yeah, we had to come up with the right naming convention. So I guess the anomaly app does nothing else but create a frame. But the auto encoder can be given to this anomaly app or it can be used to predict, in which case it makes the reconstruction. I hope it's clear enough how the convention is. So let's say predict here is a meaningful name to actually make the 784 dimensions.
I hope you agree with me. And then summary of this will show 784 reconstructions, and it's called reconstr_something. So all these features are actually reconstructed versions of the original features and they're not necessarily from zero to 255, they're just whatever this network ended up doing. Hopefully it's close to the original data space, but of course it cannot be exact because it's only 50 neurons. It might not be enough to get everything. So now the little helper functions I sourced earlier are actually good enough to plot what's happening here with a simple call of this function. So we have the reconstruction, I think that came out, and we have the actual error which is plotted on top. So later when we see this, it should be fun to watch what happens. Just have to make this window a little bigger so you can actually see something. Apologies for the screen resolution. All right, so let's plot the first batch. Here we go. So now it plots the reconstruction. So the outcome of the H2O predict function on the auto encoder and the test set, these are the 784 numbers that are predicted by the auto encoder, and it also plots the reconstruction error on top. So the first one is 0.008, which is the smallest reconstruction error. So this one was easy to reconstruct, and it's not surprising to see that the top 25 numbers that are easy to reconstruct are ones, because there's not much you can do wrong with a straight line, right? You can imagine that that can be learned fast.
So now let's see what the original digits were. Instead of using the reconstruction, I used the test data set here, it says test. So this is the original, it's the same numbers. So these helper functions basically say, give me the reconstruction error, the list of 10,000 reconstruction errors, and then take the top 25 and find me like out of those 10,000, which are the top 25 smallest reconstruction errors. And find me those <inaudible> of those points and then visualize them. So these helper functions are pretty smart inside to just do that. But you can see that there's not much difference between this one and this one. And you can always go back and forth here.
So far so good. Now that we looked at the ones that were the good, we go to the bad, and then there will be the ugly, as you can imagine. So the bad, or the median reconstruction errors, are like this. These are the ones that are in the middle, around 5,000 out of 10,000 points. These are the median reconstruction errors and you would expect those to be reasonably easy to reconstruct but still have some errors in them. But this is kind of what you see in a regular population of digits, standard digits, right? These are all easy for you to recognize. Maybe the two on the right side is a little harder, but most of them are easy. And of course in your environment it looks different, right? Does anybody have anything else but ones in the first batch? Depending on what model you're building, you might get all fours there or all sevens. I've seen that before. But in this case it's all ones.
And this one should be roughly similar in your case, but of course you get different digits. And now the ugly, that's the last 25 reconstructions. So these are all pretty ugly. You can't tell what they are, right? They're really bad. Let's look at the real numbers. Oh yeah, nice. So you agree with me that this is not a nice three, that's not a nice eight, that's not a nice two, that's not a nice two either, right? These are not pretty. So this was automatically detected by the auto encoder model as something that just doesn't fit the normal data set structure of the training data. So if you're really good, you would actually take the training data, do this whole exercise, and then take the good ones out of the training data and say that's the actual model.
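The selection logic behind "the good, the bad and the ugly" can be sketched like this. It is a guess at what such a helper might do, not the tutorial's actual R helper: sort the row indices by reconstruction error and take a batch from the bottom, the middle, and the top. The tutorial uses batches of 25 out of 10,000 test rows; the tiny error list below is made up.

```python
def good_bad_ugly(errors, batch=25):
    """Return row indices with the smallest, median, and largest errors."""
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    mid = (len(order) - batch) // 2
    return order[:batch], order[mid:mid + batch], order[-batch:]

# Hypothetical per-row reconstruction errors for five test rows.
errors = [0.5, 0.1, 0.9, 0.3, 0.7]
good, bad, ugly = good_bad_ugly(errors, batch=1)
print("good:", good, "bad:", bad, "ugly:", ugly)
```

The returned indices can then be used to pull either the reconstructed rows or the original test rows for plotting, which is how the talk flips between the two views.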
Like train another model on only the really good ones, because right now we learned that the training data is good and now we look at the test set and see what's different. So all we identified is digits that are different between test and train, but not really what's an ugly digit, right? We just saw what's different, what's not in the training data. If the training data was all ugly digits, you would then say, well, they're all fine. It's all ugly, so to speak. So ideally, if you really want a sophisticated model, you would have to find the good stuff first in the training data, because you can self-limit and say just take those that actually match the model. The training points that already have a lower reconstruction error are like the nice data points, and you can iterate until you have a model that really represents its own data really well. And you could train, for example, on network intrusion or something. You can say this is the stuff that happens all the time and you really just learn the good stuff. And as soon as there's something that's a little different, you get the alarm bell ringing.
So we would be very interested in seeing more about this, but I think that concludes this little demo. Does anybody have not ugly digits in their last ugly piece? It should have worked for everybody, but everybody's model should be different. Okay, great.
So now we can move ahead to the advanced section. I guess this was part of it. Let's close this and open up the file that's called tutorials advanced. And then we have three models here. Let's do the features first because that's the easiest one. So let's look at the HTML version real quick to get an overview. This is doing a little bit of feature engineering on the adult data set. So who has done feature engineering in their lives before?
Feature mining, feature engineering. So you obviously need to sometimes change your data. You can't just take it as it is. You have to make a couple of modifications before you can model. And H2O is a really good platform to do in-memory transformations, whether it's the log function or a conversion from integers to factors or whether it's a sum of two other numbers or a thresholding or something like that. You can do that kind of stuff easily. You can say if this number is more than this and this it becomes five, it becomes 17 or something, whatever you want to do. But this just shows you how to deal with a bunch of column names in string form like here. You can give the data some names, you can do a summary, you can say that my response is the income, and then we can do an actual loop over model building.
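As a rough illustration of the kinds of transformations listed here (log, integer to factor, sum of two columns, thresholding), the sketch below applies them to one row held as a Python dictionary. The column names and the cut-off of 40 hours are assumptions made for the example; in the tutorial these operations run on H2O frames from R.

```python
import math

def add_features(row):
    """Return a copy of `row` with a few engineered columns added."""
    out = dict(row)
    out["log_capital_gain"] = math.log1p(row["capital_gain"])             # log transform
    out["age_group"] = str(row["age"] // 10 * 10)                          # integer -> factor label
    out["capital_activity"] = row["capital_gain"] + row["capital_loss"]   # sum of two columns
    out["long_hours"] = 1 if row["hours_per_week"] > 40 else 0             # thresholding
    return out

# Hypothetical row with made-up values.
row = {"age": 37, "capital_gain": 0, "capital_loss": 0, "hours_per_week": 45}
print(add_features(row))
```

Each new column is a candidate feature: add it, rebuild the models, and compare against the baseline.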
So we can then say, now that we have this data set, let's build a model and see what happens. And then change the data set and again build a model and see what happens. And now the model is not just a simple model, but it's actually a whole list of models. So this is a little helper function here that says, cut my data into three pieces with a random uniform distribution that is generated first. So you make a vector of random numbers from zero to one, and then you say, if the number is more than 0.8, it's this one; if it's less than 0.8, it's this one. That, for example, would be a split into two pieces. And here we do it into three pieces. So we say less than 0.8, then more than 0.8 and less than 0.9, and then more than 0.9, right? You have three pieces, 80%, 10%, 10%, and they're all random rows out of a data set. So it cuts it up into three sets. One is training, one is validation, and one is testing. Then it sets the predictors: it basically takes all the column names of the frame and then removes the response.
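The split described above can be sketched in plain Python as follows. The function name, the seed, and the column names are assumptions for the example; the thresholds 0.8 and 0.9 are the ones from the talk.

```python
import random

def split_three_ways(rows, seed=42):
    """Assign each row to train/valid/test by one uniform random draw."""
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for row in rows:
        u = rng.random()          # uniform in [0, 1)
        if u < 0.8:
            train.append(row)     # about 80%
        elif u < 0.9:
            valid.append(row)     # about 10%
        else:
            test.append(row)      # about 10%
    return train, valid, test

train, valid, test = split_three_ways(list(range(10000)))
print(len(train), len(valid), len(test))

# Predictors are all column names except the response (hypothetical names).
columns = ["age", "education", "capital_gain", "income"]
response = "income"
predictors = [c for c in columns if c != response]
```

Because every row gets its own random draw, the three sizes are only approximately 80/10/10, and a different seed gives a different split.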
So there's a way in R to do this with the formula notation with <inaudible> and all that. But this is just our explicit way of specifying the columns X and Y. And we are doing this here we are saying data is my little helper list that contains some arguments and that is just a little helper function that Spencer and I wrote last weekend. So don't take this for something that it's not. This is more like a little helper function. It's actually sourced here. This is the actual code, the binary classification helper. We can look at that later. But all it does is give you basically the H2O fit and H2O leaderboard functions. And this is only for binomial models. So for classification with two yes or no or zero one. And what it does is it says give me data x y train validation and how many folds you want.
In this case it actually computes an n fold. It makes a grid search, for example, depending on the parameters later. It makes an n fold model with grid search, if you specify it, and tells you the n fold accuracy, right? It scores models based on the n fold cross validation, then scores them on the validation data set also. And at the very end it also scores the winning one on the test set. So you can have your whole pipeline optimized for some test set if you want. Of course you don't have to do it this way, this is just to show you that you can do these kinds of things. Basically it's an R tutorial where you would do more than just one model.
So now the question is what does this H2O fit do? Well, it takes the name of a model that you want to build. In this case it's a GLM and a GBM. It takes the data, which is x, y, training, validation and how many folds, and it takes a bunch of parameters. The parameters are also just a list. You can give any list that would fit a GLM here and pass that in and that will be it basically. But you're not really passing it here, it's just set as a global variable in R. So it's ugly, it should be a nicer function and all that, but just bear with me.
You'll see later how it works. And once we then specify what these parameters are for GLM and GBM in this case like this, we can just run the train models function here on the data. It will split it up into three, run GLM and GBM with those parameters and then score it using the H2O leaderboard function that was also specified in this helper code here. So let's see how this all looks in reality. So now we are going to the features code. We are still connected to the old machine, which is fine. MNIST doesn't bother adult and vice versa. So we are loading the data.
We assigned column names to what they actually are because the adult data set here didn't have column names. You do a summary and you see there's a lot of stuff here with categoricals and integers and real values. There might be some reals somewhere, not sure. No, it's all integers and categoricals. Okay, fine. This is a real data set with some missing values and some integers and some categoricals. So now let's make a summary of the income column because income will be our response, right? And income, as you can see, is either 50,000 or more, or less, basically. And there's 37,000 that have less than 50,000 income and 11,000 that have more than 50,000. So slightly imbalanced, but you can say not too bad. That's just the data set. So now we are going to get these magic helpers in and set the n folds to two in the interest of time here, otherwise it takes too long to run in this virtual image. And then we load this function called train models. So now R knows what my function is.
And from now on all we need to do is specify those parameters, and then we can change the parameters, we can change the data set and we can just keep calling this best model equals train models. And you'll always see what's the best model so far given your new data set, right, where you added a feature or you changed a feature or you removed a feature. So this is basically how the workflow will look. So let's set these parameters and just run it. It ran the GLM and now it's fitting the GBM. And here we have the output. Let's make this bigger.
N Fold Cross Validation
So the output said we have a GLM that has a 0.9 AUC on the training data and a 0.9 AUC on the validation data. So it looks like it doesn't overfit, it's just roughly the same, but no way, that cannot be right. It's a small data set. So what happened here was N fold cross-validation. Remember, we had twofold. So this is actually not the training error, this is the N fold cross-validated training error because you specified N fold. So that's why these numbers are similar. So it's easy to fall into these traps, but you have to know that because you specified N fold, the training error that's reported is actually, in this case, that number, because I wrote the functions H2O fit and H2O leaderboard to do it this way. The model itself still reports training error as the actual training error on the training data and validation error on the N fold. Okay? I hope that's not too confusing. And I also have a test set, so at the end it gets scored on the test set here. So when I call it train AUC here, I made that convention to say that's the N fold training error.
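Why it matters which number gets labeled "training" can be shown with a deliberately extreme toy case. The sketch below swaps in a 1-nearest-neighbour classifier and accuracy (instead of the GLM/GBM and AUC used in the talk) because a memorizing model makes the gap obvious; the data points are made up.

```python
def nearest_label(x, training):
    """1-nearest-neighbour prediction on a single numeric feature."""
    return min(training, key=lambda point: abs(point[0] - x))[1]

def accuracy(model_data, scored_data):
    """Fraction of `scored_data` predicted correctly from `model_data`."""
    hits = sum(nearest_label(x, model_data) == y for x, y in scored_data)
    return hits / len(scored_data)

def cross_validated_accuracy(data, n_folds):
    """Score every row with a model that did not see that row's fold."""
    hits = 0
    for fold in range(n_folds):
        held_out = data[fold::n_folds]
        rest = [p for i, p in enumerate(data) if i % n_folds != fold]
        hits += sum(nearest_label(x, rest) == y for x, y in held_out)
    return hits / len(data)

# Made-up (feature, label) pairs whose labels are mostly noise.
data = [(0.0, 0), (1.1, 0), (2.0, 1), (3.1, 0), (4.0, 1), (5.1, 1), (6.0, 0), (7.1, 1)]

resubstitution = accuracy(data, data)                 # scored on its own training rows
cross_validated = cross_validated_accuracy(data, 2)   # two folds, as in the talk
print(resubstitution, cross_validated)
```

The score on the rows the model was fit on looks perfect, while the two-fold cross-validated score on the same rows is far lower. A cross-validated number printed under a "train" label will therefore sit close to the validation number, which is the effect described above.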
But that's just because that's my function that does it this way. I chose to print it this way, right? I could have said N fold training error or something. And I think this whole function only works if N fold is actually enabled. We didn't make it generic enough to work for all kinds of stuff, so you don't need grip search, but you need N fold. So you can see what's important here. The important features are intercept of course, for a GLM it's just a basic offset fine then capital gain, marital status, married on whatever that means. And then the country being Columbia and being married or being a spouse, something that, and the same for GBM, having education, having age, all this stuff matters for your salary in the end, right? So it always seems plausible. Obviously we don't have anything else in this data set anyway, so they all matter. Alright fine. So we have a 0.9 roughly, and GBM and GLM are the same. So this is our baseline doing nothing on a data set.
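The difference between the actual training error and the N-fold cross-validated error mentioned above can be seen on a toy example. The sketch below is a stdlib-only Python illustration with made-up data and a deliberately simple 1-nearest-neighbor model (an assumption chosen because it memorizes its training data); it is not how H2O computes these numbers.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict_1nn(train, x):
    """Return the label of the training point whose x is closest."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def cross_validated_scores(xs, ys, nfolds):
    """Score each row with a model that never saw that row's fold."""
    scores = [None] * len(xs)
    for fold in range(nfolds):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i % nfolds != fold]
        for i, x in enumerate(xs):
            if i % nfolds == fold:
                scores[i] = predict_1nn(train, x)
    return scores


# Made-up toy data, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

# Training AUC: the model is scored on the rows it was fit on.
train_scores = [predict_1nn(list(zip(xs, ys)), x) for x in xs]
train_auc = auc(ys, train_scores)

# Cross-validated AUC: every row is scored by a model fit without it.
cv_auc = auc(ys, cross_validated_scores(xs, ys, nfolds=2))
print(train_auc, cv_auc)
```

On this toy data the memorizing model looks perfect on its own training rows and much worse out of fold, which is why it matters which of the two numbers a helper chooses to print.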
Now let's just do some super simple feature engineering. So we'll need this function called append, which does nothing else but append a column to a frame with cbind and then store that under the key appended frame. It just keeps overwriting appended frame each time and returns that new frame. So it's just a little helper function, you don't have to use it, but I wanted to write to a specific name here so that I can clean up the KV store from time to time and delete all the Last.value objects. Those are the automatically generated temporaries that are made in R, and we want to get rid of them from time to time. Otherwise you get too many of these. And when you delete them, you might accidentally delete the stuff that came out of a cbind because those are also temporaries. So when you want to keep the stuff around that comes out of a cbind, you need to call h2o.assign, which is basically saying give it this name, and this name should be something that you don't delete later with your h2o.rm function, the remove function.
If it doesn't make sense, then don't worry. We'll get back to that. But basically from time to time we like to clean up all the temporaries, and this regular cbind makes temporaries. So the returned frame has a name like Last.value.17. And then when you later delete all the Last.value objects with a regular expression match, it's gone. Right? And that's why we call it appended frame in this explicit assign call.
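The cleanup problem described above can be mimicked with an ordinary dictionary standing in for the KV store. This is a plain-Python sketch, not H2O code: the `store` dictionary, the helper name `remove_temporaries`, and the exact key pattern are assumptions based on the naming mentioned in the talk.

```python
import re


def remove_temporaries(store, pattern=r"^Last\.value\.\d+$"):
    """Delete every key that looks like an auto-generated temporary."""
    doomed = [key for key in store if re.match(pattern, key)]
    for key in doomed:
        del store[key]
    return doomed


# A dictionary as a stand-in for the key-value store.
store = {
    "Last.value.16": "some intermediate result",
    "Last.value.17": "result of a column bind",
}

# Saving the result under an explicit name protects it from the cleanup.
store["appended_frame"] = store["Last.value.17"]

removed = remove_temporaries(store)
print(removed, list(store))
```

Only the explicitly named entry survives, which is the reason for assigning the appended frame to a fixed key before deleting temporaries.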
But that's just a formality to be able to clean up. I could have made the demo without cleaning up and it would've looked nicer, but then later in production you would've run into problems because your KV store would blow up. So it's a balance between making it look nice and making it work. So now we appended as.factor of the age. So the age is no longer a number. Now it's actually a factor. So 18 and 19 are not just one a little bigger than the other, but they're actually different things. So the column names now have another age here, but this time it has a zero appended to it because it automatically starts numbering them. I didn't say what the name of this new column should be. I just said append, which does a cbind inside. And there was already a column called age. So now we have age again and it renamed it to make it work. So the summary says age zero now has 1,348 people of age 36. That's the mode here. That's the most common age.
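What it means for age to become a factor can be sketched as follows: each distinct age becomes its own level, and a linear model then gets one indicator column per level instead of a single numeric column. This is a stdlib-only Python illustration with made-up ages; the helper name `one_hot` is hypothetical, and the actual encoding inside H2O may differ in detail.

```python
from collections import Counter


def one_hot(values):
    """Return the sorted levels and one indicator row per value."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Made-up ages, for illustration only.
ages = [18, 19, 36, 36, 45]
levels, rows = one_hot(ages)

# The mode of a factor is simply its most frequent level.
mode_level, mode_count = Counter(ages).most_common(1)[0]
print(levels, rows, mode_level, mode_count)
```

With this encoding, 18 and 19 are separate columns rather than neighboring numbers, so each can receive its own coefficient.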
Okay? And before, the old age column here at the beginning, where is it? Age here had minimum 17, maximum 90, and a bunch of people in the middle. The mean and median are around 38. That makes sense. All right, fine. Now all we need to do is call this train models again one time, and that will run the cross-validated GLM and GBM on this new data set. It does the threefold splits first and does it all in the background. And I will see what the numbers are. So we'll see here that GBM is still a 90 point something and GLM got a 90.9. It's almost 91 now. It's a little better. Before, let me go look. I think I printed it here. It was a 90.2, now it's a 90.9, so it jumped a little bit, 0.7.
And the test set is also 91 now. So you can say, okay, fine, we got from 90 something to 91, maybe. It's all noise of course, because the data sets are small. So GLM has a benefit from that. Let's go look at what the top important features are for GLM here. GLM is the upper model, and these are the features that matter the most. Okay? Now that we gave age the factor treatment, it became a factor, which means age 18 and 20 and 19 actually matter. And it looks like they're not making as much money as others because their coefficients are actually negative. We can go look at that by looking at the normalized coefficients of GLM and sorting them.
These are the most important ones, and these are the ones contributing least to a high salary. And the other ones are contributing the most to a high salary. So these are positively correlated: capital gain, whether you're married or not, whether you are a wife, and whether you're from Hungary, which seems to be a dataset-specific thing here. Maybe this was sampled in Hungary, you don't know, right? But basically this is what it says. This is how you get to be making more money, and to lose money, or rather to not make as much, you have to be 18, 19, 20 or 21, right? And that also makes sense. So this is something that we only learned after we made age into a factor, because before you said it's a linear fit to age: the more age the better, or the less age the better.
Neither, right? The best is when you're 45 or something, or maybe 60, I don't know what the best is, but it's not necessarily a linear relationship. But now it became better because you have all these little factors that you can make a linear dependency on. So that's something that GLM now benefited from. GBM already was able to cut the space up earlier because it can already cut up and say here, less than 20, more than 18 or whatever. So GBM shouldn't have benefited from it, and it didn't, but GBM is not necessarily a good model yet because we didn't give it enough trees. And we'll see later that when we give it more trees, it also gets better. So now let's focus again on what we can do with GLM by giving it more factors. So let's do the same thing with hours per week and capital gain.
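The point about age not having a linear relationship with income can be illustrated with a toy comparison. The sketch below uses made-up data and ordinary least squares rather than the logistic GLM from the talk (an assumption made to keep it stdlib-only): it fits one straight line through age, then fits one mean per age level, and compares the squared errors.

```python
def sse(ys, preds):
    """Sum of squared errors between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


def linear_fit(xs, ys):
    """Least-squares straight line through a numeric feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x for x in xs]


def factor_fit(xs, ys):
    """One fitted value per level, as with indicator columns."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    means = {level: sum(v) / len(v) for level, v in groups.items()}
    return [means[x] for x in xs]


# Made-up data: the middle age group does best, the youngest worst.
ages = [20, 20, 40, 40, 60, 60]
high_income = [0, 0, 1, 1, 0, 1]

sse_linear = sse(high_income, linear_fit(ages, high_income))
sse_factor = sse(high_income, factor_fit(ages, high_income))
print(sse_linear, sse_factor)
```

The per-level fit can follow the peak in the middle, while the straight line can only tilt one way, which matches the improvement seen for GLM. A tree model can already carve age into ranges on its own, so it has less to gain from this change.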
And obviously we can run it and we'll just easily get another model that now has three more columns appended to the data set. And we get up to, what is it, 92 now? 92.6 or almost 92.7 on the training N-fold. And this is validation, 92.8, and that's the test set, 93. They're all fairly close. So this is a 93 model, and for those of you who know the adult data set, 93 is about as good as it gets on this small data set. So this is the good model, right? And all we had to do was call as.factor on a couple of these columns. So let's give GBM a better shot. We just change GBM's parameters and run it again. And now we have an interaction depth of 10 instead of five, which I think is the default, and 50 trees instead of just 10 or 20. So that helps. Now it can make up those interactions itself. And the answer is already written here. Both algos will now have a 92.8 on the validation data set, which means we basically milked everything out of GLM by giving it those extra features. And just to show that you can do other functions like log or cosine or whatever you want to do, you can also do it this way, right? You can just say I'll replace my capital gain with the log of one plus capital gain.
And now the data set has, oh, actually I didn't replace it. I assigned it to a new column, which was then added because there was one with that name already. So it made a new one. If you wanted to replace it, you would have to explicitly call it with the column selection operator to say this one, replace it. So if I build another model, the training AUC will go up a little because a log is better than the raw number for money, because money tends to compound itself with interest, right? So it gets more and more. The rich get richer, the poor get poorer. So this is where the log helps to bring the rich back to the normal people. Otherwise you make a linear model and it says, yeah, they're just a little bit richer, but in fact they're actually a lot richer. And now you see here the AUC on training is .92695.
And before it ended in 690. So it's a tiny little improvement, but you can see that log transforms can help. Obviously you know all that already if you're a data scientist.
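The log transform and the append-versus-replace behavior described above can be sketched together. This is a plain-Python illustration with a dictionary standing in for the frame; the helper `append_column`, its numeric-suffix renaming rule, and the sample values are assumptions for illustration, not the H2O implementation.

```python
import math


def append_column(frame, name, values):
    """Add a column, renaming it with a numeric suffix on a name clash."""
    new_name, suffix = name, 0
    while new_name in frame:
        new_name = f"{name}{suffix}"
        suffix += 1
    frame[new_name] = values
    return new_name


# Made-up capital gains spanning several orders of magnitude.
frame = {"capital_gain": [0, 99, 9999, 99999]}

# log(1 + x) keeps zero at zero and compresses the very large values.
logged = [math.log1p(v) for v in frame["capital_gain"]]

# Appending under an existing name creates a new, renamed column.
added_name = append_column(frame, "capital_gain", logged)

# Replacing requires selecting the existing column explicitly.
replaced = dict(frame)
replaced["capital_gain"] = logged

raw_ratio = frame["capital_gain"][3] / frame["capital_gain"][1]
log_ratio = logged[3] / logged[1]
print(added_name, raw_ratio, log_ratio)
```

On the raw scale the largest value is about a thousand times the second one, while on the log scale the ratio is small, so a linear model is no longer dominated by a few very large gains.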
Yeah, I think here what I wanted to do was one more run with Deep Learning, and I just basically took the body out of this fit function and put it out there myself. I just say here, make a split again and run Deep Learning with default options. Just regular Deep Learning with nothing. The list is just empty here. I just give it the data, x, y, training, validation, nfolds. But I give nothing else in this list. It's all empty. And when I run this we'll see what Deep Learning does. Basically my models object here is now a vector that contains only one object, which is the result of h2o.fit, and h2o.fit was only done on Deep Learning here with no special parameters, just all defaults. It just shows you how flexible this framework is. You can do anything you want by just calling something. So this is now the conclusion of the feature demo. There will be one more on the Higgs data set, depending on how much time we have. It shouldn't be too long because it's basically using the same logic here, just looping over GLM, GBM, Random Forest and Deep Learning and showing you a little bit about what's happening. So this was 92.6 with Deep Learning. So it's reasonable, it's not necessarily the best, but it's reasonable.
There might have been some error somewhere here in the script, so please forgive me. But it shows that this is a reasonably flexible framework. You can just script it in R and do all kinds of stuff. So let's go look at the Higgs, open the file. How much? 25, 30. Actually, can we take a break for some questions? Mm-hmm, yes. Okay, good. We'll get onto the Higgs. Okay, great. So we'll have a few minutes' break to discuss any questions that you might have before we go on to Higgs. The Higgs will be in the advanced section as well. And then here, Higgs, I'll just open that up for now. But we won't start with it until we have resolved all the questions there might be.