# Feature Engineering with H2O - Dmitry Larko, Senior Data Scientist, H2O.ai

In this talk, Dmitry shares the approach to feature engineering that he has used successfully in various Kaggle competitions. He covers common techniques used to convert your features into a numeric representation that ML algorithms can consume.

# Talking Points:

**Speakers:**

Dmitry Larko, Data Scientist, H2O.ai


**Dmitry Larko:**

Thank you. This is going to be, I would say, a first talk maybe out of many, at least the first of two. I'm going to give an introduction to feature engineering. The next talk will be about some advanced feature engineering, but as she mentioned, I do Kaggle for a living. They also force me to build a product at H2O, and I do my best in both fields. I spent the last five years of my life on Kaggle, competing in different competitions. Usually not that good, as you can see, but not bad. Again, the topic of this talk is feature engineering, and basically why I think it's very important. A lot of people across the domain, across the machine learning community, a lot of people who are well known in machine learning, agree on one thing: that feature engineering is extremely important.

### Feature Engineering

Basically, what exactly do we mean by feature engineering? The very simple explanation you can see at the bottom: it's how you transform your input so a machine learning algorithm can actually consume it and build good predictions. That's the easiest and simplest explanation of what feature engineering means that I was able to find. Basically, this slide is some sort of motivation. Say we have a 2D space, we have red and blue points, and we would like to build a linear classifier that is able to classify them correctly. It's not actually possible to build this linear classifier given the data. There is no way you can split them into separate classes with just a line. But if you transform the Cartesian coordinates into polar ones, they immediately become very easy to separate with just a single line. Of course, you apply a quite complex transformation. You can think of this transformation as feature engineering, because you just engineered a feature for the model. It allows you to build a simpler, more lightweight model compared to trying to fit a random forest to it.
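To make the idea concrete, here is a minimal sketch of the Cartesian-to-polar transformation; the two concentric rings are made-up data for illustration, not from the talk's slide:

```python
import numpy as np

def to_polar(x, y):
    """Convert Cartesian coordinates to polar (radius, angle)."""
    r = np.sqrt(x**2 + y**2)
    theta = np.arctan2(y, x)
    return r, theta

# Two concentric rings: not separable by a line in (x, y),
# but separable by a simple threshold on r after the transform.
angles = np.linspace(0, 2 * np.pi, 100, endpoint=False)
inner = to_polar(np.cos(angles), np.sin(angles))          # "blue" class, r = 1
outer = to_polar(3 * np.cos(angles), 3 * np.sin(angles))  # "red" class,  r = 3
```

After the transform, any threshold on the radius between 1 and 3 separates the classes perfectly, which is exactly the simpler, lighter model the talk describes.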

### Machine Learning Workflow

A typical machine learning workflow might look like this. You have a data integration step. The next step is data quality and transformation. After the transformation and data quality check, you have a table you can feed into your machine learning algorithm. Each row of this table represents one single event, and each row has a target you would like to try to predict. Basically, that's exactly the place where feature engineering can happen. Of course, you can argue and say this part is a part of data engineering as well. For this talk, we are concerned with this part only. This one is much more complex. I would say it's a vast area for future research, because there is not much research done in this area: how to combine the available structured data to get a good data set for prediction.

### What is Not Feature Engineering

What is not feature engineering, from my point of view? As I mentioned, initial data collection is not feature engineering, nor is the creation of the target variable, because that's something that should be business driven. You have to have some business need to predict something, to fulfill some business goal. Removing duplicates, handling missing values, fixing mislabeled classes: that's more like data cleaning. Of course, each machine learning model should be more or less stable if you leave in duplicates, and likewise missing values, but that's not the goal of feature engineering; it's more like data cleaning. Scaling and normalization are not feature engineering by themselves. They're more like data preparation for specific models. For example, for neural nets you have to scale your data, or the training won't work as you expect it to. Feature selection is not feature engineering per se, but I'm going to mention it in this talk in a couple of places. Basically, that's the feature engineering cycle: you have a data set, you have a hypothesis to test, you validate this hypothesis by applying it, and you create new features based on the existing ones. You repeat this process over and over again in pursuit of building a better model.

### Feature Engineering Cycle

What's the source of your hypothesis? Well, obviously, if you are a domain expert, that's a significant source of your knowledge. That's how you can build different features out of your old features. If you don't have domain knowledge, of course, you can use your prior experience based on the nature of the data. Is it a numerical field? Is it categorical? How many categorical levels do you have, how are your numeric features distributed, et cetera, et cetera. That's something exploratory data analysis can help you with. And of course, my favorite part: you can use specific machine learning models, and by analyzing the model itself, you can get some insight into how the data is structured and what kind of feature engineering transformations you can use to get a better model.

Feature engineering is a hard problem, especially if you try to apply a powerful feature transformation like target encoding. In that type of transformation, you specifically try to encode your categories using information about the target. That's something that can actually introduce a leakage into your data, into your model. The model will be very well fitted to the training data, but it'll be completely useless in real-life usage.

Again, domain knowledge is extremely important, especially if you have specific knowledge about the nature of the data. For example, at Chevron, if you analyze the well data, how the well was actually drilled, it's a physical process. There are a lot of physical processes happening inside, which can be expressed using formulas. That's knowledge you can put inside your model as well. Of course, it's time consuming, especially if you have a lot of data, because you have to run your model against it. You have to test how good the feature is. To do that, you can end up running a thousand experiments, especially if you rely on EDA or previous experience.

Again, as I mentioned, simple models give you better results. Ideally, it would be nice to find some golden features and just fit a linear model on top of them. That would be the best possible scenario, because I always prefer simplicity over a complex model. Of course, in real-life scenarios, that's never the case. You still have to apply a quite complex model, like a random forest or gradient boosting or neural nets, to get some results. Still, good features can help models converge faster. We can discuss three key components of that: target transformation, feature encoding and feature extraction.

### Target Transformation

Target transformation, that's something you can use to transform your target variable. That's especially useful for regression problems. Say your target is not normally distributed, it has a skewed distribution; in that case you can apply some transformations to make the distribution of the target more like a normal shape, like a bell curve. For example, the log transform has proved to be very good in a few Kaggle competitions. There was a competition for Liberty Mutual: Property Inspection Prediction, where you try to predict the outcome of a property inspection. On the X-axis you see different XGBoost runs with random parameters; on the Y-axis you see the score, the normalized Gini in that case, and you can see how the models actually vary. The green line is the model trained on the log-transformed target, and its variation is less compared to the previous model. The standard deviation of these results will be much smaller, which is actually good. That means your model is more stable, even though in some cases other models outperform it. But actually, in that case, we are just looking for stability. Stability is usually better than just the best score at some single given point.
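A minimal sketch of the log transform for a skewed regression target; the tiny array is made-up data. Using `log1p`/`expm1` (log of one plus the value, and its inverse) is a common convention that also handles zeros safely:

```python
import numpy as np

# A target with a heavy right tail
y = np.array([1.0, 2.0, 5.0, 10.0, 200.0])

y_log = np.log1p(y)        # train the model on log(1 + y)
y_back = np.expm1(y_log)   # invert the transform on the model's predictions
```

The model is fit against `y_log`, and its predictions are mapped back with `expm1` before scoring on the original scale.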

### Feature Encoding

Feature encoding, again, is one of the interesting topics to discuss: how you can encode your categorical features, because most machine learning algorithms actually expect you to provide numerical data, numbers basically. Your category is a gender, for example, or a color; that's something, but not a number. Obviously, the easiest way to do that is One Hot Encoding, which you are probably all familiar with. You also can do Label Encoding. That's a very simple technique: you just replace your category with some integer number, but that might lead to a misinterpretation, because in that case you introduce some order into your data, and in most cases, let's say for color, there is no order in colors. Especially an order based on some randomly assigned integers. One Hot Encoding is a good idea, of course, but in some cases, when you have a lot of levels in your categorical data, the data becomes too huge. As an example of label encoding: you have three categories, say A, B and C, you just map each category to some specific integer number and replace it with this number in your data set.

For One Hot Encoding, as an example, you again have a specific level, but you have to create three different columns to represent these three levels. There is one advantage of One Hot Encoding compared to Label Encoding: if you have a new category, it'll be all zeros, basically. Only your bias is left, let's say, if you fit a linear model to it.
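Both encodings can be sketched in plain Python (libraries like scikit-learn provide the same thing); the four-value column is made-up data:

```python
# Label encoding vs one-hot encoding for a small categorical column
values = ["A", "B", "C", "A"]
levels = sorted(set(values))            # ["A", "B", "C"]

# Label encoding: each level mapped to an integer (implies an order!)
label_encoded = [levels.index(v) for v in values]

def one_hot(v):
    # An unseen category becomes all zeros, as mentioned above.
    return [1 if v == level else 0 for level in levels]

one_hot_encoded = [one_hot(v) for v in values]
```

Note how `one_hot("D")`, a level never seen in training, yields `[0, 0, 0]`, which a linear model handles gracefully via its bias term, while label encoding has no natural value for it.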

Also, you can do Frequency Encoding: you encode your category using its frequency. Basically, you calculate how many times you see this category in your data set and just normalize it, divide by the total number of rows you have in your data set, to get the frequency. Basically, you can think of it as the probability of meeting this particular category in your data. The way it can highlight the less frequent categories is pretty good. You have category C just two times in the data set; that means it has a very low frequency compared to the rest of the categories. The disadvantage of that approach: if you have categories with the same frequency in the data, your machine learning model won't be able to distinguish them. Let's say you have two B's and two C's; they both have the same frequency.

**Audience Member:**

Don't you also introduce an order?

**Dmitry Larko:**

Yes, I also introduce an order, but in that case the order actually means something, because it's a frequency. Basically, the order here is actually a good thing, because I usually use this technique to fit tree-based ensembles, and tree-based ensembles are usually looking for the best splitting point. Basically, I'm looking for the right order in all this, so my model will be able to split them easily.

### Target Mean Encoding

The next idea and approach: you can actually use Target Mean Encoding. Given the outcome you would like to predict and the feature you would like to encode, you just replace each category with the mean of the outcome. In the case of A, we have four A's here, and three out of four are ones, so it'll be a .75 probability that, given feature A, the outcome will be 1. In the case of B, it'll be .66. In the case of C, both outcomes are ones, so the encoding for category C will be always zero, oh sorry, always one. Seems to be a very good approach. But you immediately see: say you have a less frequent category; the information about it will not be very reliable. You just have two examples, which is not statistically significant. Even though both of them are actually ones, that doesn't tell you much, because, well, so what, it can easily be by chance.
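The plain (unsmoothed) target mean encoding can be sketched like this, using the same made-up nine-row example (four A's with three ones, three B's with two ones, two C's with two ones):

```python
from collections import defaultdict

feature = ["A", "A", "A", "A", "B", "B", "B", "C", "C"]
target  = [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ]

sums, counts = defaultdict(float), defaultdict(int)
for f, t in zip(feature, target):
    sums[f] += t
    counts[f] += 1

# Each level replaced by the mean of the target for that level.
level_mean = {f: sums[f] / counts[f] for f in counts}
encoded = [level_mean[f] for f in feature]
```

This reproduces the numbers from the talk: A encodes to 0.75, B to 0.66..., and the rare level C to 1.0, which is exactly the unreliable case the smoothing below is meant to fix.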

To deal with that, instead of just encoding by the mean, we introduce a weighted average between the mean of the level of our category and the overall mean of the dataset. We also have a function which depends on how often you see this level in your dataset. The bigger the count, the bigger the lambda weight will be here. That means your average will rely more on the mean of the level, not the mean of the data set. It's more like a smoothing approach, and usually a sigmoid-like function is a very good way to model this kind of idea, for example lambda(n) = 1 / (1 + exp(-(n - k) / f)). Basically, the x axis is the frequency, how often you have a category level in your data set. K is the inflection point: that's basically the point where lambda equals .5. In the case of the red line, it's equal to 20. In the case of the blue line, k equals two. F controls the steepness: the smaller the f, the steeper your function. The f of the blue line is less than the f of the red line. As f approaches zero, you get a stepwise function, basically, which is zero until the inflection point and then goes to one after the inflection point.

As an example, that's the function I'm trying to explain. If we just run it for different cases, you see we are just shifting and squeezing it. Basically, I'm running it for different k's from zero to four, and you can see how the function shifts given the different inflection points. F controls the steepness of the function: the bigger the f, the smoother your function will be. The simplest case is f equal to zero. That basically tells you: hey, if I have fewer than two examples of my category level, I'm going to use the mean of the data set. If I have more than two examples of my category level in the data, I'm going to use the mean of the level. That's it. By adding steepness, you just smooth the result, so around two it's going to be 0.5.

That's exactly the example that I just showed you. In this example, f is always set to .25; I'm just playing with k, shifting it from two to three. You can see how the encodings vary with this change, especially for C. If k equals two, that's exactly the number of examples I have for C, so the lambda will be 0.5, and I get an equally weighted average between the mean of the category level and the dataset mean, which is .75 in that case. If I move k to three, I immediately get a very small weight for the mean of C and a very big weight for the mean of the dataset. You can see this because the bigger the k, the more conservative the model becomes, the closer the encoding is to the mean of the data set. The smaller the k, the less conservative it is, and of course the closer it is to the level mean.
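The smoothing can be sketched as follows; the sigmoid form of lambda is an assumption consistent with the description above (lambda = 0.5 at n = k, f controlling steepness), and the sample numbers reuse the earlier made-up example:

```python
import math

def smoothing_weight(n, k, f):
    """lambda(n): equals 0.5 at the inflection point n == k; smaller f => steeper."""
    return 1.0 / (1.0 + math.exp(-(n - k) / f))

def smoothed_encoding(level_mean, global_mean, n, k=2, f=0.25):
    """Weighted average between the level mean and the dataset mean."""
    lam = smoothing_weight(n, k, f)
    return lam * level_mean + (1 - lam) * global_mean

# Level C from the earlier example: level mean 1.0, seen n=2 times,
# dataset mean 7/9. With k == 2 the weight is exactly 0.5.
enc_c = smoothed_encoding(1.0, 7/9, n=2, k=2, f=0.25)
```

With k pushed to 3, the same call gives a value much closer to the dataset mean, which is the "more conservative" behavior described above.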

What else can be done? Even in that case, even if you apply smoothing, a very complex algorithm like XGBoost, for example, can find a leakage in this dataset. What you can do: you can take the categories which have a small frequency in your dataset and merge them together to create a bigger category level. That's one approach. The second approach: you can somehow introduce noise into your data. Instead of just blindly encoding using the mean, you can use a leave-one-out approach. Basically, you encode each record using the rest of the records. In this case, for row one, which has feature A, you find all the rest of the A's in your data and calculate the mean to get the encoding for this particular row. For the second row, you do exactly the same, but you just exclude the second one. That's why it's called leave-one-out: you're obviously leaving one out of your data.

And for the third, and for the fourth, you repeat the operation for all records in your data set. It's very time consuming, I must say, but it's very reliable; especially if you have a small data set, that might be your weapon of choice. As you can see, the encoding here is slightly different. Basically, as I mentioned, you can think of it as a way to introduce noise into your data, which prevents the model from overfitting, because in that case a model you're trying to fit on this dataset has to be more careful and not blindly rely on one single column. Also, as another technique, instead of doing leave-one-out, you can just add some noise to the found encoding. Of course, the noise is supposed to be independent for each row. I'm pretty sure the noise should not be normally distributed; it should rather be uniformly distributed, because normal distribution is something all machine learning algorithms are designed for. They are very good at approximating anything that has a normal distribution.
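A leave-one-out encoding sketch on the same made-up data; each row's code is the level mean computed from all *other* rows of that level:

```python
from collections import defaultdict

feature = ["A", "A", "A", "A", "B", "B", "B", "C", "C"]
target  = [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ]

sums, counts = defaultdict(float), defaultdict(int)
for f, t in zip(feature, target):
    sums[f] += t
    counts[f] += 1

# Subtract the row's own target before averaging: (sum - t) / (count - 1).
# Singleton levels have no "other rows"; 0.0 is an arbitrary fallback here.
loo = [(sums[f] - t) / (counts[f] - 1) if counts[f] > 1 else 0.0
       for f, t in zip(feature, target)]
```

Note how identical levels now get different codes depending on their own target (the A rows with target 1 encode to 2/3, the A row with target 0 encodes to 1.0), which is the per-row noise that makes blind reliance on this column harder.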

That's how you basically deal with normal distributions all the time. The second thing about random noise is that you have to somehow calculate the right range for your random noise. For binary classification, it's easy to tell what kind of range you would like to apply. But for a regression task, it's not that easy to get an idea of exactly what kind of randomness you should add to your data in order to not overfit.

### Weight of Evidence

Okay, the last technique you can use for categorical encoding. That's actually a well-known technique in the banking domain, called the Weight of Evidence, and it's very easy. Basically, given the level, you just calculate the percent of non-events, your negative events, divided by the percent of positive events, and take the natural log of that. To avoid division by zero, you just add a small number to the number of non-events in the group or events in the group.

To highlight what exactly I mean by that, I just wrote some simple examples. Say we have category A: it has just one single non-event, which is a zero, and it has three events, which are ones. Across the whole data set, we have nine records, and we have seven positive examples. I forgot to check that number, I'm sorry, it's supposed to be a slightly different number, I just forgot to fix it. Oh, no, actually that's right. For positives, it should be three divided by seven, the total number of positive events. For negatives, it should be one by two; we just have two negative events in our data set, so it'll be 50%. You repeat the same procedure for all category levels you have, and then you take the natural log of that, which gives you the weight of evidence.
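A sketch of the calculation as described above (log of percent-of-non-events over percent-of-events per level, with a small epsilon against division by zero); the data is the same made-up nine-row example:

```python
import math
from collections import defaultdict

feature = ["A", "A", "A", "A", "B", "B", "B", "C", "C"]
target  = [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ]

events, nonevents = defaultdict(int), defaultdict(int)
for f, t in zip(feature, target):
    if t == 1:
        events[f] += 1
    else:
        nonevents[f] += 1

total_events = sum(events.values())        # 7 positives
total_nonevents = sum(nonevents.values())  # 2 negatives
eps = 1e-6  # avoids division by zero and log of zero

woe = {}
for level in set(feature):
    pct_nonevents = nonevents[level] / total_nonevents  # e.g. A: 1/2
    pct_events = events[level] / total_events           # e.g. A: 3/7
    woe[level] = math.log((pct_nonevents + eps) / (pct_events + eps))
```

For A this gives ln((1/2) / (3/7)), roughly 0.15; for C, which has zero non-events, the epsilon produces a large negative value, a sign of the kind of suspicious level the next section's information-value check is meant to flag.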

### Information Value

That's another way you can encode the particular levels in your categorical data. What else can be done? This weight of evidence has a nice addition to it, called information value. You can calculate the information value for a weight of evidence: you take the difference between the percent of non-events and the percent of events, multiply by the weight of evidence, and sum it across your levels. This gives you some number. This number can be used to select features; basically, it's your feature importance. You can use this number to pre-select the features or categories you would like to use. The rule of thumb here is quite simple. If the information value is less than .02, it's not useful for prediction at all. From .02 up to 0.1, it has weak predictive power. From .1 to .3, that's medium. From .3 to .5, it's strong. If it's more than that, it's a very suspicious column. You might have some leakage in your data, so you have to take a look at exactly what you did. Maybe you made some mistake, maybe your data has a very long tail of infrequent categories which contain something like only zeros or only ones. Some investigation supposedly needs to be done, but basically, if you try to fit this feature into your machine learning model, you will overfit.
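The information value for the example above can be sketched like this; the per-level distributions are the ones computed in the weight-of-evidence example:

```python
import math

# level: (pct_nonevents, pct_events), from the nine-row example above
levels = {
    "A": (1/2, 3/7),
    "B": (1/2, 2/7),
    "C": (0.0, 2/7),
}
eps = 1e-6

# IV = sum over levels of (pct_nonevents - pct_events) * WoE
iv = sum((pn - pe) * math.log((pn + eps) / (pe + eps))
         for pn, pe in levels.values())
```

Each term is non-negative, and here the zero-non-event level C blows the total well past 0.5, landing squarely in the "very suspicious, investigate for leakage" zone of the rule of thumb.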

Another nice property of this particular encoding: you can use the same approach to encode numerical values. Because you can calculate the information value, you can play with it. Say you have a numerical feature: you can binarize it using quantiles, for example, but you can also play with these quantiles and merge some of them, guided by the information value of the whole column. Or, for example, say you quantize your numerical value and compute the weight of evidence per bin. If two bins have the same weight of evidence, you can merge them together, because for a machine learning algorithm there is no difference between them; they have exactly the same ratio inside.

Why bother to keep them separate? This is extremely useful in case you're trying to binarize numerical values and basically treat them as categorical. The good thing about numerical features is that you can use them as they are, especially for tree-based models. You don't have to scale them, you don't have to normalize them, they are just as good as they are. But you also can treat them as categorical by binning them into quantiles, or using histograms basically. You can use bins of the same width, which is a histogram, or you can use bins which have the same population size, which are quantiles. What can be done? As I mentioned, you can encode them using a categorical encoding schema, or you can replace them with the bin's mean or median.
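The two binning strategies can be contrasted in a few lines; the skewed sample values are made up to show the difference:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 100.0])

# Equal-population bins (quantiles): each bin gets the same number of rows.
quartile_edges = np.quantile(values, [0.25, 0.5, 0.75])
quantile_bin = np.digitize(values, quartile_edges)

# Equal-width bins (histogram): each bin covers the same value range.
width_edges = np.histogram_bin_edges(values, bins=4)
width_bin = np.digitize(values, width_edges[1:-1])
```

On this skewed column the quantile bins come out perfectly balanced, while the equal-width bins dump almost everything into the first bin because of the single outlier, which is why quantiles are often the safer default for long-tailed features.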

Usually it's very handy in case you have sensor data, especially if you have a sensor that's supposed to stay around some value but is oscillating all the time. You have tiny little changes because of some measurement error, but if you don't need the exact value, just a level approximation, that's the way you can do it. Also, you can apply different dimensionality reduction techniques to several numerical features to get a smaller representation of the same features. My intuition behind this: say you have three numerical features; with SVD you can reduce them to one single feature, and that feature might be useful for tree-based models, because tree-based models are usually not that good at approximating linear dependencies. But SVD and PCA, because of their linear nature, can actually be quite good at approximating some sort of linear dependency for you. That'll be very good support for tree-based methods.

### Numerical Features

Also, for numerical features, again, you can cluster them. You can just use K-means to cluster them, and then you have cluster IDs, which you can treat as categorical again, using a categorical encoding schema. It's also very useful that, for each row in your dataset, you can calculate the distances to the cluster centers, and that gives you a whole set of new features.
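A tiny K-means sketch (hand-rolled for self-containment; scikit-learn's `KMeans` would normally be used) that yields both kinds of features mentioned above, on two made-up well-separated blobs:

```python
import numpy as np

def kmeans_features(X, k=2, iters=10, seed=0):
    """Return cluster IDs (usable as a categorical feature) and the
    distances from each row to every center (numeric features)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), dists

# Two well-separated blobs
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, dists = kmeans_features(X, k=2)
```

The `labels` column can then be target- or frequency-encoded like any other categorical, while `dists` adds k numeric columns per row.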

Not to mention simple features like a power of two, for example, square roots, or even addition and multiplication. We all know that a random forest is a very good approximator; it can approximate literally anything, but to approximate this kind of dependency it takes a random forest a lot of trees to build. If you just provide the power-of-two feature along with the raw features, it'll require far fewer trees and approximate much better.

### How to Find the Right Representation

That's the introduction to the next part. We just discussed a simple approach, we had an overview of the tools: what exactly we can do with the features, how we can represent them in a numerical form. But the thing is, how exactly do we find the right representation, how do we find different feature interactions to help our model converge faster and maybe produce a smoother decision curve? Well, of course, it requires domain knowledge, and you can analyze the machine learning algorithm's behavior. You can analyze GBM splits, obviously; you can analyze linear regression, for example. That's something I will discuss at the next meetup on advanced feature engineering. The question is, how can you encode different feature interactions? Well, for numerical features, again, you can apply different mathematical operations. You can also use clustering, or, say, k-nearest neighbors to create some features for you. If you have a pair of categorical features, you can combine them together and treat them as a new category level. You have a pair of categories and you just encode them using any of the schemas above. If you would like to encode the interaction between a categorical and a numerical feature, for each categorical level you can calculate different statistics of the numerical feature, like a mean, median, standard deviation; usually mean and standard deviation are quite helpful. Min and max might be, but it depends on the problem you have at hand.
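Two of the interaction ideas above, combining a pair of categoricals into a new level and computing a per-level statistic of a numeric column, can be sketched like this on made-up data:

```python
from collections import defaultdict

cat_a = ["red", "red", "blue", "blue"]
cat_b = ["S",   "L",   "S",    "L"]
num   = [1.0,   2.0,   3.0,    5.0]

# Categorical x categorical: concatenate the pair into one new level,
# then encode it with any of the schemas above.
combined = [f"{a}_{b}" for a, b in zip(cat_a, cat_b)]

# Categorical x numerical: a per-level statistic of the numeric column
# (mean here; median, std, min, max work the same way).
group_sum, group_cnt = defaultdict(float), defaultdict(int)
for a, x in zip(cat_a, num):
    group_sum[a] += x
    group_cnt[a] += 1
group_mean = {a: group_sum[a] / group_cnt[a] for a in group_cnt}
num_by_cat = [group_mean[a] for a in cat_a]
```

In a pandas workflow the second part is the one-liner `df.groupby("cat_a")["num"].transform("mean")`.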

### Feature Extraction

Feature extraction: given the raw data, you would like to extract some features. For example, if you have GPS coordinates, you can download a third-party dataset and basically map the GPS coordinates to zip codes. From the zip code, you can get information about the population or other different statistical information. If you have a time in your data, you can extract the exact year, month, day, hour or minute, and time ranges. If you have a holiday calendar, that's usually helpful for retail chains: you can have a flag, is it a holiday or not, for example, or what kind of holiday it is. As I mentioned a couple of times before, you can bin numbers, say age, into ranges, which is usually quite helpful, especially for age.
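The date-and-holiday extraction can be sketched with the standard library; the two-entry holiday calendar is a made-up stand-in for a real one:

```python
from datetime import datetime

HOLIDAYS = {(12, 25), (1, 1)}  # toy (month, day) holiday calendar

def date_features(ts: datetime) -> dict:
    """Expand one timestamp into several model-ready columns."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "weekday": ts.weekday(),           # 0 = Monday
        "is_weekend": ts.weekday() >= 5,
        "is_holiday": (ts.month, ts.day) in HOLIDAYS,
    }

feats = date_features(datetime(2017, 12, 25, 14, 30))
```

Each returned key becomes its own column; the flags can go straight into a tree-based model, and `month` or `weekday` can additionally be treated as categoricals with the encodings from earlier.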

### Textual Data

Well, for textual data, the classical approach is Bag of Words, obviously. But we all know it doesn't stop there. Of course, you can use deep learning as well, especially if you do word2vec or doc2vec in between. The beauty of it is that right now you don't have to train anything yourself. You can use pre-trained word vectors and use them as they are in your model. For example, say you have a short document in your data set: you can transform its words into vectors and then calculate the average vector, and that'll be your doc2vec-style representation of the document. I think that's it. You can ask questions, thank you. Question on the west side.
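The average-of-word-vectors idea can be sketched with a toy lookup table; real pre-trained vectors (word2vec, GloVe) are just a much larger version of this dictionary with higher-dimensional values:

```python
import numpy as np

# Toy "pretrained" word vectors (2-dimensional for readability)
word_vectors = {
    "good":  np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

def doc_vector(doc, vectors):
    """Average the vectors of the known words to get one document vector."""
    vecs = [vectors[w] for w in doc.lower().split() if w in vectors]
    if not vecs:  # no known words: fall back to the zero vector
        return np.zeros(next(iter(vectors.values())).shape)
    return np.mean(vecs, axis=0)

vec = doc_vector("good movie", word_vectors)
```

The resulting fixed-length vector can be fed to any model, regardless of how long the original document was.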

### Q&A with Dmitry

**Moderator:**

Lots of questions, which is great. Are you doing a talk at H2O World as well?

**Dmitry Larko:**

Yeah, I do the talk, but I will actually talk about a different topic.

**Moderator:**

If you want to know more, there are good, in-depth talks next week, Monday, just at the computer museum not far from here, at H2O World. If you're not signed up, I think there are probably maybe half a dozen or maybe a dozen seats left before it's fully sold out. 700 people have signed up, but I'm sure there's a meetup. I don't know what you guys are doing. Dmitry is speaking again, but there are a lot more exciting speakers. Really interesting.

**Dmitry Larko:**

We're going to have the Kaggle number one actually on our panel. Mario, Kaggle number three and a former Kaggle number one grandmaster, is going to give a talk and be on a panel as well. I'm going to be on the panel and I'm going to have a talk as well.

**Moderator:**

Could you go back once more to that? You should make a GIF out of that feature engineering interaction, what you do, how the target changes and what is happening.

**Dmitry Larko:**

We can make a GIF file, actually, that's a nice idea.

**Rosalie:**

Awesome. Thank you all for submitting questions. If you still want to submit some questions, just use the Slido hashtag, and we'll be answering them now. Hi everyone, I'm Rosalie, I'm our Director of Community. Thank you all for coming tonight. Dmitry, I figured we could just ask you the questions if you'd like.

First question for you, what are the cases where it took forever to compute and what did you do about it? Are there specific hardware or configurations, so it does not take so long?

**Dmitry Larko:**

Well, it depends on exactly what we're trying to compute. If we're just trying to compute something like a target encoding, well, in that case, usually it's because I wrote shitty code and I just have to rewrite it. If it's a machine learning model I would like to speed up, I might switch to a smaller model, I might sub-sample the data, or, because I work at H2O, I have access to a cluster. Basically, I can spin up an H2O cluster and run on it. That's actually exactly what I do all the time. I just borrow the cluster from H2O.

**Rosalie:**

It says here, is frequency encoding similar, or same as Huffman Encoding, which is used in communication systems?

**Dmitry Larko:**

I think so actually. Yeah, very close. I think that's exactly how I found it, for the first time.

**Rosalie:**

How can one decide which is the best category encoding?

**Dmitry Larko:**

By random chance. I mean by checking all of them and finding the best, basically. It requires some gut feeling, but from my expertise, target encoding usually proves to be the best. It's my weapon of choice.

**Audience Member:**

That was my question but I wanted to.

**Dmitry Larko:**

Oh, the metric, that's a good question. Well, obviously, it depends. The metric depends on the business problem you're trying to solve. The whole setup, actually: you have raw data, you try different feature engineering, you fit it to your model, you check it on your validation set using the metric you selected with the business, and that's how you get the results. If it gives a significant improvement, you apply this feature as well. If it doesn't, well, you try again and again and again. The metric is actually a business question. I have a very good example where a company would like to predict something. Say at the beginning of the quarter it doesn't matter whether you overpredict or underpredict, but as soon as you're getting close to the end of the quarter, overprediction costs them a lot of money.

They would actually like it to be smaller. That requires you to design a metric yourself, basically. But again, it has nothing to do with the feature engineering. You can use exactly the same features; you just see how your model fits the metric of choice, basically. But mostly, what you can do: say you have a classification problem and AUC as a metric; in that case, you can encode your categories and immediately check them against AUC. Obviously, that's very easy to do, and something that can be done. The same goes for classical metrics, RMSE for example, when you would like to encode your category. Basically, your category encoding, especially target encoding, is a small, simple model by itself. You can measure it directly if you like, but it's not something I highly recommend, because in most practical cases, even though this simple comparison can look good, the final model can actually be bad. There's no guarantee.

**Rosalie:**

Another question from that attendee, they're curious what to do with a feature with very large categorical values?

**Dmitry Larko:**

You mean, when we have a lot of levels? You can apply both techniques at the same time. You can do smoothing and leave-one-out. Leave-one-out is extremely expensive, but instead of leave-one-out, you can actually do cross-validation. You just split your data into five chunks, for example. You encode using four chunks and apply this found encoding to the fifth one, and you repeat this process for all five folds. Obviously, the fastest way is just to apply random noise, which could be a replacement, but in that case, you have to design it carefully. I didn't include random noise here for just one reason: I don't have a good understanding of exactly what kind of noise to add, what the rules are for adding the noise. That's why I don't have it in my slides.

**Rosalie:**

Many great questions coming in. Thank you all. Another question, it's claimed that neural networks will make feature engineering obsolete. What do you think and why?

**Dmitry Larko:**

In image recognition, in speech recognition, in what else? In textual data, maybe? Yeah, definitely. As soon as you have unstructured data, like images or sound, that's the best way; they can design features for you. If you have a structured data set, like data we work on from a database, structured data, neural nets are not so great, actually. Usually they perform poorly compared to tree-based methods, especially when those are empowered by smart feature engineering. You can still use a neural net on structured data, but it still requires you to carefully prepare the features for it. It's an active area of research, I'm pretty sure, but there's nothing smart we can do; we're just waiting for the final results.

**Rosalie:**

Any techniques on feature extraction for time series or historical data?

**Dmitry Larko:**

Yes, a lot of them, actually. But most of what you can do is different lag features; that's obviously what you do with time series data. The key to success is to carefully apply your validation schema, especially if you'd like to predict something like two weeks ahead, for example. In that case, given the data, you can't use the last two weeks of your data set; your features should be created on the data before that. The key here is that with your validation set, you're trying to model the actual real-life scenario. As for lag features, exponential moving averages are usually very helpful. That means your most recent events are more important than the earlier events. With tree-based methods, of course, you can apply different techniques, but it's not that fun.
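A minimal sketch of the two building blocks mentioned, lag features and an exponential moving average (the toy sales series is made up; for a two-weeks-ahead target, the smallest lag would have to start at the forecast horizon):

```python
def lag_features(series, lags=(1, 2)):
    """For each time step, collect values from `lag` steps back (None if unavailable)."""
    return [[series[t - k] if t - k >= 0 else None for k in lags]
            for t in range(len(series))]

def ewm(series, alpha=0.5):
    """Exponentially weighted moving average: recent points weigh more."""
    out, avg = [], series[0]
    for x in series:
        avg = alpha * x + (1 - alpha) * avg
        out.append(avg)
    return out

sales = [10, 12, 11, 15, 14]
print(lag_features(sales))  # each row: [value 1 step back, value 2 steps back]
print(ewm(sales))           # smoothed series, weighted toward recent values
```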

**Rosalie:**

I'm seeing, will you post a presentation online? Yes, we will. You can find it on our YouTube in a couple days. What are the methods of encoding a list of elements?

**Dmitry Larko:**

That depends. If the list has a sequence, if the elements are ordered somehow, you can just create each of them as a separate column. If there is no sequence in it, obviously you can one-hot encode them, because that's something you usually do in marketing. That's how you deal with these kinds of events. It's a very common task in marketing: given a set of events, you're trying to predict user behavior, or let's say, given the user's click events, you're trying to predict something about the user. I do have a couple of ideas, but they need to be checked before I can share them. But obviously, it's not very easy compared to the given problem.
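One-hot encoding a list-valued column, as described, can be sketched like this (the click events and names are illustrative):

```python
def one_hot_lists(rows):
    """Turn a list-of-events column into one binary column per event type."""
    vocab = sorted({e for row in rows for e in row})
    matrix = [[1 if e in set(row) else 0 for e in vocab] for row in rows]
    return vocab, matrix

clicks = [["ad", "search"], ["search"], ["ad", "cart", "search"]]
vocab, matrix = one_hot_lists(clicks)
print(vocab)   # ['ad', 'cart', 'search']
print(matrix)  # [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
```

The result is the sparse, wide representation mentioned later in the Q&A, where each user becomes a binary vector over all event types.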

**Rosalie:**

Can we pile up all these methods together, run the model, and find out the best one?

**Dmitry Larko:**

That's something you can do. Although, usually it's very expensive. Say you have a data set with 500 columns that you found useful. If some of them are categorical, you have to encode them using every categorical encoding schema I've shown you. Then you have to apply, let's say, feature selection methods.

Which could be recursive feature elimination. Or you can fit a linear model and use L1 regularization to get the features which have nonzero weights. But I don't think it's a good idea. Usually, if you just do it step by step, you get better results, because the feature selection methods which are available right now are not very reliable. Say a method found 10 features out of your one hundred; that doesn't mean it actually found all of them. You might still have some random features inside, and you can still have some useful features missing. I wouldn't recommend that, actually. Going feature by feature, one by one, usually provides the most stable solutions.
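The L1 idea mentioned above can be sketched without any libraries via proximal gradient descent (ISTA). This is a toy illustration with made-up data, not a production feature selector; in practice you would reach for an off-the-shelf L1-regularized model:

```python
def lasso_ista(X, y, lam=0.1, lr=0.01, steps=5000):
    """L1-regularized linear regression via proximal gradient descent (ISTA)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        # gradient of the squared-error term, (1/n) * X^T (Xw - y)
        resid = [sum(w[j] * X[i][j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(resid[i] * X[i][j] for i in range(n)) / n for j in range(d)]
        for j in range(d):
            z = w[j] - lr * grad[j]
            # soft-thresholding: the L1 penalty drives small weights to exactly zero
            w[j] = max(abs(z) - lr * lam, 0.0) * (1.0 if z > 0 else -1.0)
    return w

# toy data: the target depends only on the first feature; the rest is noise
X = [[1, 0.1, -0.2], [2, -0.1, 0.1], [3, 0.2, 0.0], [4, -0.2, 0.2]]
y = [2.0, 4.0, 6.0, 8.0]
w = lasso_ista(X, y)
selected = [j for j, wj in enumerate(w) if abs(wj) > 1e-6]
print(w, selected)  # the noise features are driven to (near) zero
```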

**Rosalie:**

Can you do target mean encoding with H2O Flow?

**Dmitry Larko:**

No, we can do it with our product, which I was forced to work on. I was actually instructed not to mention it.

**Rosalie:**

Another great question. Thank you all for the questions. I see more just coming in. Have you come across examples where feature engineering was applied in physical security, especially in a domain like banking?

**Dmitry Larko:**

In banking? Yes. Let's say fraud detection, anomaly detection; they usually apply and use a different tool set to do the same stuff. We should have actually invited him; he can talk for days about that, I'm pretty sure.

**Rosalie:**

I will make a note on my Trello. Question here. Is there any good feature encoding tool, for example, a tool to help compute target mean encoding?

**Dmitry Larko:**

Yeah, I think there are some scripts in the wild for Python. I don't remember the links, but you can find these scripts for sure on the Kaggle site. Just go onto Kaggle and search for target encoding. I also like searching on GitHub for exactly the same thing; just search GitHub for target encoding and I'm sure you'll find something. If I find the package I'm referring to, I'll post the link to the page.

**Rosalie:**

How to deal with complex objects having partially unknown attributes?

**Dmitry Larko:**

Can I have an example? What exactly is meant by that?

**Audience Member:**

For example, a user is a person with an age, and we don't know what that age is, okay, and then...

**Dmitry Larko:**

You're talking about imputation? Basically, you're saying some information about the user is missing.

**Audience Member:**

For some users we've got that information.

**Dmitry Larko:**

We don't have an age for one particular user, but we have it for others. Well, it depends; you can apply different imputation techniques to handle that. Obviously, the simplest one: you can just replace missing values with the mean. Kind of stupid, but it works in some cases. It also depends; sometimes your missing values can be there by mistake. They're not supposed to be empty, but they are empty; it's a mistake in your data-gathering process. Sometimes a missing value can actually mean something; the information can be in the fact that the value is missing. There could be some reason behind it. In that case, you can leave the missing values as they are. Modern methods like XGBoost can handle missing values pretty easily.

On each split, it decides where exactly to send all the missing values depending on the loss function. As it grows the tree, on each split it decides whether to put all the missing values left or right; a pretty powerful technique. In the case of encoding, let's say for categoricals, just treat the missing values as a separate category. The same goes for numerical values: it's a separate bin. A rule of thumb: for numericals, you replace your missing values with the mean, and you also add a flag feature, a binary feature that tells you there was a missing value, zero or one. Instead of one column, you will have two. That's usually a good starting point, especially for neural nets and linear models, because you keep both pieces of information. You replace the value with the mean, so it survives your normalization schema, but you also keep the information that there was a missing value there. It's usually very helpful.
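The rule of thumb above, mean imputation plus a binary was-missing flag, can be sketched as follows (using `None` to stand for a missing value; the ages are made up):

```python
def impute_with_flag(values):
    """Replace None with the column mean and add a 0/1 'was missing' flag column."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [v if v is not None else mean for v in values]
    flags = [0 if v is not None else 1 for v in values]
    return filled, flags

ages = [25, None, 40, None, 31]
filled, flags = impute_with_flag(ages)
print(filled)  # [25, 32.0, 40, 32.0, 31]
print(flags)   # [0, 1, 0, 1, 0]
```

The filled column plays nicely with normalization, while the flag column preserves the fact that a value was missing.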

**Rosalie:**

Do you prefer certain models for categorical data such as random forests?

**Dmitry Larko:**

XGBoost mostly, but yeah, again, it depends. Because, let's say in the case of events, like you have a list of events, obviously what I would do immediately is just one-hot encode everything and fit a linear model to see what happens. Usually, events are very sparse, and in a very wide space that's enough; it already provides a very good model.

**Rosalie:**

Any other questions? I think we're good on questions here. I was actually just going to pass the microphone around if anybody wanted to ask any questions. I mean, if you do, just raise your hand and I'll come on over.

**Audience Member:**

You were talking about combining categories into one if they have almost the same predictive power. That makes sense if your model is completely statistical, but does it mean you can combine them by the same predictive power only after you've made sure your features are independent, for example, with some preliminary analysis?

**Dmitry Larko:**

That's what the model actually sees in the data. Basically, because the model always sees the same weight-of-evidence number, there's no way for it to distinguish whether this 0.4 value is for the A category or for the B category. It's already combined, basically. It can be a good idea if it brings you good insight, mostly for numerical features, because it shows you can combine the bins. It makes sense to combine neighboring bins, but if you have, say, bin 1 and bin 10 with the same weight of evidence, it's strange to combine them, obviously.

**Audience Member:**

They have, like, a natural order.

**Dmitry Larko:**

Ideally, yes. This March at the GTC conference, there was a paper about how you can predict anomalies using recurrent neural nets, and that's exactly what they did. They binned the numerical features, but they forced the neural net to always keep the same order for them. They try to learn an embedding, but if the embedding is not in the right order, they just shift the weights all the time. It's a strange idea, but they force the numerical values to keep the same natural order they're supposed to have.

**Rosalie:**

All right, who is next? I'll have you go first and then I'll have you go.

**Audience Member:**

When we are using frequency encoding for some feature, and then we want to apply the model to predict, then comes the real input. How do we get the frequency there?

**Dmitry Larko:**

During your training, you learn the mapping table. If you have test data, you apply this lookup table to the test data. This table is never changed during the inference process. You learn these frequencies from your train data. If you have a new data set, like a test data set you're trying to predict on, you never calculate anything; you just use the values you found. This immediately brings up the question: what if I have a new level in this data set? Well, you can use missing values if your machine learning model is able to handle missing values. Or you can just say zero, because you never saw this level during your training.
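The lookup-table workflow he describes (learn frequencies on train, apply them unchanged at inference, fall back to zero for unseen levels) can be sketched as follows, with made-up color levels:

```python
from collections import Counter

def fit_frequency_encoding(train_column):
    """Learn level -> relative frequency from the training column only."""
    counts = Counter(train_column)
    n = len(train_column)
    return {level: c / n for level, c in counts.items()}

def apply_frequency_encoding(mapping, column, unseen=0.0):
    """Apply the learned table; levels never seen in training get `unseen`."""
    return [mapping.get(level, unseen) for level in column]

train = ["red", "red", "blue", "green", "red", "blue"]
test = ["blue", "purple", "red"]  # "purple" was never seen during training
mapping = fit_frequency_encoding(train)
print(apply_frequency_encoding(mapping, test))
```

Setting `unseen=None` instead of `0.0` would hand the decision to a model that handles missing values natively, the other option he mentions.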

**Rosalie:**

Thanks so much. All right. I'm coming up here for you.

**Audience Member:**

How would you compare H2O's use of categorical variables, which it can learn from directly, to using feature encoding?

**Dmitry Larko:**

Not sure I'm following your question.

**Audience Member:**

Well, H2O supports categories as inputs.

**Dmitry Larko:**

Under the hood, it just encodes them using one-hot encoding. Or it can do one-hot encoding and then a dimensionality reduction; you just never see it yourself. The third way: it can do label encoding. That's all H2O does inside; you just don't know it.

**Audience Member:**

I just never quite figured out what it does with deep learning.

**Dmitry Larko:**

What you're referring to is a lookup table: why don't we just learn the weights? That is a common technique in neural nets, but H2O Deep Learning does not do that. It still does one-hot encoding, but it can do it on the fly.

**Audience Member:**

I think for trees it does use the categorical variable directly.

**Dmitry Larko:**

Not directly. You have to have some numerical representation, because that's how you're able to build a histogram on it and use this histogram to find the best split point. There is one approach which tries to use categorical values directly. There's a company called Yandex, and they recently released CatBoost, a gradient boosted decision tree library, and they state that CatBoost is able to handle categories as they are, like raw categorical data. But I didn't check it myself. Results are controversial: in some cases it performs extremely well, and in some cases it performs extremely badly. Nothing in between.

**Audience Member:**

I have a question. You mentioned gradient boosted decision trees, or regression trees. I have a question which is probably not directly related to this lecture, but I think it could be interesting to many people. For example, when I used to work on query-URL ranking, gradient boosted regression trees worked quite well. Recently, about a couple of weeks ago, I tried to use this kind of boosting with deep learning in computer vision. Basically, after we trained the network, I ran inference on the whole train set, and the pictures with the worst results were prioritized: they would be randomly selected more frequently. And this didn't work at all. What is your opinion?

**Dmitry Larko:**

There was one paper about how you can boost CNNs. You can find it on arXiv; I think it's called Boosted CNN or something. There was a specific architecture they tried to learn, and they got pretty good results. I tried to repeat them, and it was pretty good, but I wouldn't say it's the best in class. There is another technique you might find helpful. I think it's called Online Hard Example Mining. That's something you can try for yourself. The idea is quite simple: given the batch, you calculate the loss, you have a gradient, but you pass back only the gradients of the examples which were the hardest ones. Not all of them, just the hardest ones. But of course, there is a question of how to define the proportion and all this stuff. I haven't read the paper closely, so I don't have the details.
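The selection step described above can be sketched as a pure-Python toy. The `keep_ratio` knob and the batch losses are illustrative, not from the paper; in a real framework you would mask the per-example loss with this index set before backpropagating:

```python
def select_hard_examples(losses, keep_ratio=0.25):
    """Online hard example mining: keep only the highest-loss examples in a batch."""
    k = max(1, int(len(losses) * keep_ratio))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(ranked[:k])  # indices whose gradients would be kept

batch_losses = [0.1, 2.3, 0.4, 1.7, 0.2, 0.05, 0.9, 3.1]
print(select_hard_examples(batch_losses))  # the hardest quarter of the batch
```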

**Audience Member:**

The difference between what you are saying and what I was mentioning is that I tried to determine the worst ones by just full inference. But what you're suggesting is that we're still running in backpropagation mode, which means there's dropout, so the loss is not the real loss.

**Dmitry Larko:**

If you ask my opinion, which is not supported by anything, just my gut feeling: you see, in boosted trees, each and every tree is different. It's shaped differently depending on what exactly it's trying to predict. In neural nets, you always keep the same architecture at each step; it's the same architecture, just different weights. I think that's the biggest difference. Yes, it's a different model because you learn different weights, but it's not something that defines a different model form in that case. That's my gut feeling. I don't know for sure, because I did exactly the same thing you did. I just tried: what if I fit a neural net inside a boosting schema, which is easy to do in a machine learning library; you just apply a method and you're done. Nah, it didn't work at all. I mean, the first one is good, and that's it, basically.

**Audience Member:**

Okay, thanks.

**Rosalie:**

Thank you. Any other questions? If you raise your hand, I'll bring you the mic. Let me check the Slido real quick to see if folks maybe used that as well. Let's see here. I think that's all the questions, Dmitry. Thank you so much for your talk tonight, and thank you all for coming. Thank you.