
Feature Engineering with H2O

In this talk, Dmitry is going to share the approach to feature engineering that he has used successfully in various Kaggle competitions. He is going to cover common techniques used to convert your features into the numeric representations that ML algorithms consume.


Speakers:

Dmitry Larko, Data Scientist, H2O.ai

Rosalie, Director of Community


Dmitry Larko:

Thank you. This is going to be, I would say, the first talk maybe out of many, at least the first of two. I'm going to make an introduction to feature engineering. The next talk will be about some more advanced feature engineering. But as she mentioned, I do Kaggle for a living. They also force me to build a product at H2O, and I do my best in both fields. The last five years of my life I've spent on Kaggle, competing in different competitions. Usually not that good, as you can see, but not bad. Again, the topic of this talk is feature engineering and basically why I think it's very important. A lot of people across the domain, across the machine learning community, a lot of people who are well known in machine learning, agree on one thing: that feature engineering is extremely important.

Feature Engineering

But what exactly do we mean by feature engineering? The very simple explanation you can see on the bottom: basically, that's the way you transform your input so a machine learning algorithm can actually consume it and build good predictions. That's the easiest and simplest explanation of what feature engineering means that I was able to find. Basically, this slide is some sort of motivation. Consider we have a 2D space, we have points, red and blue, and we would like to build a linear classifier which is able to distinguish them correctly. That's not actually possible: there is no way to build this kind of linear classifier given the data, no way you can split them into separate classes with just a line. But if you transform the Cartesian coordinates into polar ones, they immediately become very easy to separate with just a single line.
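
To make the coordinate-transform motivation concrete, here is a minimal sketch, not from the talk's slides: two concentric rings of points cannot be split by one straight line in (x, y), but become trivially separable once converted to polar coordinates. The helper name `to_polar` and the toy data are illustrative.

```python
import numpy as np

def to_polar(x, y):
    """Hypothetical helper: convert Cartesian coordinates to polar ones."""
    r = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    return r, theta

rng = np.random.default_rng(42)
angles = rng.uniform(0, 2 * np.pi, size=200)
inner = np.c_[1.0 * np.cos(angles[:100]), 1.0 * np.sin(angles[:100])]   # class "red"
outer = np.c_[3.0 * np.cos(angles[100:]), 3.0 * np.sin(angles[100:])]   # class "blue"

r_inner, _ = to_polar(inner[:, 0], inner[:, 1])
r_outer, _ = to_polar(outer[:, 0], outer[:, 1])
# In polar space a single threshold on r (e.g. r < 2) separates the classes,
# so even a linear classifier on the engineered feature r does the job.
print(r_inner.max(), r_outer.min())   # ~1.0 vs ~3.0
```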

Of course, you apply a quite complex transformation. You can think of that transformation as feature engineering, because you just engineer a feature for the model, but it allows you to build a simpler, maybe more lightweight model. Compare that to, let's say, trying to fit a random forest to it. A typical machine learning workflow might look like this: you have a data integration step, the next step is data quality checks and transformation, and after the transformation and data quality check you have a table you can feed into your favorite machine learning algorithm. Each row of this table represents one single event, and each row has a target you would like to train the model to predict, and basically that's exactly the place where feature engineering happens. Of course, you can argue and say, hey, this earlier part is actually part of data engineering as well, but for this talk we are focused on this part. This one is way more complex, and I would say it's a vast area for future research, because there has not been much research done in this area: how to combine the available structured data to get a good data set for prediction.

What is Not Feature Engineering

What is not feature engineering from my point of view? As I mentioned, initial data collection is not feature engineering, and neither is the creation of the target variable, because that's something that should be business-driven. You have to have some business need to predict something in order to fulfill some business goal. Removing duplicates, handling missing values, fixing mislabeled classes: that's more like data cleaning. Of course, every machine learning model should be more or less stable if you have duplicates or some mislabeling, but it's not the goal of feature engineering, it's more like data cleaning. Scaling and normalization are not feature engineering by themselves; they're more like data preparation for specific models. For example, for a neural net you have to scale your data, or otherwise it won't work as you expect it to work. Feature selection is not per se feature engineering either, but I'm going to mention it in this talk in a couple of places. Basically, that's the feature engineering cycle: you have a data set, you have a hypothesis, you validate the hypothesis by applying it, and you create new features based on the existing ones. You repeat this process over and over again in pursuit of building a better and better model.

What Makes Feature Engineering Hard

How can you come up with these hypotheses, what's their source, basically, right? Obviously, if you are a domain expert, that's a significant source of knowledge. That's how you can actually build different features out of your existing features. Well, if you don't have domain knowledge, you can of course use your prior experience based on the nature of the data. Is it a numeric field, is it categorical, how many category levels do you have, how are your numerical features distributed, et cetera. That's something exploratory data analysis can help you with. And of course, my favorite part: you can use specific machine learning models, and by analyzing the model itself you can get some insight about how the data is structured and what kind of feature engineering transformations you can use to get a better model. Feature engineering is kind of a hard problem, especially if you try to apply a powerful feature transformation like target encoding, because in that type of transformation you explicitly try to encode categories using information about the target. That's something that can actually introduce a leakage into your data and your model, and the model will be fitted very well to the training data but will be completely useless in real-life usage.

Again, domain knowledge is extremely important, especially if you have specific knowledge about the nature of the data. Let's say, for example, at Chevron, if you analyze well data, how a well is actually drilled: that's a physical process, there are a lot of physical processes happening inside, which you can actually express via formulas. That's knowledge you can put inside your model as well. Of course, it's time consuming, especially if you have a lot of data, because you have to run your model against it. You have to test how good a feature is, and to do that you may have to run thousands of experiments, especially if you're driven only by EDA or previous experience.

Again, as I mentioned, simpler models give you better results. Ideally it would be nice to find some golden features and just fit a linear model on top of them, right? That would be the best possible scenario, because I always prefer simplicity over a complex model. Of course, in a real-life scenario it's never the case. You still have to apply a quite complex model like a random forest, gradient boosting, or a neural net to get some results. But still, good features can help the model to converge faster, and we can discuss three key components of that: target transformation, feature encoding, and feature extraction.

Target Transformation

Well, target transformation: that's something you can use to transform your target variable. That's especially useful for regression problems. Say your target is not normally distributed, it has a skewed distribution; in that case you can apply some transformations to make the distribution of the target more like a normal shape, like a bell curve. For example, the log transform proved to be very good in a few Kaggle competitions. There was a competition, Liberty Mutual Property Inspection Prediction, where you try to predict the outcome of a property inspection. On the X-axis you see different runs with different random parameters, on the Y-axis you see the score, normalized Gini in that case, and you can see how the models actually vary. The green line is the model trained on the log-transformed target. The variation of that model is smaller compared to the previous model; the standard deviation of its results is much smaller, which is actually good. That means your model is more stable, even though in some cases other models outperform it. In that case we are just looking for stability, and stability is usually better than just the best score at some single given point.
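
As a hedged sketch of the log target transform for a skewed regression target: train on log1p(y) and invert with expm1 at prediction time. The column names, model choice, and helper names below are illustrative, not from the talk.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_on_log_target(model, X, y):
    """Train on log1p(y); log1p/expm1 keep zero targets well-defined."""
    model.fit(X, np.log1p(y))
    return model

def predict_original_scale(model, X):
    """Invert the transform so predictions come back on the original scale."""
    return np.expm1(model.predict(X))

# Usage (assuming X_train, y_train, X_test already exist):
# model = fit_on_log_target(Ridge(), X_train, y_train)
# preds = predict_original_scale(model, X_test)
```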

Feature Encoding

Feature encoding, again, is one of the interesting topics to discuss: how you can encode your categorical features, right? Because most machine learning algorithms actually expect you to provide numeric data, numbers basically, right? A category like gender, for example, or a color, is not a number. Obviously the easiest way to do that is One Hot Encoding, which you're all familiar with. You can also do label encoding. That's a very simple technique: you just replace your category with some integer number, but it might be misleading, because in that case you introduce some order into your data, and in most cases, let's say for color, there is no order, especially one based on some randomly assigned integers. One Hot is a good idea of course, but in some cases, say you have a lot of levels in your categorical data, it becomes too huge. As an example of label encoding, you have three categories, say A, B, and C; you just map each category to some specific integer number and replace it with this number in your data set.
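
A minimal sketch of label encoding, not taken from the slides: each level is mapped to an arbitrary integer, which is exactly where the artificial order mentioned above comes from.

```python
import pandas as pd

df = pd.DataFrame({"feature": ["A", "B", "C", "A", "B", "A"]})

# Learn the mapping on the training data only.
mapping = {level: i for i, level in enumerate(sorted(df["feature"].unique()))}
df["feature_label"] = df["feature"].map(mapping)
print(mapping)                       # {'A': 0, 'B': 1, 'C': 2}
print(df["feature_label"].tolist())
# Caveat from the talk: this imposes an order (A < B < C) that usually has no
# real meaning for nominal categories like color.
```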

One Hot Encoding

For the One Hot Encoding example, you again have these three categories, but you have to create three different columns to represent these three levels. There is one advantage of One Hot Encoding compared to label encoding, though. Let's say you see a new category: it'll just be all zeros, right? Basically you just have your bias, let's say, if you fit a linear model on it.
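
A hedged sketch of one-hot encoding with scikit-learn; `handle_unknown="ignore"` reproduces the "new category becomes all zeros" behaviour described above. The data is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"feature": ["A", "B", "C", "A"]})
test = pd.DataFrame({"feature": ["B", "D"]})        # "D" was never seen in training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["feature"]])

print(enc.transform(train[["feature"]]).toarray())  # three columns, one per level
print(enc.transform(test[["feature"]]).toarray())   # unseen "D" row is all zeros
```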

You can also do frequency encoding: you encode your category using its frequency. Basically, you calculate how many times you see this category in your data set and just normalize it, divide by the total number of rows you have in your data set, to get the frequency. You can think of it as the probability of meeting this particular category in your data. In that case you can highlight the less frequent categories pretty well, right? You have category C just two times in the data set; that means it has a very low frequency compared to the rest of the categories. The disadvantage of that approach, of course, is that if you have categories which have the same frequency in the data, your machine learning model won't be able to distinguish them. Let's say you have two B's and two C's: they both have the same frequency. Also, you're introducing an order.

Yes, it also introduces an order, but in that case the order actually means something, right? Because it's the frequency. Basically the ordering is a good thing here, because I usually use this technique to feed tree-based ensembles. Tree-based ensembles are usually looking for the best splitting point, right? Basically, I'm looking for the right order so my model will be able to split them easily.
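
A minimal sketch of frequency encoding, with illustrative data: replace each level with how often it occurs in the training set, normalized by the number of rows.

```python
import pandas as pd

train = pd.DataFrame({"feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"]})

freq_map = train["feature"].value_counts(normalize=True)   # A: 4/9, B: 3/9, C: 2/9
train["feature_freq"] = train["feature"].map(freq_map)
print(freq_map.to_dict())
# As noted above, two levels with identical counts collapse to the same value,
# but the induced order (frequent vs. rare) is often useful for tree splits.
```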

Target Mean Encoding

Okay, the next approach you can use is target mean encoding. Given the outcome you would like to predict and the feature you would like to encode, you just replace each category with the mean of the outcome. In the case of A, we have four A's here, right? Three out of four are ones, so it will be 0.75: the probability that the outcome is one given feature A. In the case of B, it's 0.66. In the case of C, because we always have ones, the probability of the outcome being one given category C will always be one. It seems to be a very good approach, but you immediately see that if you have less frequent categories, the information about them will not be very reliable, right?
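
A hedged sketch of plain target mean encoding, no smoothing yet, mirroring the A/B/C example above with a made-up toy data set: each level is replaced by the mean of the target for that level.

```python
import pandas as pd

df = pd.DataFrame({
    "feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "target":  [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ],
})

means = df.groupby("feature")["target"].mean()   # A: 0.75, B: 0.67, C: 1.0
df["feature_te"] = df["feature"].map(means)
print(means.to_dict())
```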

You just have two examples; that's not very statistically significant. Even though both of them are actually ones, that doesn't tell you much, right? It could easily be by chance. To deal with that, instead of just encoding by the mean, we introduce a weighted average between the mean of the category level and the overall mean of the data set. We also have a weighting function lambda which depends on how often you see this level in your data set: the bigger the n, the bigger the lambda will be, which means your average will rely more on the mean of the level and less on the mean of the data set. It's some sort of smoothing approach, and a sigmoid-like step function is usually a very good way to model this kind of idea. Basically, n is the frequency, how often you have this category level in your data set. k is the inflection point; that's basically the point where lambda equals 0.5. In the case of the red line it's equal to 20, in the case of the blue line k equals 2. f controls the steepness: the smaller the f, the steeper your function. The blue line has a smaller f than the red line. If f equals zero, you basically have a step function which is zero until the inflection point and then goes to one after the inflection point.
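
For reference, the weighted average and sigmoid-style weighting just described can be written compactly. This is the standard smoothing formulation rather than a formula copied from the slides; the symbols are chosen to match the talk's n, k, and f.

```latex
% Smoothed target encoding for a level seen n times in the training data.
\[
\text{enc}(\text{level}) \;=\; \lambda(n)\,\bar{y}_{\text{level}} \;+\; \bigl(1-\lambda(n)\bigr)\,\bar{y}_{\text{global}},
\qquad
\lambda(n) \;=\; \frac{1}{1 + e^{-(n-k)/f}}.
\]
```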

As an example... nope, sorry. Okay, so that's the function I'm trying to explain, and if we just run it for different k's, you see we're just shifting it. Oh, I'm sorry, how do I... Basically, I'm running it for different k's from zero to four, and you can see how the function shifts given the different inflection points. f controls the steepness of the function: the bigger the f, the smoother your function will be. The simplest case is when your f equals zero. That basically tells you: hey, if I have fewer than two examples of my category level, I'm going to use the mean of the data set; if I have more than two examples of my category level in the data, I'm going to use the mean of the level. That's it. By adding steepness, you just smooth the results around two toward 0.5. So let's go back to the slides. No, oops. All right.

Target Mean Encoding: Smoothing

That's exactly the example I just showed you. In this example, f is always 0.25; I'm just playing with k, shifting it from two to three, and you can see how the encodings actually change, especially for C. If k equals two, which is exactly the number of examples I have for C, the lambda will be 0.5, and I get an even weighted average between the mean of the category level and the data set mean, which is 0.75 in that case. If I move k to three, I immediately get a very small weight for the mean of C and a very big weight for the mean of the data set. You can see that the bigger the k, the more conservative the model becomes, the closer the encoding is to the mean of the data set. The smaller the k, the less conservative it is, and of course the closer it is to the level mean. What else can be done? Even in that case, even if you apply smoothing, a very complex algorithm like XGBoost, for example, can find the leakage in this data set. What you can do is join the categories which have a small frequency in your data set: you can merge them together to create a bigger category level. That's one approach. The second approach: you can somehow introduce noise into your data, instead of just blindly encoding using the mean.
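
A sketch of the smoothed target mean encoding described above: a sigmoid lambda(n) blends the level mean with the global mean, with k as the inflection point and f as the steepness. The toy data and parameter values are illustrative, not the slide's exact numbers.

```python
import numpy as np
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, k=2.0, f=0.25):
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    lam = 1.0 / (1.0 + np.exp(-(stats["count"] - k) / f))
    encoding = lam * stats["mean"] + (1.0 - lam) * global_mean
    return df[cat_col].map(encoding), encoding

df = pd.DataFrame({
    "feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "target":  [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ],
})

# With k=2 the two-row level C gets lambda = 0.5, an even blend of its own mean
# (1.0) and the global mean of this toy data (7/9); with k=3 it leans on the
# global mean instead, i.e. the encoding becomes more conservative.
_, enc_k2 = smoothed_target_encode(df, "feature", "target", k=2, f=0.25)
_, enc_k3 = smoothed_target_encode(df, "feature", "target", k=3, f=0.25)
print(enc_k2.round(3).to_dict())
print(enc_k3.round(3).to_dict())
```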

Leave-One-Out Approach

You can also use a leave-one-out approach. Basically, you encode each record using the rest of the records. In this case, for row one, which has feature A, you find all the rest of the A rows and calculate the mean to get the encoding for this particular row. For the second row you do exactly the same, but you just exclude the second one. That's why it's called leave-one-out: you always leave one out of your data. For the third and the fourth you repeat the same operation, and so on for all records in your data set, and it's quite time consuming, I must say. But it's very reliable; especially if you have a small data set, that might be your weapon of choice. As you can see, the encoded data here is slightly different. Basically, as I mentioned, you can think of it as a way to introduce noise into your data, which helps prevent the model from overfitting, because in that case the model you're trying to fit on this data set has to be more careful and not blindly rely on one single column.
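
A hedged sketch of leave-one-out target encoding: each row is encoded with the target mean of all other rows of the same level. The vectorized trick (level sum minus the row's own target, divided by level count minus one) avoids an explicit loop; the toy data is illustrative.

```python
import numpy as np
import pandas as pd

def leave_one_out_encode(df, cat_col, target_col):
    grp = df.groupby(cat_col)[target_col]
    level_sum = grp.transform("sum")
    level_count = grp.transform("count")
    loo = (level_sum - df[target_col]) / (level_count - 1)
    # Levels seen only once have no "other" rows; fall back to the global mean.
    fallback = df[target_col].mean()
    return loo.replace([np.inf, -np.inf], np.nan).fillna(fallback)

df = pd.DataFrame({
    "feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "target":  [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ],
})
df["feature_loo"] = leave_one_out_encode(df, "feature", "target")
print(df)
```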

Also, as one technique, instead of doing leave-one-out you can just add some noise to the found encoding. Of course, the noise is supposed to be independent for each row. I'm pretty sure the noise should not be normally distributed; it should be more like uniformly distributed, because a normal distribution is something all machine learning algorithms are designed for, right? Basically, they're very good at approximating anything which has a normal distribution and its expected value.

That's basically what you always do with a normal distribution. The second thing about random noise is that you have to somehow calculate the right range for it. For binary classification it's kind of easy to see what kind of range you would like to apply, but for a regression task it's not that easy to get an idea of exactly what kind of randomness you should add to your data in order to not overfit. Okay, the last technique you can use for categoricals in Kaggle: it's actually a well-known technique in the banking domain called weight of evidence, and it's very easy. Basically, for a given level you just calculate the percent of non-events, your negative events, divided by the percent of positive events, and take the natural log of that. To avoid division by zero, you just add a small number to the number of non-events in the group or events in the group.

To highlight what exactly I mean by that, I wrote a simple example, right? Let's say we have a category which has just one single non-event, which is a zero, and it has three events, which are ones. Across the whole data set we have nine records, and we have seven positive examples. I forgot to check that number; sorry, it's supposed to be a slightly different number, I just forgot to fix it. It should be three divided by... oh no, actually that's right: for positives it should be three divided by seven, the total number of positive events. For negatives it should be one divided by two, because we just have two negative events in our data set, so it'll be 50%. You repeat the same procedure for every categorical level you have, and now you can take the natural log of that ratio, which gives you the weight of evidence.

That's another way you can actually encode the levels of your categorical data. What else can be done? This weight of evidence has a nice addition to it, called information value. You can calculate the information value for weight of evidence, which is calculated as the difference between the percent of non-events and the percent of events, multiplied by the weight of evidence, summed across your levels. This gives you a number, and this number can be used for selection; basically it's a feature importance. You can use this number to pre-select the features or categories you would like to use. The rule of thumb here is quite simple: if the information value is less than 0.02, the column is not useful for prediction at all. From 0.02 up to 0.1 it has weak predictive power, from 0.1 to 0.3 it's medium, and from 0.3 to 0.5 it's strong. If it's more than that, it's a very suspicious column. You might have some leakage in your data, so you have to take a look at what exactly you did. Maybe you made some mistake, or maybe your data has a very long tail of infrequent categories which are always, say, zero or one. Some investigation needs to be done. But basically, if you feed this feature into your machine learning model, you will overfit.
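
A hedged sketch of weight of evidence and information value for a binary target, following the talk's definition (percent of non-events over percent of events, natural log); the small epsilon guards against division by zero, as mentioned. The toy data reuses the nine-row example from above.

```python
import numpy as np
import pandas as pd

def weight_of_evidence(df, cat_col, target_col, eps=0.5):
    total_pos = (df[target_col] == 1).sum()
    total_neg = (df[target_col] == 0).sum()
    grp = df.groupby(cat_col)[target_col]
    pos = grp.sum()                       # events per level
    neg = grp.count() - pos               # non-events per level
    pct_pos = (pos + eps) / total_pos
    pct_neg = (neg + eps) / total_neg
    woe = np.log(pct_neg / pct_pos)       # WoE = ln(% non-events / % events)
    iv = ((pct_neg - pct_pos) * woe).sum()
    return woe, iv

df = pd.DataFrame({
    "feature": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "target":  [ 1,   1,   1,   0,   1,   1,   0,   1,   1 ],
})
woe, iv = weight_of_evidence(df, "feature", "target")
print(woe.round(3).to_dict(), round(iv, 3))
```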

Numerical Features

What other nice properties does this particular encoding have? You can also use the same approach to encode numerical values, because you can calculate the information value for them too. Let's say you have a numerical feature: you can bin it, using quantiles for example, but you can also play with these quantiles and merge some of them given the information value of the whole column. Or, for example, you quantize your numerical value and calculate the weight of evidence: if two bins have the same weight of evidence, you can actually merge them together, because for a machine learning algorithm there's no difference between them, they have exactly the same ratio inside. Why would you keep them separate? That's extremely useful in case you're trying to binarize, basically to treat your numerical values as categorical.

So that's the good thing about numerical features: you can use them as they are, especially for tree-based models. You don't have to scale them, you don't have to normalize them; they're just good as they are. But you can also treat them as categorical by binning them, either into bins of the same width, which gives you histograms, or into bins with the same population, which gives you quantiles. Then, as I mentioned, you can encode the bins using any categorical schema, or you can just replace the values with the bin means or medians. That kind of approach is usually very handy if you have sensor data, especially sensors that are supposed to stay around some value but oscillate all the time.
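
A minimal sketch of the two binning schemes mentioned above: equal-width bins (histogram-style) via pd.cut and equal-population bins (quantiles) via pd.qcut, plus replacing the raw value with its bin mean. The column name and data are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"sensor": rng.normal(loc=100, scale=5, size=1000)})

df["sensor_width_bin"] = pd.cut(df["sensor"], bins=10, labels=False)    # equal width
df["sensor_quant_bin"] = pd.qcut(df["sensor"], q=10, labels=False)      # equal population

# Replace the raw (noisy) reading with the mean of its quantile bin.
df["sensor_level"] = df.groupby("sensor_quant_bin")["sensor"].transform("mean")
print(df.head())
```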

You have tiny little changes because of measurement error, but if you don't need the exact value, just some level approximation, that's the way you can do it. Also, you can apply different dimensionality reduction techniques to several numerical features to get a smaller representation of the same features. That can be useful; my intuition behind it is, say you have three numerical features: you can use truncated SVD to reduce them to one single feature, and that feature might be useful for tree-based models, because tree-based models don't approximate linear dependencies very well. SVD and PCA, because of their linear nature, can be quite good at approximating that sort of linear dependency for you, so that can be very good support for tree-based methods. Also, for numerical features, you can use clustering: you can just use k-means to cluster them, then you have a cluster ID which you can treat as a categorical again, using a categorical encoding schema. Or, which is very useful, for each row in your data set you can calculate the distance to each centroid, and that gives you a whole set of new features.
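
A hedged sketch of the two ideas above: compress several numerical columns with truncated SVD (a linear projection that trees cannot easily learn on their own), and add a k-means cluster id plus distances to each centroid as new features. Column names and data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["num1", "num2", "num3"])
raw_cols = ["num1", "num2", "num3"]

# One linear component summarizing the three raw columns.
X["svd_1"] = TruncatedSVD(n_components=1, random_state=0).fit_transform(X[raw_cols])[:, 0]

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X[raw_cols])
X["cluster_id"] = km.labels_                     # can be treated as a categorical
dists = km.transform(X[raw_cols])                # distance to each of the 4 centroids
for i in range(dists.shape[1]):
    X[f"dist_to_centroid_{i}"] = dists[:, i]
print(X.head())
```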

Not to mention simple features like the square, the power of two for example, square roots, or even addition and multiplication. We all know that a random forest is a very good approximator: it can approximate literally anything. But to approximate this kind of dependency, a random forest takes a lot of trees to build. If you just provide the squared feature alongside the raw feature, it requires far fewer trees and approximates it much better. That's kind of an introduction to the next part. Basically, we just discussed a simple approach, we had an overview of the tools, of what exactly we can do with features and how we can represent them in numerical form. But the question is how exactly we find the right representation, how we find different feature interactions to help our model converge faster and maybe produce a smoother decision curve, which of course requires domain knowledge.
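
A minimal sketch of such simple arithmetic features: squares, square roots, and a pairwise product. Feeding x squared directly means a tree ensemble no longer has to approximate the quadratic shape with many tiny splits. Column names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": np.linspace(-3, 3, 7), "x2": np.linspace(0, 6, 7)})
df["x1_squared"] = df["x1"] ** 2          # quadratic term handed to the model directly
df["x2_sqrt"] = np.sqrt(df["x2"])         # square root of a non-negative feature
df["x1_times_x2"] = df["x1"] * df["x2"]   # simple multiplicative interaction
print(df)
```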

Feature Interaction: How to Find?

You can analyze the machine learning algorithm's behavior: you can analyze GBM splits, and obviously you can analyze linear regression weights, for example. That's something I'll discuss at the next meetup, on advanced feature engineering. The question is: how can you encode different feature interactions? Well, for numerical features, again, you can apply different mathematical operations. You can also use clustering, let's say k-means, to create some features for you. If you have a pair of categorical features, you can combine them together, treat the combination as a new category level, and then encode it using any of the schemes I showed. If you would like to encode the interaction between a categorical and a numerical feature, for each categorical level you can calculate different statistics of the numerical feature, like the mean, median, or standard deviation. Usually mean and standard deviation are quite helpful; min and max might be, but it depends on the problem you have at hand.
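
A hedged sketch of the two interaction encodings above: concatenating a pair of categorical columns into a new combined level, and aggregating a numerical column (mean, standard deviation) per categorical level. Column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "cat_a": ["x", "x", "y", "y", "x"],
    "cat_b": ["p", "q", "p", "q", "p"],
    "num":   [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Categorical x categorical: treat the combination as one new categorical feature.
df["cat_a_b"] = df["cat_a"] + "_" + df["cat_b"]

# Categorical x numerical: per-level statistics of the numerical column.
df["num_mean_by_a"] = df.groupby("cat_a")["num"].transform("mean")
df["num_std_by_a"] = df.groupby("cat_a")["num"].transform("std")
print(df)
```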

Feature extraction: that's when, given raw data, you would like to extract some features. For example, if you have GPS coordinates, you can download a third-party data set which lets you map the GPS coordinates to a zip code, and from the zip code you can get information about population or other statistics. If you have a time in your data, you can extract the year, month, day, hour, minute, or time ranges. If you have a holiday calendar, which is usually helpful for retail chains, you can add a flag: is it a holiday or not, and what kind of holiday, for example.
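
A minimal sketch of date/time feature extraction plus a holiday flag; the dates and the tiny holiday list below are stand-ins for a real calendar.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2017-11-23 09:30", "2017-12-25 18:05", "2018-01-02 07:45",
])})

df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek

holidays = pd.to_datetime(["2017-11-23", "2017-12-25"])      # illustrative calendar
df["is_holiday"] = df["timestamp"].dt.normalize().isin(holidays).astype(int)
print(df)
```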

Textual Data

As I mentioned before a couple of times, you can also bin your numbers, like age, into ranges, which is usually quite helpful. For textual data, the classical approach is obviously bag of words, but we all know that, so I won't stop here. Of course you can use deep learning as well; in particular you can use word2vec or doc2vec embeddings. The nice thing is that right now you don't have to train anything yourself: you can take pretrained word vectors and use them as they are in your model. You can also, for example, if you have short documents in your data set, transform your words into vectors and then calculate an average vector, and that will be the vector representation of your document. I think that's it, and I can take questions. Thank you.
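
A hedged sketch of representing a short document as the average of pretrained word vectors. `word_vectors` stands in for any pretrained embedding lookup (for example a loaded word2vec or GloVe model); here it is a tiny made-up dictionary with 3-dimensional vectors.

```python
import numpy as np

word_vectors = {                      # hypothetical 3-dimensional embeddings
    "good": np.array([0.2, 0.7, 0.1]),
    "model": np.array([0.5, 0.1, 0.4]),
    "features": np.array([0.4, 0.3, 0.6]),
}

def doc_vector(text, vectors, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

print(doc_vector("Good features good model", word_vectors))
```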

Q/A with Dmitry

Audience Member:

Question on the left side. Lots of questions on the left, which is great. Are you giving a talk at H2O World as well?

Dmitry Larko:

Yeah, I'm doing a talk, actually a little different talk about a different topic.

Audience Member:

Yeah, if you want to know more, there are more talks next week, Monday and Tuesday, at the Computer History Museum not far from here, if you haven't signed up. I think there are probably half a dozen, maybe a dozen seats left; it's nearly fully sold out, seventy people are signed up. There's Q&A as well. Not just Dmitry is speaking again, there are a lot more exciting speakers, it's really interesting.

Dmitry Larko:

Yeah, we're going to have the Kaggle number one actually on our panel. And Marios, Kaggle number three, a former Kaggle number one grandmaster, is going to give a talk and be on a panel as well. I'm going to be on a panel and I'm going to give a talk.

Audience Member:

Before we get to all the questions, can you go back once more on that? You should make a GIF out of that, the interactive feature engineering where you show how the target encoding is happening.

Dmitry Larko:

Leave-one-out, you mean? Yeah, yeah, we could make a GIF out of that actually, that's a nice idea.

Rosalie:

Awesome. Thank you all for submitting questions. If you still want to submit some questions, just use the Slido hashtag; we'll be answering them now. All right, so Dmitry, what I was thinking we could do... and hi everyone, I'm Rosalie, I'm our director of community. Thank you all for coming tonight. Dmitry, I figured I could just ask you the questions if you'd like. Yeah, perfect. Okay, so first question for you: what are the cases where it took forever to compute, and what did you do about it? Are there specific hardware or configurations so it does not take so long?

Dmitry Larko:

To compute what? Let's see. Oh, okay. Yep. Well, it depends exactly what we're trying to compute. If you're just trying to compute something like, let's say, a target encoding, well, in that case usually it's because I wrote shitty code and I just have to rewrite it. If it's, let's say, a model I'm training that I would like to speed up, I might switch to a smaller model, I might subsample the data, or, because I work at H2O, I have access to a cluster. Basically, I can spin up an H2O cluster and run on that. That's actually exactly what I do all the time: I just borrow the cluster from H2O.

Rosalie:

Perfect. Okay, so it says here: is frequency encoding similar to, or the same as, Huffman encoding, which is used in communication systems?

Dmitry Larko:

Yeah, I think so actually. Okay. Yeah, yeah, very close. I think that's exactly how I found it for the first time.

Rosalie:

Perfect. How can one decide which is the best categorical encoding?

Dmitry Larko:

By random chance. I mean, by checking all of them and finding the best one, basically, right? It requires some gut feeling, but from my experience, target encoding usually proves to be the best. Let's say it's the weapon of choice.

Rosalie:

That was my question by the way, I wanted…

Dmitry Larko:

Oh, metrics. Well, obviously it depends; the metric actually depends on the business problem you're trying to solve, it has nothing to do with this. The whole setup is: you have raw data, you try different feature engineering, you fit it to your model, you check it on your validation set using the metric you selected with the business, and that's how you get the results. If it gives a significant improvement, you keep the feature. If it doesn't, well, you try again and again and again. The metric is really mostly a business question. I have a very good example where a company would like to predict something, and at the beginning of the quarter it's actually okay if you over-predict or under-predict, but as soon as you're getting close to the end of the quarter, over-prediction costs them a lot of money.

So they would actually like the predictions to be on the smaller side, but that requires you to design a metric yourself, basically. But again, that has nothing to do with feature engineering; you can use exactly the same features, you just see how your model fits the metric of choice. But mostly, what you can do: let's say you have a binary classification and AUC as a metric. In that case you can encode your categories and immediately check the encoding against AUC, because obviously that's very easy to do. That's something that can be done. The same goes for classical metrics, if you have RMSE for example, and you would like to encode your category. Your categorical encoding, especially target encoding, is a small, simple model by itself, so you can directly measure it if you like. But it's not something I highly recommend, because in most practical cases, even though this simple comparison can look good, the final model can still be bad. There's no guarantee.

Rosalie:

Another question from that attendee: they're curious what to do with a feature with a very large number of categorical values?

Dmitry Larko:

You mean when we have a lot of levels? Basically, you can apply both techniques at the same time: you can do smoothing and leave-one-out. Leave-one-out is extremely expensive, but instead of leave-one-out you can actually do cross-validation. You split your data into five chunks, for example, you compute the encoding using four chunks and apply that encoding to the fifth one, and you repeat this process for all five folds. Obviously, the fastest way is just to apply random noise, which could be a replacement, but in that case you have to design it carefully. I didn't include random noise here for one reason: I don't have a good understanding of exactly what kind of noise to add and what the rules are. That's why I don't have it in my slides.
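
A hedged sketch of the cross-validation variant described in the answer: split the training data into K folds and encode each fold with target means computed only on the other K-1 folds. Column names and the fallback choice are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(df, cat_col, target_col, n_splits=5, seed=0):
    encoded = pd.Series(np.nan, index=df.index)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Means learned on the other folds only, then applied to this fold.
        fold_means = df.iloc[fit_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[enc_idx] = df.iloc[enc_idx][cat_col].map(fold_means).values
    # Levels absent from the other folds fall back to the global mean.
    return encoded.fillna(global_mean)

# Usage (assuming a DataFrame `train` with columns "feature" and "target"):
# train["feature_te_oof"] = out_of_fold_target_encode(train, "feature", "target")
```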

Rosalie:

So many great questions coming in. Thank you all. Another question, it's claimed that neural networks will make feature engineering obsolete. What do you think and why?

Dmitry Larko:

In image recognition, in speech recognition, in what else? In textual data, maybe? Yeah, definitely. Basically, as soon as you have unstructured data like images or sound, definitely, neural nets are the best way: they can design features for you. If you have a structured data set, you work with a database, you have a lot of data which is structured, then not so much actually. Usually they perform poorly compared to tree-based methods, especially tree-based methods empowered by smart feature engineering. You still can use a neural net on structured data, but it still requires you to carefully prepare the features for it. It is an active area of research, I'm pretty sure, but there's nothing definitive yet; we're still waiting for final results.

Rosalie:

Awesome. Any techniques on feature extraction for time series or historical data?

Dmitry Larko:

Yes, a lot of them actually. Mostly what you do is build different features from the history; that's obviously what you can do with your time series data. The key to success here is to carefully apply your validation schema. Let's say you would like to predict something two weeks ahead, for example. In that case you can't use the last two weeks of your data set; your features should be created on the data before that. The key here is that with your validation set you're trying to model the actual real-life scenario. Feature-wise, exponential moving averages are usually very helpful; that means your most recent events are more important than the older events. Basically that's it, given tree-based methods of course, and of course you can apply different techniques like <inaudible>, but yeah, it's not that fun.
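
A minimal sketch of an exponentially weighted moving average feature for a time series, so recent events weigh more than older ones. The shift(1) keeps the feature strictly historical, matching the validation point above about not using future data; the data and span are illustrative.

```python
import pandas as pd

ts = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=8, freq="D"),
    "sales": [10, 12, 9, 15, 14, 13, 18, 17],
})

# EWM of past sales only: yesterday and before, never today's value.
ts["sales_ewm_3"] = ts["sales"].shift(1).ewm(span=3, adjust=False).mean()
print(ts)
```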

Rosalie:

Awesome. I'm seeing: will you post the presentation online? Yes, we will. You can find it on our YouTube in a couple of days. What are the methods of encoding a list of elements?

Dmitry Larko:

That depends. If the list has a sequence, I mean if it's ordered somehow, you can just create the elements as separate columns. If there is no sequence in it, well, obviously you can one-hot encode them. That's something you usually do in marketing; that's how you deal with this kind of events. It's a very common task in marketing: given a set of events you're trying to predict user behavior, or let's say given the user's click events you're trying to predict something about the user. I do have a couple of ideas, but they need to be checked before I can share them. But obviously it's not very easy compared to the problems we discussed.

Rosalie:

Awesome. Can we pile up all these methods together and run the model and find out the best one?

Dmitry Larko:

That's something you can do, although usually it's very expensive. Say you have a data set which has 500 columns and you would like to know which ones are useful. If some of them are categorical, you have to encode them using every categorical scheme I showed you, and then you have to apply feature selection methods.

Which could be, let's say, recursive feature elimination, or you can fit a linear model and use L1 regularization to get the features which have non-zero weights. But no, I don't think it's a good idea. You usually get better results if you do it carefully, because the feature selection methods which are available right now are not very reliable. Like, hey, I found 10 features out of a hundred: that doesn't mean it actually found all of them. You might still have some random features inside, and you can still have some features missing. I wouldn't recommend that. Careful, one-by-one, feature-by-feature work usually provides more stable solutions.

Rosalie:

Great. Can you do target mean encoding with H2O Flow?

Dmitry Larko:

No. We can do it with our product, which I'm forced to work on. I was actually instructed not to mention it.

Rosalie:

Another great question. Thank you all for the questions. I see more just coming in. Have you come across examples where feature engineering was applied in physical security, especially in a domain like banking?

Dmitry Larko:

In banking? Yes. Let's say fraud detection, anomaly detection; they usually apply a different tool set to do the same kind of stuff. We should actually invite <inaudible>; he can talk for days about that, I'm pretty sure.

Rosalie:

I will make a note on my Trello. Question here. Let's see. Is there any good feature encoding tool, for example, a tool to help compute target mean encoding?

Dmitry Larko:

Yeah, I think there are some scripts in the wild for Python. I don't remember the links, but you can find a ready Python script for sure on the Kaggle site; just search Kaggle for target encoding. You can also search on GitHub, do exactly the same thing: just search GitHub for target encoding, and I'm pretty sure you'll find something. If I find a package or reference, I will post the link on the meetup page.

Rosalie:

Awesome. How to deal with complex objects having partially unknown attributes?

Dmitry Larko:

Can I have an example? What exactly is meant by that?

Audience Member:

For your users, right? And we don't know that. Okay, so a user is a person, they have an age, and then...

Dmitry Larko:

Oh, you're talking about missing user information, right? Basically, say we don't have some information about the user.

Audience Member:

For some users we have the information, for some users we don't.

Dmitry Larko:

Yeah, let's say we don't have an age for one particular user but we have it for another. Well, it depends. You can apply different techniques to deal with that. Obviously, the simplest one is to just replace the values with the mean; kind of stupid, but it works in some cases. It also depends: sometimes your missing values are due to a mistake. They're not supposed to be empty, but they're empty; it's a mistake in your data gathering process. Sometimes a missing value can actually mean something. It can carry information: hey, the value is missing, and there could be some reason behind that. In that case you can actually leave missing values as they are, and modern methods like XGBoost can handle missing values pretty easily. You can just pass them in, and on each split it decides where exactly to send all the missing values depending on the loss function; it goes through the tree, and on each split it decides left or right for all the missing values.

It's a pretty powerful technique. In the case of the encodings I described, for categoricals, you can just treat the missing values as a separate category, and the same goes for numerical values: treat them as a separate thing. But one rule of thumb I would try: for numerical features you replace the missing value with the mean, and you also add a binary feature, zero or one, which tells you, hey, that was a missing value. Basically, instead of one column you will have two. Usually that's a good starting point, especially for neural nets and linear models, because you keep both pieces of information: you replace the missing value with the mean so it won't break your normalization schema, but you also keep the information that there was a missing value there, so it's usually very helpful.
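
A hedged sketch of the rule of thumb just described: fill a missing numerical value with the training mean and add a 0/1 indicator column saying it was missing. Column names and data are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan, 31.0]})

train_mean = df["age"].mean()                     # learned on training data only
df["age_was_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(train_mean)
print(df)
```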

Rosalie:

Awesome. Do you prefer certain models for categorical data such as random forests?

Dmitry Larko:

XGBoost mostly, but yeah, again, it depends, right? Let's say in the case of events, like you have a list of events: obviously, what I would do immediately is just one-hot encode everything and fit a model to see what happens. And usually, because events form a very sparse and very wide space, that's enough; it already gives you a very good model.

Rosalie:

Any other questions? I think we're good on questions here. I was actually just going to pass the microphone around if anybody wanted to ask any questions. If you do, just raise your hand and I'll come on over. All right.

Audience Member:

You were talking about combining categories into one if they have almost the same predictive power. It makes sense if your model is completely statistical, but does it mean that you can combine them by the same predictive power only after you've made sure that your features are independent, for example with some preliminary analysis?

Dmitry Larko:

Yeah, so basically that's what the model actually sees in the data, right? Because it always sees the same weight of evidence number, the model itself has effectively already combined them. There's no way for it to distinguish whether this 0.4 value is for category A or that 0.4 value is for category B; it's already done, basically. But it can be a good idea; it can bring you good insight, for numerical features mostly, because it shows you that you can actually combine the bins. Although it does make sense to combine neighboring bins, if you have, say, bin one and bin ten with the same weight of evidence, it's kind of strange to combine them, obviously.

Audience Member:

Like a natural order for them…

Dmitry Larko:

Ideally, yes. But even so, this March at the GTC conference there was a paper about how you can predict anomalies using neural nets, and that's exactly what they did. They binarized the numerical features, but they forced the neural net to always keep the same order for them. They try to learn embeddings, but if the embeddings, for example, should be in ascending order and they're not, they just shift the weights, all the time. It's a kind of strange idea, but they always force the numerical values to keep the natural order they're supposed to have.

Rosalie:

And we're live, all right, who is next? Right here. Okay, I'll have you go first and then I'll have you go.

Audience Member:

When we are using frequency encoding for some feature and then we want to apply the model to predict, then comes the real input: how do we get the frequency there?

Dmitry Larko:

During your training you learn the mapping table. If you have test data, you just apply this mapping, you apply this lookup table to the test data. Basically, this mapping is never changed during the inference process. You learn the frequencies from your training data; if you have a new data set, like a test data set you're trying to predict on, you never recalculate anything, you just use the values you found. This immediately brings up the question: what if I have a new level in this data set? Well, you can use a missing value if your machine learning model is able to handle missing values, or you can just use zero, because you never saw this level in your training data.
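
A minimal sketch of applying a frequency mapping learned on the training set to new data; unseen levels get 0 here (or could be left as NaN if the model handles missing values). The data is illustrative.

```python
import pandas as pd

train = pd.DataFrame({"feature": ["A", "A", "B", "B", "B", "C"]})
test = pd.DataFrame({"feature": ["B", "C", "D"]})          # "D" never seen in training

freq_map = train["feature"].value_counts(normalize=True)   # learned once, then frozen
test["feature_freq"] = test["feature"].map(freq_map).fillna(0.0)
print(test)
```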

Rosalie:

All right, so I'm coming up here for you.

Audience Member:

How would you compare H2O's use of categorical variables, which it can learn from directly, compared to using feature encoding?

Dmitry Larko:

Not sure I'm following your question.

Audience Member:

Well, H2O supports categories as inputs.

Dmitry Larko:

Yeah, but under the hood it just encodes them using one-hot encoding. Or it can do one-hot encoding and then a dimensionality reduction; it's just done where you never see it yourself. And the third way, it can do label encoding. That's all H2O does inside it, you just don't see it. But that's it.

Audience Member:

Okay. I just never quite figured out what they did with deep learning. 

Dmitry Larko:

No, we don't do that. What you're referring to is: hey, we have a lookup table, why don't we just learn the weights? Yeah, that's a common technique in neural nets, but H2O does not do that. It still does one-hot encoding, but it can do it on the fly.

Audience Member:

I think for trees it does use the categorical variable directly?

Dmitry Larko:

No, not directly. You still have to have some numerical representation, because that's how you're able to build a histogram on it and use this histogram to find the split point. There is one approach which tries to use categorical values directly: there's a company called Yandex, and they recently released CatBoost, boosted decision trees, and they state that CatBoost is able to handle categories as they are, like raw categorical data. But I didn't check it myself. The results are kind of controversial: in some cases it performs extremely well, in some cases extremely badly, nothing in between.

Audience Member:

I have a question. You mentioned gradient boosted decision trees, or regression trees. I have a question which is probably not directly related to this lecture, but I think it could be of interest to many people. For example, when I used to work on query-URL ranking, gradient boosted regression trees worked quite well. And recently, about a couple of weeks ago, I tried to use this kind of boosting with deep learning in computer vision. Basically, what I did: after we trained the network, I ran inference on the whole training set, and those pictures that gave the worst results were prioritized, they were randomly selected more frequently, and this didn't work at all. So what is your opinion?

Dmitry Larko:

There was one paper actually about how you can boost CNNs. You can Google it, it's on arXiv, I think it's called boosted CNN or something. There was a specific architecture they tried to learn, and they got pretty good results, but when I tried to reproduce it, it was only kind of okay. I wouldn't say it's best in class. There is another technique you might actually find helpful; I think it's called online hard example mining. That's something you can try for yourself. The idea is quite simple: given the batch, you calculate the loss, you have the gradients, but you backprop only the gradients of the examples which were the hardest ones, not all of them, just the hardest ones. Of course there is the question of how exactly to define which proportion and all that, but I haven't read the paper in detail, I just know the idea.

Audience Member:

But the difference between what you are saying and what I was mentioning is that I tried to determine the worst ones by just doing full inference, while what you're suggesting means we are still running in back-propagation mode, which means there's dropout and things like that, so the loss is not the real loss.

Dmitry Larko:

You know what I think, if you ask my opinion, which isn't supported by anything, just my gut feeling: you see, in boosted trees, each and every tree is different. Basically, it's shaped differently depending on what exactly it's trying to predict. In neural nets, you always keep the same architecture on each step; it's the same architecture, just different weights. I think that's the biggest difference. Yes, it's a different model because you learn different weights, but it's not something that really defines a different model. That's my gut feeling, I don't know; it's just because I did exactly the same thing you did. I tried: hey, what if I just fit a neural net inside the boosting schema, which is kind of easy to do actually, you just apply the method and you're done. Nah, it didn't work at all. I mean, the first one is good and that's it, basically.

Rosalie:

Any other questions? If you raise your hand, I'll bring you the mic. Anyone? Okay. Let me check Slido real quick to see if folks maybe used that as well. Let's see here. Okay, I think that's all the questions, Dmitry. Awesome. Thank you so much for your talk tonight, and thank you all for coming. Thank you.