Automatic Feature Engineering with Driverless AI
Dmitry Larko, Kaggle Grandmaster and Senior Data Scientist at H2O.ai, will showcase what he is doing with feature engineering, how he is doing it, and why it is important in the machine learning realm. He will delve into the workings of H2O.ai’s new product, Driverless AI, whose automatic feature engineering increases the accuracy of models and frees up approximately 80% of the data practitioners' time - thus enabling them to draw actionable insights from the models built by Driverless AI.
- Quotes from Data Scientists
- Driverless AI Model
- Feature Engineering
- Label Encoding
- Driverless AI
- Q&A Session
Dmitry Larko, Sr. Data Scientist, H2O.ai
Read the Full Transcript
Okay, so, thank you again. Thank you for joining us. My name is Dmitry Larko, and I am going to talk about feature engineering and the ways we can automate this otherwise complicated process. Let me start by introducing myself. I am a Kaggle Grandmaster, currently ranked in the top 100 on Kaggle. I have been doing Kaggle for the last five years and am kind of stuck in this process. One of the things I learned from Kaggle competitions is how important feature engineering can be. For the first slide of the presentation I went looking for quotes on feature engineering and how important it is, and I was quite surprised how many famous data scientists share the same view.
Quotes from Data Scientists
So I checked the quotes from famous people about the feature engineering process, and I cherry-picked three guys: Andrew Ng, Pedro Domingos, and Marios Michailidis. I would say these three guys actually defined me as a data scientist, so I thank them for that. A lot of people share the same view: what distinguishes good data scientists from the best data scientists is how exactly they engineer their features. So let's continue. On this diagram you see a common machine learning workflow, or machine learning pipeline, as it usually happens in enterprises. Of course we start with data integration, then we go to data quality and transformation, and ideally we should get a modeling table, which has the features and the target we would like to predict. Then we go on to the process of model building.
Driverless AI Model
And finally we get a model, or models, or an ensemble of models as the output of the pipeline. The place of Driverless AI is basically the last three steps. We don't do data integration; that's a very challenging and complicated task on its own. The same goes for data quality and transformation. So we assume you have a well-prepared dataset, or at least some dataset to start from. And the goal of this presentation is just to speak about feature engineering, which happens in the very first of those steps. So we are not considering fine-tuning of the model or model selection. We just talk about feature engineering and what kind of features can be found in the process.
So what is feature engineering from my point of view? Here is a very simple example of how important feature engineering can be. As you can see, you have two different classes in this space, and these two classes are not separable using any linear classifier: you cannot draw a line to separate them. Basically you have two options. You can implement a more complicated model, for example a regression tree or a random forest, or you can change the features so that you can use a linear model instead. In this particular example that's really possible. If you represent your points in polar coordinates instead of Cartesian coordinates, the points become very easy to separate with a linear classifier. So using the right features helps you build simpler models, or more stable models, or it simply helps your model approximate the current problem more easily.
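A minimal sketch of the polar-coordinates idea above. The toy data (two concentric circles) and the threshold value are illustrative assumptions, not taken from the slide:

```python
import math

# Hypothetical toy data: class 0 lies on a circle of radius 1,
# class 1 on a circle of radius 3. No straight line separates them in (x, y).
def ring(radius, n):
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

points = [(p, 0) for p in ring(1.0, 12)] + [(p, 1) for p in ring(3.0, 12)]

# Engineered feature: the polar radius r = sqrt(x^2 + y^2).
def radius_feature(x, y):
    return math.hypot(x, y)

# In the new feature space one threshold on r (a linear rule) separates the classes.
THRESHOLD = 2.0  # assumed; any value between the two radii works for this toy data
predictions = [int(radius_feature(x, y) > THRESHOLD) for (x, y), _ in points]
labels = [label for _, label in points]
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
```

The model did not get more complex here; only the representation of the points changed.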
What is not feature engineering from this point of view? As I mentioned before, initial data collection cannot be seen as feature engineering. Creation of the target variable is an extremely important step; it basically defines all the further steps, but it's not feature engineering by itself. Feature engineering is usually guided by the selected target variable and the selected metric for the problem. Removing duplicates, handling missing values, fixing mislabeled classes: that's data cleaning. Again, it's an extremely important step, but it's not part of feature engineering. Scaling and normalization is a very common step, especially for neural nets and linear models, but honestly I don't think it can be seen as feature engineering by itself. Feature selection is, again, a separate problem that can and should be done on top of your feature engineering process, but it's not in the scope of this presentation.
So the feature engineering cycle looks pretty much like this. You have a dataset, you come up with a set of hypotheses, you validate them, and of course you apply them. The process is highly iterative: as soon as you find some useful features, you have to repeat all the steps from scratch using the features you found. Again, how you do it is up to you. And before you can start testing, you have to have some candidates, some obvious transformations or obvious feature engineering you usually do. So how can you get this set of hypotheses to start from? Well, of course there is domain knowledge: if you have any about this particular problem, that's an extremely valuable source of information. Then there is your prior experience, how exactly you handled this type of data.
Prior experience is, again, a very valuable source of information. Then there is EDA, exploratory data analysis: a way to visually go through the dataset to see some feature interactions or get some ideas for how you can engineer your features. And of course there is model feedback. For example, you can look at the weights of a linear regressor, or you can check the splits in a tree to see how exactly features interact with each other and which features are actually the most important ones.
Then hypothesis validation. As soon as you generate some features, you have to validate how good and how reliable they are. That's a very interesting process by itself. Of course you can do it in a cross-validation manner. You also have to pick the right metric, and you need to pick the right model to feed your engineered features to. And because it's a very iterative process, you always have to think about how to avoid leakage, because as you continue to check your hypotheses over the very same validation set, you eventually overfit to that validation set. So you should be very careful about how exactly you use the validation set.
So, yeah, that's one of the reasons why feature engineering is so hard. It's very time consuming; you have to run thousands of experiments. And, at least for myself, I cannot visualize N-dimensional datasets in my head, so it's very hard to come up with good feature interactions off the top of my head. It requires a lot of experiments just to find reliable features. Also, of course, if you have domain knowledge, you should always incorporate it into your dataset. And some powerful feature transformations like target encoding can introduce leakage into your data, so you should be very careful when applying them in a real-life scenario.
So again, why feature engineering? Well, it eventually makes your model simpler. Of course, it's okay if you're not using any engineered features and you just apply, say, XGBoost or a GBM model to your raw data. You can get very good predictions anyway, but in some cases it can require you to build a thousand trees, or thousands of trees, to get the same results you could get by building just 100 trees with smarter features. So the right features, the right feature engineering, and the right validation process are what distinguish Kaggle winners from the rest. The key elements here are target transformation, feature encoding, and feature extraction. Of course the naming is not quite settled, but let's start from that.
By target transformation I mean that in some cases, especially in regression tasks, it might be a very good idea to transform your target variable to make it look more normal, let's say closer to a normal distribution, to a bell curve. For example, a log transform, with log base 2 or base 10, might be quite beneficial, or even applying a sigmoid function to your target might make its distribution look more normal. Here is an example from one of the Kaggle competitions. On this plot you can see how transforming the target variable impacts the behavior of different XGBoost models. On the X axis you see the IDs of XGBoost models with randomly selected parameters, and on the Y axis you see the Gini score, which was the competition metric. As you can see, if you apply a log or log10 transformation, the results become more stable across the different parameter settings for XGBoost.
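A small sketch of the log10 target transformation mentioned above. The target values are made-up illustrative numbers, and the helper name `inverse_transform` is hypothetical; the point is only that the model is trained on the transformed target and its predictions are mapped back afterwards:

```python
import math

# Hypothetical right-skewed regression target (all values must be positive for log10).
y = [12.0, 15.0, 20.0, 35.0, 80.0, 400.0, 2500.0]

# Forward transform: train the model on log10(y) instead of y.
y_log10 = [math.log10(v) for v in y]

# Inverse transform: map predictions made on the log scale back to the original scale.
def inverse_transform(prediction_log10):
    return 10.0 ** prediction_log10

restored = [inverse_transform(v) for v in y_log10]

# The transformed target spans a much narrower range than the raw one.
raw_spread = max(y) / min(y)
log_spread = max(y_log10) / min(y_log10)
```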
Most ML algorithms actually require you to provide numerical values. Some of them can handle missing values, some cannot. Some require you to normalize the data, some can handle unnormalized data easily, but almost all of them require you to provide numeric values. That means you have to somehow encode your categorical features into numeric space. One of the easiest and simplest ways to do it is so-called label encoding. It's a very straightforward solution: you just map your categorical levels to integers. Let's say you sort your categories alphabetically and assign zero to A, one to B, two to C, et cetera. The downside of this approach is that you introduce an order into your categories that doesn't actually exist. It might be okay to do this kind of encoding for gradient boosting methods, because even in that setup a GBM can handle the data quite well, but it's not a preferred approach, because you don't incorporate any domain or prior knowledge into your categories. Another widely known approach is so-called one-hot encoding, where you transform your categories into individual binary features. It's very widely used for linear models, neural nets, et cetera.
So, as I mentioned, in this example of label encoding you encode level A as zeros, level B as ones, and level C as twos. For one-hot encoding you get three columns added. If feature one equals A, you have one in column A and zeros elsewhere; if feature one equals B, you have zero for A and one for B. So you create a very sparse matrix, a very sparse representation. Neural nets and linear models can handle that easily, but tree-based models like random forest or GBM don't appreciate it too much, especially if you have 100-plus levels in your categorical feature. In that case the number of columns becomes too big, and randomized tree-based methods like random forest don't like this kind of feature column.
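The two encodings described above can be sketched in a few lines of plain Python. The column contents are a hypothetical example with levels A, B, and C:

```python
# Hypothetical categorical column with three levels.
values = ["A", "A", "A", "A", "B", "B", "B", "C", "C"]

# Label encoding: sort the levels alphabetically and map them to 0, 1, 2, ...
levels = sorted(set(values))
label_map = {level: index for index, level in enumerate(levels)}
label_encoded = [label_map[v] for v in values]

# One-hot encoding: one binary column per level, exactly one 1 per row.
one_hot = [[int(v == level) for level in levels] for v in values]
```

With 100-plus levels the one-hot matrix would have 100-plus mostly-zero columns, which is the sparsity issue raised above.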
What you can also do is encode your features as frequencies. By frequency I mean you just count how many times you encounter this category level in the dataset and divide that by the number of rows in the dataset. You can see it as the probability of randomly picking this particular level out of the column. In this example you encounter level A four times out of nine, so you encode it as 0.44. B occurs three times out of nine, so it's 0.33. And level C occurs just twice, so its frequency is 0.22, and the probability of meeting it in the dataset is 0.22. That's quite a robust approach, because you don't introduce any leakage. But what if you have balanced category levels in your dataset? Like yes and no, for example, each in half of your dataset. In that case both have a frequency of 0.5, and you won't be able to distinguish between them anymore. Your column becomes a constant value.
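A sketch of frequency encoding using the counts from the example above (A four times, B three times, C twice). The helper name `frequency_encode` and the yes/no column are illustrative assumptions:

```python
from collections import Counter

def frequency_encode(values):
    """Replace each level with its share of rows in the column."""
    counts = Counter(values)
    total = len(values)
    frequency = {level: count / total for level, count in counts.items()}
    return [frequency[v] for v in values], frequency

values = ["A"] * 4 + ["B"] * 3 + ["C"] * 2
encoded, frequency = frequency_encode(values)

# The failure mode from the talk: perfectly balanced levels collapse to one value.
balanced_encoded, balanced_frequency = frequency_encode(["yes", "no"] * 3)
```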
Target encoding is a very widely used encoding. Basically you encode your feature using the mean value of the target, of your outcome. In this particular case, for level A you see a positive target value three times and a negative one, which is zero, just once. That means there is a 0.75 probability of meeting a one for this particular level. For level C, for example, you always have ones, so we are almost 100% sure you always get a one if you see category level C. The downside of this approach shows up when you have a high-cardinality feature. If your feature has a lot of category levels and some of them are quite rare, you introduce a sort of leakage into your dataset.
Instead of learning a general pattern, your model will just memorize that, in this example, C always produces ones. So there is nothing more to learn; that's enough to produce a seemingly very reliable outcome, which might not be the case in real life. To avoid this kind of overfitting and leakage, you can use a leave-one-out scheme. The idea is quite simple. Let's say for the row highlighted in green, which has level A, you would like to compute the mean outcome using the leave-one-out scheme. You exclude this particular row and use the rest of the A rows to calculate the mean outcome. For the green row it will be just one, because all the other A rows have one as the outcome. For the row in blue, which has level B, leave-one-out gives just 0.5, because the remaining B rows have the same number of ones and zeros.
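A sketch of plain target encoding and the leave-one-out variant on data shaped like the example above. The A and C rows follow the talk; the exact targets of the B rows are an assumption chosen to reproduce the 0.5 leave-one-out value, and returning `None` for a level with a single row is also an assumption:

```python
from collections import defaultdict

# (level, target) rows. B targets are assumed, see the note above.
rows = [("A", 1), ("A", 1), ("A", 1), ("A", 0),
        ("B", 1), ("B", 1), ("B", 0),
        ("C", 1), ("C", 1)]

sums = defaultdict(int)
counts = defaultdict(int)
for level, target in rows:
    sums[level] += target
    counts[level] += 1

# Plain target encoding: mean of the target per level (uses the row's own target).
target_mean = {level: sums[level] / counts[level] for level in counts}

def leave_one_out(level, target):
    """Mean target of all OTHER rows that share this level."""
    remaining = counts[level] - 1
    if remaining == 0:
        return None  # assumption: a singleton level is left unencoded in this sketch
    return (sums[level] - target) / remaining

loo_encoded = [leave_one_out(level, target) for level, target in rows]
```

Here the fourth row (level A, target 0) plays the role of the green row and the fifth row (level B, target 1) the role of the blue row.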
That's a very strong technique that really helps you avoid direct leakage, but it might also be a very good idea to use some sort of Bayesian smoothing. By Bayesian smoothing I mean a very simple idea. Instead of replacing each level with the mean of that level, you calculate a weighted average of the mean of the level and the overall mean across the whole dataset. The trick is how exactly you calculate the weights for these two parts of the equation. You introduce a function lambda, which depends on how many records for this particular level you have in your dataset. For example, it can be a sigmoid-like function. Here, for the blue line, I have an inflection point around, I think, two. Yeah, it's two. And the steepness of the function is quite high. So as soon as we have more than, let's say, seven examples, we always use the mean of the categorical level. If we have fewer than seven examples for a particular level, we use a weighted average between the mean of the level and the mean of the overall dataset.
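A sketch of the smoothing described above, assuming a logistic form for lambda with an inflection point `k` and a steepness parameter `f`. The inflection point of 2 comes from the talk; the functional form, the value `f=1.0`, and the B-level statistics are assumptions made for illustration:

```python
import math

def smoothing_weight(n, k=2.0, f=1.0):
    """Sigmoid-like lambda(n): weight given to the level's own mean.

    k is the inflection point (lambda = 0.5 at n == k); f controls steepness.
    """
    return 1.0 / (1.0 + math.exp(-(n - k) / f))

def smoothed_mean(level_mean, n, overall_mean, k=2.0, f=1.0):
    """Weighted average of the level mean and the overall mean."""
    lam = smoothing_weight(n, k, f)
    return lam * level_mean + (1.0 - lam) * overall_mean

# (level mean, number of rows) per level; 7 ones in 9 rows overall.
level_stats = {"A": (0.75, 4), "B": (2 / 3, 3), "C": (1.0, 2)}
overall_mean = 7 / 9

encoding = {level: smoothed_mean(mean, n, overall_mean)
            for level, (mean, n) in level_stats.items()}
```

With only two rows, level C sits exactly at the inflection point, so its encoding is halfway between its own mean of 1.0 and the overall mean; with seven or more rows the weight is close to one and the level mean dominates.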
In this particular example we have an overall mean of 0.77, and here is how our encoding changes. Again, we help our future model not to rely on the outcome of one particular level. We smooth it, because we don't want our model to be 100% sure of the prediction. For example, for level C, even though the mean encoding makes it one, we have just two rows with level C, so it's not statistically significant, right? We cannot rely 100% on this information. So with smoothing it goes down from one to 0.88. And actually, let me stop here. I would like to run something for you. I'm going to run Driverless AI on one of the Kaggle datasets. It's the Two Sigma rental listings dataset, and the target is interest level, a three-level variable we would like to predict. I'm using log loss as our scoring metric and fixing the random seed so the results will be repeatable. And yeah, let's keep it working for a while. Oops. Sorry.
All right. So what else can be done? So far we discussed what can be done to categorical features to transform them into numerical ones. You can also do some encoding of numerical features. It's not required, because your algorithms actually expect numerical features; the encoding schemes we discussed previously have to be applied to categorical features just because you have to transform them into numeric space. For numerical features it's optional, although it can be extremely beneficial, especially if the numerical features are quite noisy. For example, you can bin them using quantiles or histograms. With quantiles you have a population of the same size in each bin; with histograms the bins have the same width, so the two provide different binnings. And what can you do with the bins you found?
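A sketch contrasting the two binning schemes just described, using only the standard library. The skewed sample values and the helper names are illustrative assumptions:

```python
import bisect
import statistics

def equal_width_edges(values, n_bins):
    """Inner edges of histogram-style bins that all have the same width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]

def quantile_edges(values, n_bins):
    """Inner edges of bins that hold roughly the same number of rows."""
    return statistics.quantiles(values, n=n_bins)

def assign_bins(values, edges):
    """Bin ID of each value: the number of edges less than or equal to it."""
    return [bisect.bisect_right(edges, v) for v in values]

# Hypothetical noisy feature with one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1000]

width_bins = assign_bins(values, equal_width_edges(values, 4))
quantile_bins = assign_bins(values, quantile_edges(values, 4))

width_counts = [width_bins.count(b) for b in range(4)]
quantile_counts = [quantile_bins.count(b) for b in range(4)]
```

On this sample the equal-width bins put almost every row into the first bin, while the quantile bins stay evenly populated.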
You can replace a numerical feature with bin statistics over a particular bin: a mean, median, standard deviation, maximum, minimum, anything you like. Or you can treat the bin ID as a category level and use any of the categorical encoding schemes we discussed previously. You can also think about dimensionality reduction techniques like SVD and PCA on numerical features. That's extremely helpful if your features share the same nature, for example if you have a bag of words extracted from text as your numerical features. In that case applying SVD or any dimensionality reduction technique can help you build a more stable, or even a better, model. Clustering can also be seen as an encoding of numerical features. You apply k-means with a specific k to identify k clusters in your dataset, and the cluster ID can become a new categorical feature. Or you can end up splitting your dataset into, for example, two or three different parts and fitting separate models to each part. It might also be extremely helpful, instead of using the cluster ID, to calculate the distances to the cluster centers for each row in your dataset. That provides quite useful information for your future model.
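A sketch of the clustering features described above: a cluster ID per row and the distances to every cluster center. It uses a bare-bones Lloyd's k-means written for illustration; the toy points, the choice of initial centers, and the function name are assumptions, and a library implementation would normally be used instead:

```python
import math

def kmeans(points, centers, n_iter=10):
    """Plain Lloyd's algorithm starting from the given initial centers."""
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers

# Hypothetical two-dimensional numeric features forming two groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(points, centers=[points[0], points[3]])

# Feature 1: cluster ID, usable as a new categorical feature.
cluster_id = [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
              for p in points]

# Feature 2: distance from each row to every cluster center.
distance_features = [[math.dist(p, c) for c in centers] for p in points]
```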
So, feature interactions, and here is again a toy example of how feature interactions might be useful. There's a very simple function you would like to approximate with a random forest. In this particular case the depth of the random forest was limited to four, and I built just 100 trees. If you don't put any limits on the random forest and you build a thousand trees, you will approximate this function very well without adding any new features. What I would like to show with this example is that by directly providing some interaction features you can approximate it faster and build a smoother surface, compared to not providing any interactions as new features. In this particular case I just added the square of both variables in my data as new features, and the R-squared score was actually quite a bit better.
I wouldn't say significantly, it's a very simple task, but there was some improvement. So the question is how exactly we are going to find these interactions. Well, again, domain knowledge is the common source; it's basically a silver bullet for such cases. Unfortunately, in a lot of cases we don't have domain knowledge, or we would actually like to acquire some domain knowledge or insights by analyzing this particular dataset. What else can you do? A genetic programming approach is usually helpful for finding mathematical feature interactions; this toy task, for instance, can be solved easily with genetic programming. You can also find some feature interactions using ML algorithm behavior. For example, if you analyze the tree splits in your favorite GBM algorithm, or if you just look at your linear regression weights, you can identify the most important features.
And usually, if you try to model interactions between the most important features, you can find very good features as well. That immediately brings us to another question: how exactly do we model different interactions? Well, as I said previously, clustering and k-means on numerical features can be seen as a sort of interaction. It's a distance-based interaction, which is usually quite helpful. Also, as we discussed previously for categorical features, you can use target encoding for pairs. You take a couple of feature columns in your dataset, treat each unique pair of values as a category level of a new feature, and encode it using target encoding or any other type of encoding we discussed previously. If you would like to model an interaction between categorical and numerical features, you can encode the categorical feature by different statistics of the numerical features.
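A sketch of the two interaction encodings just mentioned: target encoding of a pair of categorical columns, and encoding a categorical level by statistics of a numeric column. All rows and column names are hypothetical, and the pair encoding uses a plain mean for brevity, so in practice it would need the leave-one-out or smoothing safeguards discussed earlier:

```python
from collections import defaultdict
from statistics import mean, median, pstdev

# Hypothetical rows: (cat_a, cat_b, numeric_value, binary_target).
rows = [
    ("x", "p", 10.0, 1),
    ("x", "p", 14.0, 1),
    ("x", "q", 3.0, 0),
    ("y", "p", 5.0, 0),
    ("y", "p", 7.0, 1),
]

# 1) Categorical x categorical: each unique (cat_a, cat_b) pair becomes one level
#    of a new feature, encoded here with the mean of the target.
pair_targets = defaultdict(list)
for cat_a, cat_b, _, target in rows:
    pair_targets[(cat_a, cat_b)].append(target)
pair_encoding = {pair: mean(targets) for pair, targets in pair_targets.items()}

# 2) Categorical x numeric: describe each level of cat_a by statistics
#    of the numeric column within that level.
numeric_by_level = defaultdict(list)
for cat_a, _, value, _ in rows:
    numeric_by_level[cat_a].append(value)
level_stats = {
    level: {"mean": mean(v), "median": median(v), "std": pstdev(v)}
    for level, v in numeric_by_level.items()
}

# New feature columns for every row.
features = [
    (pair_encoding[(a, b)], level_stats[a]["mean"], level_stats[a]["std"])
    for a, b, _, _ in rows
]
```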
For each category level you just compute, let's say, the mean value of a given numeric feature or features. You can also calculate the standard deviation, minimum, maximum, or any statistic you think might be useful. Usually mean, median, and standard deviation are quite useful. Minimum and maximum are not that informative, but again, it depends on the dataset and the problem you're trying to solve. The last high-level topic I would like to mention is feature extraction itself, because sometimes your feature has hidden features inside it and you would like to extract them manually. One of the easiest examples is a zip code. A zip code is a five-digit number, but technically speaking I think you can split it into three numbers: the first two digits, the next two, and the last one, or the next three; I don't quite remember.
But still, instead of having one categorical feature, you can end up with three. The same goes for time. If you have a date, you can extract the time of day, the day of the month, the day of the year, the weekday, et cetera. Again, as I mentioned before, you can bin numbers, for example turn age into ranges. How exactly it's done is up to you, but the result is features that were hidden inside existing features. Obviously, if you have unstructured features like textual data, you have to start with some meaningful feature extraction beforehand, because you cannot apply any machine learning technique before you extract features in a meaningful format. In the classical approach, which has been around for years, you can build a bag-of-words representation.
You can calculate TF-IDF based on word occurrences. You can split the document up, you can apply n-grams, you can do stemming, you can remove stop words. That's the classical way to treat textual data. You can also do fancier stuff using deep learning approaches like word2vec. Actually, to be honest, for text data I don't think word2vec alone is that useful; doc2vec, where you represent the whole document as a vector, is what's really needed by business. But the easiest way to do that, and it kind of works for short descriptions like tweets, is to encode all the words of the document into word2vec space, take the average, and use this average as your document vector. So that's it for the presentation, and let me check where we are with the demo. Okay.
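A sketch of the averaging trick described above for short texts. The tiny two-dimensional embedding table is entirely made up for illustration (real word2vec vectors come from a trained model and have many more dimensions), and the function name `document_vector` is hypothetical:

```python
# Hypothetical word vectors; in practice these come from a trained word2vec model.
embeddings = {
    "cozy": [0.9, 0.1],
    "sunny": [0.8, 0.3],
    "apartment": [0.1, 0.9],
}

def document_vector(text, embeddings, dim=2):
    """Average the vectors of all known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: an all-zero vector when no word is known
    return [sum(component) / len(vectors) for component in zip(*vectors)]

doc_vec = document_vector("Cozy apartment downtown", embeddings)
empty_vec = document_vector("unknown words only", embeddings)
```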
As you can see, I'm going to show you a very short demonstration of what we can do with Driverless AI. If you're not familiar with it, in the lower left corner you see the performance plot, how the model performs against the validation set. Because the metric is log loss, the lower the better. We have some improvements already. In the middle you see a variable importance graph, taken from XGBoost in our case. You can see we have not found a golden feature yet; our best feature is a raw feature, which is price. Oh, yeah, I forgot to mention: we are trying to predict the interest in a particular ad on a particular site, basically how much interest this particular ad will raise on the website. And of course price is the most significant one, because the cheaper your rent, the more attractive this particular post will be.
And of course we can see that latitude and longitude, the location, is one of the most important features, as well as the bedrooms. What I would like to draw your attention to is that we were able to find some target encoding interactions. For example, you have a building ID and a manager ID. These two IDs identify the specific building and the specific manager who posted the advertisement on the website, and they are categorical variables, so we have to encode them somehow. It looks like target encoding of the pair of these features is quite valuable. We also have some text features from the description, which were found to be quite good. What we do currently with the text description is very straightforward: we create a TF-IDF bag-of-words representation and apply truncated SVD on top of it to perform dimensionality reduction. So,
Ah, and that's the frequency of the address. For this particular dataset, the display address can be treated as a text feature, but in this case the algorithm chose not to treat it as a text feature but as a categorical one and then encode its frequencies, and that turns out to be quite a powerful feature as well. I'd like to highlight that the importance score is not an absolute score; it's a relative score. Our most important feature is scored as one, and the rest of the features are scored as a fraction of the most important feature's score. So we see that price is much more important than anything else; bedrooms are just about half as important as price. And basically that's all I have for you today. Thank you for your attention, and we can go to the question and answer session.
So we'll go ahead and start with the Q and A. So here's our first question. How do you avoid overfitting?
Yeah, that's a very valid question. Basically what we do inside the algorithm is apply different transformations to the data and score each newly found dataset using XGBoost with an early stopping criterion. So we try to fit XGBoost as well as possible. We do that over and over again, using the same validation dataset, which can potentially overfit the model. To avoid that we apply a technique quite similar to the reusable holdout technique, which was introduced by Google, I think, a year ago. We try to measure how statistically significant the change in the score is, and depending on that we decide whether or not to show the new score to the individual. If the change in the score is not significant, we return the same previous score. If the change in the score is significant, we return the new score. Based on that feedback we can decide which feature transformation is useful and which is not.
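The gating idea in the answer above can be illustrated with a deliberately simple stand-in. This is not the reusable holdout algorithm and not Driverless AI's actual implementation; it is a sketch that assumes per-fold log-loss scores are available and uses a paired comparison with an arbitrary cutoff `z` as the significance check. The function name and all numbers are hypothetical:

```python
from math import sqrt
from statistics import mean, stdev

def gated_score(previous_scores, new_scores, z=2.0):
    """Report the new mean score only if the change looks significant.

    previous_scores and new_scores are per-fold scores on the same folds.
    Otherwise the previous mean score is returned unchanged, so the search
    gets no feedback from changes that are likely noise.
    """
    diffs = [new - old for new, old in zip(new_scores, previous_scores)]
    std_error = stdev(diffs) / sqrt(len(diffs))
    if std_error == 0:
        significant = mean(diffs) != 0
    else:
        significant = abs(mean(diffs)) > z * std_error
    return mean(new_scores) if significant else mean(previous_scores)

previous = [0.60, 0.62, 0.61, 0.59]

# A change that is within fold-to-fold noise: the old score is reported.
noisy = gated_score(previous, [0.60, 0.61, 0.62, 0.59])

# A consistent improvement on every fold: the new score is reported.
improved = gated_score(previous, [0.50, 0.52, 0.51, 0.50])
```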
Okay. The next question is how do you decide which features will yield the strongest interaction effect?
There are basically two ways we can decide that, because obviously we don't use domain knowledge in the AutoDL part of Driverless AI yet; that's going to be one of our next steps. So currently we have two sources of information: random search and the feedback from XGBoost, basically from our tree-based model. It's mostly an explore-exploit problem. I see the feedback from XGBoost as the exploitation phase: by tracing an XGBoost model we can find pairs, triplets, and four-way interactions between different features, and using other feedback from XGBoost we can decide which transformation might be more beneficial for this particular N-way interaction. That's one way to make sure we don't miss anything.
We also just do a random feature interaction search from time to time. That's our exploration phase. At the beginning we try to explore as much as possible. Later on we try to shift from exploration toward exploitation.
The next, the third question. Yeah, that's something I mentioned in my slides. The question is: are there auto merges between different datasets?
Unfortunately, there are none, because we are not trying to solve this problem yet. It's an extremely hard problem: given a set of data tables, creating the dataset using automatic joins and merges. That's an extremely wide scope, and it's ongoing research, so we're not trying to tackle that problem yet.
So, does deep learning need feature engineering? Well, that's kind of a philosophical question, because the main idea behind deep learning in its current phase is actually to make the feature engineering process automatic, and it works extremely well for unstructured data like images, text, and of course sound. In that case you don't have to create the features yourself; you just use the features the neural net finds for you. It doesn't work the same way for structured data. Technically speaking, we could use a different model, for example an MLP, in our AutoDL and create features that are very suitable for it, but I wouldn't say it's the best approach possible. In the structured field, tree-based methods like XGBoost, LightGBM, or any GBM still provide the best results.
And that's why we're trying to find features which suit tree-based methods, XGBoost, best.
Does Driverless AI separate two types of feature engineering process: one based on the target variable, which can be performed after data partition, and one that does not involve the target variable? Oh, okay. That's an interesting question. So basically, in a Kaggle competition you usually have two data sets: a test data set and a train data set. For train, labels are available. For test you just have the data, you don't have the labels. Given that setup, you can actually use a so-called semi-supervised technique. You can use any unsupervised technique on the whole data set, so you just combine test and train together and create your features. And you can use the train set to create features which actually rely on the target itself.
We don't do that in AutoDL. We don't combine train and test, just because we're trying to recreate a real-life scenario. Usually in real life you don't have a test set available immediately, or at the same time as training. So in that case, these kinds of semi-supervised approaches might be inapplicable at all. So for a Kaggle competition, I would like to try that. For Driverless AI, for a real-life scenario, no, I don't think it can be done, let's say, for all cases. I like this question, actually.
Can I send you a data set to see what feature reduction and feature engineering could happen for this particular data set? Yeah, sure. Yes, please do so. It's actually very nice to test against real-life problems. You're more than welcome to send any data possible. I would like to try what we've done in our product on any given data set.
How do you deal with severely imbalanced data sets, things like claims frequency in insurance data or click rate in a campaign? Honestly, right now we don't have a specific technique for imbalanced data sets. That's on our roadmap. And, to be honest, from my experience I'm still struggling with how to automate this process. A lot of the techniques which are available right now are not applicable to every given problem. Even the common techniques, like up-sampling your minority class or down-sampling your majority class, can be applied wrongly in most cases. You have to be very careful about how exactly you do that. So it's an area of research. I'm looking forward to applying some of these techniques, like SMOTE and up-sampling and down-sampling, to Driverless AI. It's on our roadmap. Right now we don't do anything specific. So if you have a highly imbalanced problem, and an ad campaign might be a very good example, you have to do the up-sampling and down-sampling by yourself before applying Driverless AI. Later we plan to automate that for you.
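As a rough illustration of the manual step mentioned in this answer, the sketch below randomly down-samples the majority class before the data is handed to any modeling tool. The function name, the `ratio` parameter, and the toy click data are assumptions for the example; none of this is part of Driverless AI, and as the answer warns, whether resampling helps at all depends on the problem.

```python
import random

def downsample_majority(rows, labels, ratio=1.0, seed=0):
    """Keep all minority rows and a random subset of majority rows.

    `ratio` is the desired majority:minority size ratio after sampling.
    Assumes binary labels; returns new (rows, labels) lists.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    # The minority class is the one with fewer rows.
    minority = min(classes, key=labels.count)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    n_keep = min(len(majority_idx), int(round(ratio * len(minority_idx))))
    kept = minority_idx + rng.sample(majority_idx, n_keep)
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy click data: 95 non-clicks (0) and 5 clicks (1).
labels = [0] * 95 + [1] * 5
rows = [{"id": i} for i in range(100)]
new_rows, new_labels = downsample_majority(rows, labels, ratio=2.0)
print(new_labels.count(0), new_labels.count(1))
```

Resampling of this kind would normally be applied to the training split only, since it changes the class balance the model sees and therefore shifts the predicted probabilities.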
How important is the choice of the scorer, in this example log loss? Extremely important. Basically, you will get a different feature set depending on what kind of scorer you pick. And the choice of the scorer is a very business-driven decision. In my humble opinion, it's almost entirely a business-driven decision. It depends on what exactly you're trying to optimize. For example, if you care about the order of your predicted rows, you can pick AUC, right? Especially if you would like to make calls to potential churn customers, you would like to rank them somehow, to make sure you call the top one hundred who have the highest probability of churning in the future.
In that case AUC is a good enough choice, or Gini, because it's a ranking metric as well. If you care about some sort of smoother probabilities, log loss is a good choice. Or if you do multiclass classification, because AUC can be done for binary classification, but for multiclass classification log loss will be my number one choice. The same goes for regression tasks. It really depends on the given business problem. For example, if you care very much about the distance between actual and predicted values, you have to pick something like mean squared error, which penalizes the outliers. If you don't care about that, mean absolute error might be okay for you. If you care about the difference between over-predicted and under-predicted values, a metric on transformed values might be an option, I'm not sure, because you just transform your actual and predicted values, and in that case over-prediction has much more weight compared to under-prediction, et cetera, et cetera.
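To make the difference between these scorers concrete, here is a small self-contained sketch with plain-Python versions of AUC, log loss, mean squared error, and mean absolute error. The implementations and toy numbers are illustrative assumptions, not the scorers shipped with Driverless AI; in practice you would typically use a library implementation.

```python
import math

def auc(y_true, y_score):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# Classification: squaring the probabilities keeps the ranking (same AUC)
# but changes the probability values themselves (different log loss).
y = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
squashed = [x ** 2 for x in probs]  # monotonic transform, order preserved
print(auc(y, probs), auc(y, squashed))
print(log_loss(y, probs), log_loss(y, squashed))

# Regression: one large miss inflates MSE far more than MAE.
actual = [10, 12, 11, 13]
small_errors = [11, 11, 12, 12]
one_outlier = [10, 12, 11, 23]
print(mse(actual, small_errors), mae(actual, small_errors))
print(mse(actual, one_outlier), mae(actual, one_outlier))
```

AUC only looks at the order of the predictions, so any order-preserving change leaves it untouched, while log loss reacts to the probability values. That is why optimizing for one or the other can lead to different features being selected.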
So it's a question for the business, but it's worth mentioning that you can potentially get different features depending on what kind of scorer you pick.
How are you dealing with high-dimensionality features? Have you experimented with all the techniques you are presenting? Yeah, I did explain that. Most of the techniques I'm presenting have been used by myself and my colleagues in Kaggle competitions, so they have usually proved to be quite useful. High-dimensionality features: well, again, if we talk about high-cardinality categorical features, I explained that in my presentation previously. Basically we just do careful target encoding using a leave-one-out scheme. We do smoothing to avoid overfitting on these high-cardinality features. And with target encoding you don't have to create new columns.
You still represent the categorical feature as one column, basically. But based on your question, I can ask myself a broader question: how do we deal with a high-dimensional data set? That's a very interesting problem, because it tremendously increases the search space for us. We have to tackle all the features available, and we have to try different interactions between them. Again, using the feedback from GBM or XGBoost helps tremendously in that case, because we just identify the top features and then limit our search to the top features, instead of checking all of them at once. We can also apply different neural-net-based transformations, which we have on our roadmap and which will be implemented in a month or so, to actually consume a high-dimensional data set and provide, let's say, a vector representation of the data.
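The leave-one-out target encoding mentioned in this answer can be sketched as below. It is a minimal illustration under stated assumptions: the function name, the additive smoothing toward the global mean, and the `alpha` weight are choices made for the example and are not taken from Driverless AI.

```python
from collections import defaultdict

def loo_target_encode(categories, targets, alpha=1.0):
    """Leave-one-out target encoding with additive smoothing.

    Each row is encoded with the mean target of the *other* rows in the
    same category, shrunk toward the global mean. `alpha` (> 0) is an
    assumed smoothing weight for this sketch.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    encoded = []
    for c, y in zip(categories, targets):
        other_sum = sums[c] - y      # exclude the row's own target
        other_count = counts[c] - 1
        encoded.append((other_sum + alpha * global_mean) / (other_count + alpha))
    return encoded

cities = ["NY", "NY", "NY", "SF", "SF", "LA"]
churned = [1, 0, 1, 0, 0, 1]
for city, value in zip(cities, loo_target_encode(cities, churned)):
    print(city, round(value, 3))
```

However many distinct categories there are, the result is still a single numeric column, and a category seen only once (`LA` here) falls back to the global mean instead of leaking its own label.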
How do I get away without feature engineering? I hear all the time that neural nets don't need a lot of feature engineering. Do you have any examples? Yeah, sure. So basically you are right. Deep learning does not need a lot of feature engineering. I mean, with deep learning you should avoid feature engineering as much as possible. And it works for unstructured data, like images, text, and sound. That's why there was a huge breakthrough recently in this particular field of knowledge. But for structured data sets, unfortunately, at this very moment feature engineering is not avoidable. Yeah, you can apply Driverless AI, and that's how you actually get away with it, that's right. You can apply Driverless AI and it can find a lot of features for you. But again, in structured data sets tree-based methods are still the winners.
That's, let's say, 99% of the cases, from my experience and from the experience of Kaggle competitions. I think I can recall one competition which was a mix of structured and unstructured data, and that was won by a neural net. But the rest of the competitions are usually won by tree-based methods. And for them you still need feature engineering. Again, as I showed in my example in the presentation, you cannot avoid that. You can still build a very good model without it, but it might take you more time, or, let's say, more trees to approximate. If you have found very good features, the model would require just a handful of trees to approximate and get the very same score. And simple models are usually better, because the more complex your model is, the more fragile it becomes, and that requires you to maintain it more carefully.
When will Driverless AI be available for preview, beta preview? I think it's available right now. Yes. You can just go to the website. Let me try to find it. So, H2O.ai. Which one? I've been there actually. Oh, nice.
Okay. And so, yeah, you just sign up for the beta. We don't record your phone number, which is good actually, but you have to provide your first name, last name, and email address. We need to learn more about you before you can sign up for the beta. That's the comment.
Do we have a roadmap slide? Okay. So basically that's the current architecture of Driverless AI. And what I've explained to you right now is basically somewhere here, in this box. Again, what I've told you right now is just a small piece of the overall picture. We just concentrate on providing good features. You still have to fine-tune your model. You should still be able to build an ensemble.
If you'd like to. Again, to me, actually, one of the most important parts is interpretation of the model you have built, because it's a way to build a bridge between the features found, excuse me, by Driverless AI and your understanding of these features. Sometimes it could be, you know, extremely, I would say, controversial: the feature found might seem meaningless to you from the beginning. That means you have to analyze it and visualize it, and do this type of visualization and analysis to understand why this feature is meaningful in this particular case.
Can Driverless AI run just on a laptop or does it require a server? Well, it depends how patient you are, to be honest. So yeah, it can be run on a laptop. Actually, I do it all the time. But in that case, depending on data size, you have to spend more time, of course, compared to a server environment, but there isn't any, yeah.
Especially because you don't have any multi-GPU setup, you know, on a laptop. Potentially you can have multi-CPU on the laptop. Oh, actually, I heard that one company released a two-GPU laptop, but it weighs something like 16 pounds and it requires a separate battery to use. But yeah, technically you should use a server, because the main goal actually is to reduce the amount of time data scientists spend finding features. So basically you can run it overnight and find some meaningful features, and that will be extremely helpful for the next day.
How does the Driverless AI pipeline look, and how is it organized? So basically, right now the output of Driverless AI will be a pipeline which contains all the transformations needed. As the output of this pipeline, you get the transformed data set you can use to model the data. We will be able to provide you the overall pipeline: not just the data set munging or chain of transformers, but also a model or a set of models. So basically I see that as, you know, a one-stop shop. You just provide raw data as input and you get the score as output. And inside this pipeline there can be a data transformation step, a single model, an ensemble, a stacked ensemble, an N-level ensemble, or whatever complex models you would like to have. That's my vision and understanding of where we are heading.
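The "raw data in, score out" idea described in this answer can be pictured with a toy sketch like the one below. The `Pipeline` class, the two transformation functions, and the hand-written scoring function are all hypothetical names invented for the example; they are not the format or API that Driverless AI exports.

```python
class Pipeline:
    """Chain of feature transformations followed by a model: raw rows in, scores out."""

    def __init__(self, transformers, model):
        self.transformers = transformers  # callables: list of dict rows -> list of dict rows
        self.model = model                # callable: dict row -> float score

    def transform(self, rows):
        for step in self.transformers:
            rows = step(rows)
        return rows

    def score(self, rows):
        return [self.model(row) for row in self.transform(rows)]

def add_ratio(rows):
    # Engineered feature: spend per visit (guarding against zero visits).
    return [{**r, "spend_per_visit": r["spend"] / max(r["visits"], 1)} for r in rows]

def add_flag(rows):
    # Engineered feature: binary indicator for frequent visitors.
    return [{**r, "is_frequent": int(r["visits"] >= 10)} for r in rows]

def toy_model(row):
    # Hand-written stand-in for a trained model or ensemble.
    return 0.2 + 0.5 * row["is_frequent"] + 0.01 * min(row["spend_per_visit"], 30)

pipeline = Pipeline([add_ratio, add_flag], toy_model)
raw = [{"spend": 200.0, "visits": 10}, {"spend": 15.0, "visits": 1}]
print(pipeline.score(raw))
```

Swapping `toy_model` for an ensemble, or adding more transformation steps, does not change how the pipeline is called, which is the appeal of the one-stop design.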
Thank you. That's all the time we have for today, I think. Thank you, Dmitry, for the wonderful presentation. I think there are a lot of questions we couldn't get to; hopefully we'll try to answer them offline. As I mentioned earlier, we will get you the recording and the slides in the next day or two. So there are some upcoming events I wanted to tell you guys about. We have MLconf coming up on September 15 in Atlanta, followed by another in November in San Francisco. We have the Strata conference in New York at the end of September. We are going to be at Money20/20 in Las Vegas in October. We are also going to be at GTC DC in November, which is on the slide. And then finally H2O World is coming up. That's what I go to Las Vegas, please. <Laugh> That's in December. Yes, anything for a retreat, right? So yeah, do check us out, and if you're going to be in any of those cities, come meet us. Talk to us. I'd love to show you a demo of Driverless AI and give you a beta if you're interested as well. So thank you. Thanks a lot, everyone, for joining. Thank you. Have a nice day.
Thank you guys. Bye bye.