Accuracy Masterclass Part 3 - Feature Selection Best Practices
Not all features are created equal. It is tempting to put all available features into the model, but if you have too many features, or features that are unrelated or noisy, this can hurt the model's performance. Finding the best feature subset can be very useful in industrial applications:
• it enables the model to train faster
• reduces complexity of data pipeline
• reduces overfitting
• might improve performance.
Sometimes, less is better.
This is one of the presentations in the Accuracy Masterclass series, and today we're going to talk about feature selection. I'm going to give you a short overview of the topic, introduce a few different feature selection methods, and then summarize and share my thoughts on this.
So what is feature selection? In general, I like to think about feature selection as a technique for dimensionality reduction. And why is it important to reduce the dimensionality of your data? Even from a simple data engineering perspective, you can simplify your data pipeline: if you preselect specific features for your model, you don't have to transfer all the data, only the particular features you need from different tables. That's one of the points. You also don't have to store features your model doesn't use, which means you can reduce storage cost. It also has something to do with the curse of dimensionality: in general, the fewer dimensions you have, the better, especially for unsupervised techniques like clustering. On the other hand, feature selection for clustering is a completely different beast, and we're not going to cover it in this presentation. Last but not least, there is the interpretation of your models: a model is much easier to interpret if it has, let's say, four features rather than thousands. That is quite important to keep in mind as well.
So how can we reduce dimensionality? In general we have two techniques (this is just one way to classify them): you can use feature selection, or you can do feature extraction. What's the difference?
Feature selection lets you pick a subset of the features you have, while feature extraction creates a new feature representation for you. A nice example of feature extraction is PCA, or a t-SNE visualization of multidimensional data in 2D space. You still require all features to be present; you put them through some function that gives you a new representation, ideally in a smaller space. Feature selection, by contrast, tells you which features are useful, so you can completely remove the ones that are not. We're going to cover only feature selection in this talk; we're not going to talk about feature extraction.
There are different ways to classify feature selection methods. One way is to split them by whether or not they use the target variable, that is, supervised versus unsupervised methods. That's not my preferred way; I usually classify them by the way they are designed, and I would highlight three groups of feature selection methods: filter, embedded, and wrapper.
Filter methods are among the simplest in the whole family. The idea behind all filter methods is very generic: you assign a score to each feature, and based on that score you either remove the features that fall below a threshold, or rank them and take the top 5, top 10, or top N. What kind of scores can you assign? A simple one is the variance of the variable: you can drop features that have low variance. You can measure the pairwise correlation between the features in your data set and remove features that are highly correlated with each other. Or you can run a univariate analysis of each feature against the target and measure how good the feature is at predicting the target.
As I mentioned, variance threshold is quite simple: you just need to measure the variance. The idea behind it is that features with a bigger variance contain more information and might be more useful, which is not always the case. It's an assumption, and as always in machine learning, you have to know the assumptions behind the methods you're going to use.
That's true for anything: machine learning models, the validation scheme you use. We all operate under some typical assumptions, and sometimes these assumptions don't hold. It's a pretty strong assumption, actually, to say that a feature with little variation cannot be important. So what's good about the variance threshold method? It's very fast: you just need to run it n times, where n is the number of features in your data set, so it's linear time and a fairly simple measurement. What's bad about it? It cannot account for interactions between features, because you look at a single feature at a time. And it doesn't take the relationship between the feature and the target into account. You analyze the feature by itself, without any connection to the specific data set, target, or problem you have in mind. Why might it still be useful? As a simple example, it allows you to remove constant features from your data set. Sometimes, if you take a subset of a data set, you end up with particular features that are constant. There is no reason to keep them, because any model will ignore them for sure. So constants can definitely be removed: if you set the variance threshold to zero, you remove the features whose variance is zero, and that's true only for constants.
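The constant-feature removal described above can be sketched roughly as follows. This is an illustration, not the code from the slides: it assumes scikit-learn and uses its small bundled digits data set (8x8 images) as a stand-in for MNIST, so the variable names and the data set choice are assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import VarianceThreshold

# Bundled 8x8 digits data (64 pixel columns), used here as a stand-in for MNIST.
X, y = load_digits(return_X_y=True)

# threshold=0.0 removes only the columns whose variance is exactly zero,
# i.e. constant features such as pixels that are always black.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

# Boolean mask over the original columns: True means the column was kept.
kept = selector.get_support()
print(X.shape[1], "->", X_reduced.shape[1], "features")
```

A non-zero threshold only makes sense once all features share the same scale, since variance is scale dependent.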
That's a simple example. I tried to provide code snippets for the techniques I describe today. In this particular case, I know for sure that the MNIST data set, which contains handwritten digits from zero to nine, has a lot of black pixels, and they can potentially be removed from the data set. A tabular representation is not the best way to model this particular problem, but it's still a nice example. What's also important about variance threshold is that it's scale dependent, so you have to make sure the features in your data set are on the same scale, let's say from zero to one or from minus one to one. In that case you can put all your features through the same transformer and assign the same variance threshold. If your features are on different scales and you don't want to transform them to the same scale, you have to analyze them separately.
The next technique among the filter methods is univariate analysis. In simple words, it's a measurement of how good a single feature is at predicting the target. You can use different statistical techniques for that; there is a whole family of such tests. The idea, again, is that you measure each feature individually, but this time you measure it against the target you have. So what's good about it?
Now you're selecting features not just by themselves, but in relation to the target, which means you measure the goodness of a feature in relation to the problem you're trying to solve. Again, it's fast. But again, you don't take the relationships between features into account, and that's the major minus of this technique, because sometimes feature interactions matter. Maybe not high-order interactions, but let's say you have two features that go together, for example day of the week and time of day. Taken together, they let you extract additional seasonality from your data points.
Also, in the general case, a lot of the statistical tests used for univariate analysis assume a linear dependency between the feature and the target. That means if you have a nonlinear, and especially a nonlinear non-monotonic, relationship between the feature and the target, these tests might not be able to catch it. Just something to keep in mind.
Here is one example of how it can be used; there are plenty of different examples in the scikit-learn library. In this particular case, we download the data set and scale it, because the statistic I'm going to use, chi-square, needs non-negative values. I know for sure that the MNIST data varies from 0 to 255, so scaling is not strictly necessary for this data set, but I decided to scale it to make the example complete. Then I select the k best features from the data set based on their relation to the target, using the chi-square statistic.
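A minimal sketch of the univariate selection just described, assuming scikit-learn. The bundled digits data again stands in for MNIST, and k=20 is an arbitrary illustrative choice rather than a value from the talk.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Stand-in for MNIST: pixel intensities are already non-negative here.
X, y = load_digits(return_X_y=True)

# chi2 requires non-negative inputs; scaling to [0, 1] guarantees that
# even for data sets where it is not known in advance.
X_scaled = MinMaxScaler().fit_transform(X)

# Score every feature against the target independently and keep the k best.
selector = SelectKBest(score_func=chi2, k=20)
X_best = selector.fit_transform(X_scaled, y)

print(X_scaled.shape, "->", X_best.shape)
```

Other score functions in the same module (for example f_classif or mutual_info_classif) can be swapped in through score_func.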
The next filter method is pairwise feature correlation. Here you measure how strongly two features in your data set are correlated with each other.
Why is this important? Because a lot of models out there are very sensitive to multicollinearity in your data set. If you have features that are highly correlated with each other, that might hurt your model, in some cases quite significantly. So this is usually a good thing to do. Unfortunately, in some examples from life science, it can be really slow and really hard to do. For example, a human DNA methylation data set contains about half a million columns. If you straightforwardly measure the correlation and build a correlation matrix, you end up with a matrix of 0.5 million rows by 0.5 million columns. That's a huge matrix. I'm not even sure you'd be able to fit it into RAM, and it takes quite a while to calculate.
Again, if you measure correlation, you have to be aware of what exactly you measure and how you measure the dependency between two variables. Correlation measures linear dependency; if you have a nonlinear dependency, measuring correlation may not be a very good idea. But it's fast, and that's why we usually pick these kinds of statistics first. You can of course try to measure nonlinear dependency by fitting a nonlinear model that predicts one feature from another feature, but that's definitely going to take much more time to compute. Here is a code snippet showing how this is done. I was not able to find a library that does it out of the box, so I implemented it myself in a quite straightforward way. We calculate the correlation matrix, and because we don't care about the sign of the correlation, only how strong it is, we take the absolute value. The correlation matrix is symmetric, so we don't need the bottom part of it at all, and I remove it. After that, I find the columns that have a correlation of more than 0.95 with another column, and those are the columns I'm going to drop from the data set.
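The steps above (absolute correlation matrix, upper triangle only, drop columns above 0.95) can be sketched like this. It assumes pandas and NumPy; the helper name drop_highly_correlated and the toy columns are hypothetical, made up for the illustration.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop one column from every pair whose |correlation| exceeds threshold."""
    # Absolute value: only the strength of the correlation matters, not its sign.
    corr = df.corr().abs()
    # Keep only the strict upper triangle, so each pair is looked at once
    # and the diagonal (always 1.0) is ignored.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: "a_copy" is almost a linear function of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_highly_correlated(df)
print(dropped)
```

With this convention the later column of a correlated pair is the one that gets dropped; which of the two to keep is a judgment call.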
So that's the code example. The next family is embedded methods. The basic idea is quite simple: some algorithms and approaches in the machine learning space do feature selection as part of the training process. A good example is decision trees. A shallow decision tree, say with a depth of two or three, is forced to pick only a handful of the most important features from the data set (a depth-two tree has at most three splits), even if the data set contains 1,000 features. A single decision tree with a small depth cannot use all the available features; it has to pick some of them.
What else can we use? One of the simplest and most practically useful approaches is L1 regularization for your linear model. What's so special about L1 regularization? You penalize the model for high absolute values of the weights in your linear regression or logistic regression model. That forces the model to reduce the penalty by shrinking weights, all the way to zero for some of them, which produces a quite sparse model. If you have 1,000 features, you might end up with a lot of zeros in your coefficients, and only some meaningful features will have nonzero coefficients. Because the regularization term directly penalizes big weights, you have to be sure your features are on the same scale. That's one of the biggest reasons why you always use a scaling preprocessing step in scikit-learn with regularized linear models: the features have to be on the same scale.
So what's good about this approach? It's fast, it takes all the features into account at once, and it's relatively cheap. But if your data set contains highly correlated features, then with L1 regularization the model might end up picking just one of them, or it might split the weight between them; it depends on the alpha parameter of your regularization. If you run it in a cross-validation fashion and you have two highly correlated features, you might see the weights switch between them across folds: in one fold the first feature gets a zero weight and the second gets some weight, and in the next fold it's vice versa. That definitely doesn't help your feature importance analysis. So you have to make sure you don't have highly correlated features in your data set; otherwise this technique might return confusing results.
Again, a quite simple, straightforward example. We do need to scale our features first. Then we create a Lasso model from scikit-learn with the parameter alpha. The bigger the value of alpha, the fewer features will have nonzero coefficients; the smaller the value of alpha, the more features will have nonzero coefficients. After the model is fit, you just need to look at the coefficients and pick the features that have nonzero coefficients; those are your selected features from the data set.
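A minimal sketch of L1-based selection along those lines, assuming scikit-learn. The synthetic regression data, the helper name selected_by_lasso, and the two alpha values are assumptions chosen only to show the effect of alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which actually drive the target.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

# The L1 penalty acts on the raw weights, so features must share a scale.
X_scaled = StandardScaler().fit_transform(X)


def selected_by_lasso(alpha):
    """Return the indices of features with a nonzero Lasso coefficient."""
    model = Lasso(alpha=alpha).fit(X_scaled, y)
    return np.flatnonzero(model.coef_ != 0)


few = selected_by_lasso(alpha=5.0)    # strong penalty: fewer features survive
many = selected_by_lasso(alpha=0.01)  # weak penalty: more features survive
print(len(few), "features with alpha=5.0;", len(many), "with alpha=0.01")
```

In practice alpha is usually tuned with cross-validation rather than picked by hand.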
Again, one of the biggest downsides of this approach is that you assume a linear dependency, because it's a linear model. Features that have a nonlinear dependency with the target might end up being dropped from the data set. What else can be done? A random forest can give us feature importances, and using them is a good idea. scikit-learn uses an impurity-based feature importance, and it's quite simple. On each split, you measure how much the impurity decreases; I assume that the better the feature, the more significantly the split improves your score. You also have to weight that by the number of samples that reach the node, because if you greatly improve your impurity score but end up with two samples in the leaf, it may not be statistically significant. That's how the feature importance is calculated. Now you can extract it from the random forest, pick the top 10 features, for example, and that's the end of your feature selection process. So what's good about this approach?
It's able to capture nonlinear dependencies, which is good. But guess what: if you have highly correlated features in your data set, they're going to share the importance. The importance is going to be split between these features, almost equally in some cases. If the correlation coefficient between two features is one, the trees will end up using one or the other, and the feature importance is going to be split evenly between them.
Also, if you have a high-cardinality categorical column (high cardinality means the column has a lot of unique values) and you one-hot encode it, the feature importance for the column as a whole is going to be somewhat distorted, because each individual zero/one one-hot encoded feature is not going to be very helpful for the model. How to represent a categorical feature in your data set, so that you can measure its importance and the model can use it efficiently, is one of the trickier questions. Unfortunately, that's a question for another talk.
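Picking the top features by random forest importance might look roughly like this. It assumes scikit-learn; the synthetic classification data and the variable names are illustrative, not taken from the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 20 features, 5 informative. No scaling is needed for trees.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized so that they sum to one.
importances = forest.feature_importances_

# Sort from most to least important and keep the 10 best columns.
top10 = np.argsort(importances)[::-1][:10]
X_top = X[:, top10]
print(top10)
```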
So, fairly straightforward. Again, we just fit the model. And because it's a random forest, we don't care about the scale of the features. That's one of the beauties of tree-based algorithms: you don't have to make sure your features are on the same scale. It doesn't matter for a random forest.
We fit the model, get the feature importances, sort them by importance score, and pick the top 10. That's how you pick features. You can also apply a threshold, but you cannot do it automatically, because the importances are normalized so that they add up to one. That means your threshold always depends on how many features you have. So if you want to use a threshold, you have to plot or print the table of your features to understand how the importance scores are distributed. A simple alternative is to take the top n minus one, that is, remove just the least important feature from the data set. It can be a good way to improve your model, or at least make it smaller.
Last but not least, wrapper methods. What's good about embedded methods? They already are a machine learning model, so we can optimize the same metric we're interested in. But we have to have access to the model's insides, in terms of coefficients or feature importances. For neural nets, that's not so easy to do: you don't have a single vector of coefficients, you have a lot of weights, and there is no feature importance calculation out of the box. So you have to do it yourself, and that's what wrapper methods are about. You treat the machine learning model as a black box and measure feature performance from the outside. All you need access to is, for example, the model's predictions.
That means that if you have a trained model, you put the features in and get a score, then you do something to these features (I'll show you what a little bit later) and measure the new score. It's a more universal technique, but it may be more time consuming.
Let's start small, with the recursive feature elimination technique, and after that we'll get to permutation importance. Recursive feature elimination still requires you to have access to the model internals, in the form of feature importances or coefficients.
The idea is quite simple. You train the model and somehow get a measure of importance for all the features; in the case of a linear model, you just take the coefficients. You drop the feature that has the smallest absolute coefficient, and you repeat this process until the desired number of features is reached. You can explicitly select the number of features; whether that's a good or a bad thing depends on your point of view, but I think it's a pro. It's not a very expensive algorithm to run if a linear model is used. It kind of takes feature interactions into account, because you are slowly removing features one by one rather than removing a chunk of features at once. As soon as you remove the least important feature, the whole distribution of importances might change, so at the next step you might not remove the same feature you would have removed in the first place.
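The elimination loop described above can be written out by hand in a few lines. This is a sketch under assumptions: it uses scikit-learn's LogisticRegression on synthetic binary data, ranks features by absolute coefficient, and stops at an arbitrary target of five features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=15, n_informative=4,
                           n_redundant=0, random_state=0)
# Coefficients are only comparable when the features share a scale.
X_scaled = StandardScaler().fit_transform(X)

n_features_to_keep = 5                      # the meta-parameter to tune
remaining = list(range(X_scaled.shape[1]))  # original column indices still in play

while len(remaining) > n_features_to_keep:
    # Refit on the surviving columns: importances can shift after each removal.
    model = LogisticRegression(max_iter=1000).fit(X_scaled[:, remaining], y)
    # Binary problem, so coef_ has a single row; drop the weakest feature.
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    del remaining[weakest]

print("kept columns:", remaining)
```

Refitting after every single removal is what distinguishes this from simply ranking the coefficients once and cutting the tail.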
But guess what: as soon as we introduce more complex methods, you also add an additional meta-parameter to tune for your model. In this case, you have to find the right number of features: a number smaller than the total number of features that still gives you performance similar or close to that of the complete data set. It's very easy to do in scikit-learn, because it's already implemented for you. There is a specific transformer called RFE, and you can use it right away. What's also good is that you can interchangeably use linear models and random forests, for example: RFE expects the model to have either a coef_ attribute or a feature_importances_ attribute. If your model doesn't have them, I think there is a wrapper, so you can implement it yourself if you want to. So it's a quite flexible and quite useful approach. Next, permutation importance.
That's an interesting approach, and it's quite straightforward. You train your model. Then you take a validation data set; it's always a good idea to take a data set the model did not see during training, so it should be an out-of-fold data set. If you don't have one and you only have the training data set, that's OK too, but the results are going to be biased in one way or another.
On this validation data set, you measure the performance of your model. Then, one by one, you shuffle a single feature, put the data into the model, and measure the score again. Now you can compare the baseline score on the original validation data set with the new score on the shuffled data. If you shuffle a column, especially an important one, there is no longer any connection between the shuffled column and the target, so you expect the score to drop. The more significant the drop, the more important the feature is for the model. It gives really similar results to random forest impurity-based importance, but permutation importance is model agnostic: all you need is access to the model's outputs. If I build a neural net model for my tabular data set and I would like to measure feature importance, that's exactly how I measure it. It's not the fastest method: because you shuffle the column, you introduce some noise into the measurements, so you have to repeat the process several times and take the mean score. But it's still fairly fast. You can also do the column-drop variant, which is going to be more precise but more time consuming. And again, it's pretty simple.
You train a model on the full data set and measure its performance on the validation set. Then you drop one column from the data set, train a new model, and measure it on the same validation set. Then you drop another column instead, and repeat this process for as many columns as you would like to measure. It's more accurate, but more expensive in general, because you have to retrain the model at every single step. Again, it's model agnostic, which is definitely a pro. You can base it on a metric of your choice; any metric you want is totally fine. And why and how it works is very easy to explain to people: we put garbage in place of the actual values in the column, and the model's score dropped significantly.
So it's very easy to grasp the idea. If we use an independent, out-of-fold data set and compare the feature importances from a random forest model with the permutation importance scores, the ranking of the features might be different. No matter how we train the model, and random forests do a great job of trying to be unbiased, it's still going to be biased toward the data set it was trained on, and that includes the feature importance scores as well.
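To make the shuffling procedure concrete, here is a hand-rolled sketch on a held-out validation set. It assumes scikit-learn and NumPy; the random forest, the synthetic data with 20 columns, and the 5 repeats are illustrative choices, and any model with a score method could take the forest's place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
# Hold out data the model never sees during training.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

rng = np.random.default_rng(0)
baseline = model.score(X_val, y_val)  # accuracy on the untouched validation set
n_repeats = 5                         # repeat to average out shuffling noise
drops = np.zeros((X_val.shape[1], n_repeats))

for col in range(X_val.shape[1]):
    for rep in range(n_repeats):
        X_shuffled = X_val.copy()
        # Break the link between this one column and the target.
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        drops[col, rep] = baseline - model.score(X_shuffled, y_val)

mean_drop = drops.mean(axis=1)        # bigger drop means more important feature
print(np.argsort(mean_drop)[::-1][:5])
```

With 20 columns and 5 repeats this scores the model 100 times, which is where the cost of the method comes from.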
There's the same problem with highly correlated features: the importance score might be split between them. The good news is that it's already implemented in scikit-learn. In the inspection submodule, there is a function called permutation_importance. It asks you to provide the model, the data set, and how many repeats you would like for the shuffling procedure I mentioned before. In this particular case, we have 20 columns and five repeats, which means we're going to run model inference on this data set 20 times 5, so 100 times. Not very slow, not very fast; it really depends on how many features you would like to measure. That's one of the problems with this approach: if you have a lot of features, like in the example I gave earlier with half a million features, measuring permutation importance is going to take a while. Sometimes you can do it the other way around, which is what we'll talk about now. In the previous approach, we shuffled a feature and measured the performance. What if we shuffle the target instead?
Why not? The idea is very simple. We train a model on the data set and the target, and we measure the feature importances; that's going to be our original importance. After that, we shuffle the target, and the target only, and we fit the model on the shuffled target. So we fit the model on a random target: because we shuffled it, there is supposed to be no connection between the samples and the target. Then we measure the feature importances of this model trained on the shuffled target.
If the importance of a particular feature on the shuffled target is higher than its original importance, that's a suspicious feature for us. If a feature is more important for a randomly assigned target, why do you need it at all? A good feature really has a connection to the target, so if you shuffle the target, its importance should drop significantly. If it's the other way around, that's a suspicious sign.
We repeat this process n times, and each time we see the shuffled-target importance of a feature come out higher than its original importance, we add plus one to the feature's score. If a feature gets this plus one too often, we reject it completely. The flip side is that you have to train n plus one models for this approach. Sometimes that's actually faster than permutation importance, especially if you have a wide data set with a lot of features to measure, and if you have GPU training versus CPU-based inference. It may not be a very good environment, but sometimes it's faster to train on a GPU than to run inference on a CPU, and so sometimes it's faster to train the model, say, 100 times than to measure permutation importance for all the features involved. It has the same problem with highly correlated features, because the underlying model suffers from it. And it's not quite a black-box method, because you still need access to feature importances. You could calculate those using the previous approach, but that's going to be quite a complicated scheme at the end of the day.
Here is the implementation of this approach. We define the model and get our original importances after fitting it. Then we shuffle the target and fit the model on the shuffled target to get the shuffled-target feature importances, and we compare the two. If a feature ends up with a big count, that's a bad sign. For example, because I run it 100 times, if I see this result more than 10 times for a feature, I decide to reject it. It can be quite aggressive, to be honest.
But it is an interesting approach, and I use it a lot. In some cases it gives me a better subset of features than any of the other techniques.
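A compact sketch of the target-shuffling procedure, assuming scikit-learn and a random forest as the source of importances. The helper fit_importances, the 20 runs (instead of the 100 mentioned in the talk), and the 10 percent rejection rule scaled down to match are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def fit_importances(X, y, seed):
    """Fit a fresh forest and return its impurity-based importances."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    return model.fit(X, y).feature_importances_


X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)

# Importances measured against the real target.
original = fit_importances(X, y, seed=0)

n_runs = 20
hits = np.zeros(X.shape[1], dtype=int)
for run in range(n_runs):
    # Shuffle the target only: any remaining "importance" is spurious.
    y_shuffled = rng.permutation(y)
    shuffled = fit_importances(X, y_shuffled, seed=run + 1)
    # Plus one for every feature that looks better on a random target.
    hits += shuffled > original

# Reject features that beat their original importance in more than 10% of runs.
max_hits = n_runs // 10
keep = np.flatnonzero(hits <= max_hits)
print("hits per feature:", hits, "kept:", keep)
```

The rejection threshold is the knob that makes this method more or less aggressive.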
To summarize: you cannot have a single technique and that's it. At the end of the day, you have to try several of them, and some will work better than others. As a rule of thumb, by preselecting features with feature selection you can, in some quite rare cases, end up with a better-performing model: if your features have a lot of noise, removing the noisy features can give you a model with a better signal-to-noise ratio. In the general case, you sacrifice some performance in exchange for significantly reducing the number of features needed to train the model. And this sacrifice might not be that big: we're talking about a metric moving between, let's say, 0.84 and 0.81. So it can be quite insignificant, while you can reduce the data set size by an order of magnitude in some cases.
What's also important to remember is that feature selection is a part of the machine learning pipeline. If you would like to measure the performance of your machine learning pipeline correctly, feature selection should be part of the cross-validation, or whatever out-of-fold validation you have. You cannot preselect your features based on the complete data set and then use those features to measure model performance in cross-validation. In that case, your measurements will be too optimistic, because you allow information to leak: your feature selection method has access to the complete data set, to the complete training distribution, and to the complete dependencies between the features and the target, so it might select the features that are best overall. If you do the same inside cross-validation, it introduces some noise, because you always have access to only part of the information.
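One way to keep feature selection inside the validation loop is to put it in a scikit-learn Pipeline, so the selector is refit on the training part of every fold. This is a sketch under assumptions: the selector (SelectKBest with f_classif), k=10, the model, and the synthetic data are all placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=30, n_informative=5,
                           random_state=0)

# Scaling, selection, and the model are fit together on each training fold,
# so the selector never sees the held-out fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Because the selector is a pipeline step, its parameters (here select__k) can be tuned with the same cross-validation machinery as the model's.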
So always consider feature selection to be a part of the machine learning pipeline in general, and especially if you want to measure the impact of feature selection on your pipeline: it has to be cross-validated just like the model. Besides that, some of the feature selection methods have parameters of their own. You might consider optimizing them as well, which leads us to cross-validation again. So it's definitely a good idea to have feature selection as a part of cross-validation in general. That's basically all I have for you today.