Office Hours with Kaggle Grandmasters
featuring Mark Landry
Join us for Office Hours with H2O Kaggle Grandmasters. Office Hours with Kaggle Grandmasters is your opportunity to ask an H2O KGM your burning data science questions, gain insight into winning Kaggle competition strategies, and learn about the latest in data science. Come with questions and curiosity!
Thanks, everyone, for being on here. So some of you have heard of Kaggle and five have participated. So I'm Mark Landry. A little bit about me and, I guess, my Kaggle journey. The Kaggle Grandmaster term didn't exist when I started. I've been doing this for a long time, originally kind of to learn. I wasn't even really a data scientist to start.
So I started in reporting and data analysis, data warehousing. I think I learned some things from there that maybe helped me. After doing that for maybe seven, eight years, I did kind of start learning more and picking up a little bit of data science, and started Kaggle pretty soon. And my very first submission was, as I often say, really easy. It was a medical competition. I was doing medical stuff at the time. And I just used SQL to do it. So a SQL GROUP BY statement produced a CSV. Really simple. I sent it in, and I was about middle of the pack. It wasn't very great. But it was like, okay, well, that was easy. Now, if I try, what can I do? And so I just kind of got hooked from there. There was another medical competition going on at the same point that taught me a lot of things. That was the first one to finish, and I got a medal in that one, so that was pretty exciting. I started pretty quick. I've done quite a lot of them. This is not quite a current screenshot, but it's about the same. So 98 competitions, it's a lot. For a while I would only do about one at a time; it's about all I could handle.
But after a while, especially if you don't have much time, you can get in one lightly and kind of define your own goals and things like that. So that's why the count of competitions is higher than most. But also the medals there: 30 silvers. I had one of the most silvers for a while, and many bronzes, I guess. That's the type of competitions, I guess, I do well in. So this is one of our, I guess, maybe leading into some new stuff. And let me back up a little bit too. So I've been at H2O for about seven and a half years now. I was our first Kaggle Grandmaster, although that wasn't even a term at that point. And I started bringing some friends and collaborators, people I met through Kaggle, onto the team here at H2O. I won't claim to have brought everybody in; it just kind of builds on itself, and it's become a fun place to be for people who do these. In the early days, you know, there was Dmitry Larko.
And I could talk about a customer problem and relate it to a similar Kaggle problem. We know a lot about that, and we'll talk about that. So, for those that have just heard of Kaggle, to start really from the basics: it's a data science competition platform. The nuances have changed over time, and there are a lot of different problems. It's very heavy on deep learning these days; it wasn't so much in the past. But it's very applicable, I would say, to the work world, because they're very realistic problems. And there are so many now. You can find old ones, you can re-compete in an old one and see how well you've done. For me, you know, it's the most fun to compete and to see where you are on a leaderboard. I think it's that gamification effect that hooks many of us that do a lot of them: just watching yourself rising up the leaderboard and thinking, what do they know that I don't know? What can I do? And just poring over it. The answer is not out there, which means you can't read the final solutions while you're doing it. So I think that's why I like to do active competitions.
But what I found when I joined H2O, and maybe even one job earlier, just answering interview questions and things like that, is that I had so much more experience from doing Kaggle competitions than you get in the work world. Traditionally, it's different for different jobs and types, of course. But typically you're working on maybe two or three problems a year, maybe even just one for two years or something like that; I've been doing the product we're doing now for a long time. And so Kaggle gives you a prevalence of problems, all with slightly different variations. And as I like to say a lot, you just open your eyes and look at what's out there, and not just, okay, now I'm going to do this one and try to do the best. There are a lot of nuances that they put into it as well. Why did they choose their loss metric? Why these datasets? What are they doing with this kind of stuff? Most of the time, you're just trying to solve the problem. But all that extra stuff kind of helps, too, because that's where you start relating it to other datasets and other machine learning problems that you might experience in the work world.
And again, when I joined H2O, I was on a lot of calls with different customers, and I could keep up with the datasets they were working with. I just had enough experience going through problem after problem, understanding data, and understanding what you would typically do in these scenarios. It just gives you a lot of experience. Not long-term experience; you can get that in the work world. But to spend two, three, four months on a competition gets you quite a lot. A lot of people don't even spend all that time on it. You jump in for the last few weeks and see what you can do. But yeah, it's helpful, I think, for augmenting the skill set you've learned in any other way, and it's a really great place to practice seeing which algorithms work and which ones don't. There are a lot of methods that the formal data science curriculums, or what came before we called it data science, would be teaching that you just find don't work very well in Kaggle. So I'm not saying Kaggle is everything. It's always competitive, and it's generally big datasets; the work world will give you small datasets. And sometimes you need to prioritize explainability.
But still, I think what you learn in Kaggle really helps give you confidence so that you can solve those, and you can figure out when is the right time to draw back. So that last bullet point sort of accidentally alludes to one of the questions that we had upfront. On one of our panels we had Sanyam, host of Chai Time Data Science. He's interviewed a lot of us, he's been at H2O, and he's a Grandmaster as well. And he asked, what's your AI superpower? He likes to think of these ensembles that you see in Kaggle, and we think of this with AutoML: we've got the deep learning, random forest, GBM kind of superheroes doing their own jobs, and they all come together. So there's a superpower theme that he brings to this. And this is an answer I gave on one of our Grandmaster panels, which we just posted on YouTube, when Sanyam asked, what's your AI superpower?
And I think, to me, it's in essence a preference, but really my strength is towards competitions that involve twists and creativity: datasets that are hard to handle, loss metrics that are hard to handle. That's essentially how I met Marios and SRK, by understanding a loss metric a little better and posting about it through H2O, and I got to meet both of them and compete with them many times over, based on something simple like that. And the second part of that is careful consideration of the train/test differences.
And so this is actually how I landed at H2O: trying to look at a group-by split for how your training and test data was given to you. So you're going to be given a whole batch all at one time, but there's no crossover between batches in your training and test sets. Very realistic. You know, most times, over time, you might be moving around, moving some observations over big sets of things, different products; there are all sorts of ways that this relates to the real world. And so normally you'd use all the experience you learned from a group, but you'll never have that in this case. The reason that mattered was that the group itself was highly correlated. This was trying to find various minerals, I think, or metals, something like that. So we're looking at, like, potassium content. And if you would see high potassium content within a sample of 60, you knew that there was a high likelihood that the rest of that sample was high too. But in the future, you didn't have any examples from that sample of 60. So you can't use that directly. You can leverage that information, you can possibly have algorithms that feed each other; I'm pretty sure I did part of that. But it was important to understand that you had to not just randomly cross-validate your data, because otherwise, if the models can see into that group, they will learn that, and they will understand that they can push the other ones up whenever they see a row from the same group in training. I would just encourage people, if you can use the Q&A, feel free to ask anything.
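To make the batch-aware validation idea concrete, here is a minimal sketch in plain Python. It is not the code used in that competition: the function name `group_kfold` and the toy batch labels are hypothetical, chosen only for illustration. The point it shows is that whole batches go to one side of each split, so a model is never validated on a batch it partly saw in training.

```python
def group_kfold(groups, n_splits=3):
    """Yield (train_idx, valid_idx) pairs where no group spans both sides."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("need at least as many groups as folds")
    # Assign whole groups to folds round-robin so a batch is never split.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        valid = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        yield train, valid


# Hypothetical data: six rows drawn from three sampling batches.
batches = ["a", "a", "b", "b", "c", "c"]
splits = list(group_kfold(batches, n_splits=3))

for train_idx, valid_idx in splits:
    train_groups = {batches[i] for i in train_idx}
    valid_groups = {batches[i] for i in valid_idx}
    print(sorted(train_groups), "->", sorted(valid_groups))
```

With a random row-level split, rows from batch "a" could land in both training and validation, and the validation score would look better than what you get on truly unseen batches.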
So I have a couple of questions that we've had upfront. So I'll start kind of from some of those. But these are always best if they're engaging. Stephen was going to ask some questions, hopefully figure out how to put those in the Q&A.
And so one of the questions, in fact, we didn't even get to on the panel. So I like this question, too. A similar version of it is, you know, how do you use the Kaggle skills at H2O? How's your Kaggle experience been applied in H2O? And there are many different ways. I think, again, it gives me the confidence to start. But really, turning that around: how does Kaggle inform H2O? And you know, we've put a lot of the Kaggle problem-solving tricks into our product. I've been a long-time user of H2O, actually, since I met Arno. Back then I was usually using more of an R program. You have to learn Python these days to run deep learning, but I'm still strongest in R. So I used a lot of R, and R's gbm package 10 years ago was a good one to use to do well in competitions. And XGBoost came up right as I was kind of joining H2O. But I got used to the H2O API, and to this day, I still use it. I like the way that you can see everything immediately. You can launch models quickly. That came before I was there. So for me, taking the Kaggle experience was to improve our GBM algorithm, actually. And so that's sort of a nice, I guess, harmony there. So going in and talking to the CEO and saying we've got to catch up to XGBoost: they were doing column sampling, mainly; that was one of the big features that we were missing. So we got a chance to modernize our GBM. Later, we've actually included XGBoost, so you can pick both, and I often do: I often do an ensemble of XGBoost and GBM, all written in H2O.
So again, even to this day, I just prefer the kind of feedback you get. One nice thing is you can see all your variables, even from the first tree, as they build out. And so a lot of these Kaggle-winning algorithms in the tree world, you know, eventually will take quite a long time. For some of these bigger competitions, I once remember taking my laptop and keeping it open, and driving into the office with it, and walking into the office and taking meetings, because I think it had been running for about 18 hours, and I didn't want to lose it for any reason. So it's rare that you run that long.
So for deep learning, it's not as common, but it can happen. So in that process, while you're building up these trees, you know, think: you spend 18 hours, and then you realize you made mistakes. You get to see the loss; you can get the loss curve on some of the other APIs. But within H2O, I like to see the feature importance. The Kaggle kind of strategy, a common strategy (there's no one way to do it), is that a lot of people prefer to dig into the problem very quickly and get their end-to-end setup. So go from the dataset all the way through prediction and get a compliant prediction. It doesn't have to be great, necessarily; as good as you can manage, but not spending time trying to optimize the algorithm from the start. Don't plan the whole thing, just try to get the motions going. You get multiple submissions in Kaggle, for those who haven't really seen the dynamics there. The minimum I've ever seen is one a day. Most commonly, it's five a day, so you get a lot of shots at it. And in the gamification mode, they're going to score you on data you've never seen at the end, so it's all fairly compliant.
If you want to, you can overfit yourself by just continuing to try to optimize that number. What most people want is a consistent cross-validation scheme, or just a validation scheme really (it doesn't have to be cross-validation), so that you don't really need Kaggle: all your internal results, you should just see them bear out in Kaggle. And so for a lot of us, when you get excited about something better, that's the time to submit, to make sure that everything's very correlated with what you get internally, that your improvement matches the leaderboard improvement on there. So that's kind of the setup. And then, I guess, back to the point of seeing everything quickly: the setup will be to get something started and then try to iterate, ideally as modular as you can, like one or two ideas at a time. Sometimes it can take a while to build, and you don't want to do that, so you might throw in 10 or 20 at a time. But, you know, look at where your model's wrong. Residuals are a common place to look.
So look at where it's the most wrong and try to figure out mentally what's going on there. Is there something maybe your model hasn't caught? Can I create a feature that better helps it capture those? And say you do create something and throw it in your model, and you see that it's no better, and that feature is not being used at all. Then you kind of inspect it: did I mess it up? Does it work? You know, it's essentially a validation check that you got something right.
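As a rough illustration of "look at where it's the most wrong", the sketch below ranks rows by absolute residual. The helper name `worst_residuals` and the toy rows are hypothetical; in practice the predictions would come from whatever model you are iterating on.

```python
def worst_residuals(rows, actual, predicted, k=3):
    """Return the k rows with the largest absolute error, worst first."""
    scored = [
        {"row": row, "residual": a - p}
        for row, a, p in zip(rows, actual, predicted)
    ]
    # Sort on the size of the miss, not its sign.
    scored.sort(key=lambda item: abs(item["residual"]), reverse=True)
    return scored[:k]


# Hypothetical validation rows with actual and predicted targets.
rows = [{"city": "Austin"}, {"city": "Dallas"}, {"city": "Austin"}, {"city": "Houston"}]
actual = [10.0, 12.0, 30.0, 11.0]
predicted = [11.0, 12.5, 14.0, 9.0]

worst = worst_residuals(rows, actual, predicted, k=2)
for item in worst:
    print(item["row"], item["residual"])
```

Reading the worst rows side by side is often enough to suggest a feature the model is missing; if a new feature built from that idea is never used by the model, that is a cue to check whether it was built correctly.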
So that's one reason I use H2O. Another one, the big obvious one, is that a lot of our ideas of the Kaggle kind of winning strategy, as of when we started Driverless AI five years ago, are in there. At that time, if you had a good handle on target encoding, you could do really well in Kaggle. And then it just frees you up to think of the problem, I guess, a little more dynamically, and the models are seeing about as free a feature representation as possible. That's a slight overstatement, but that's pretty much what target encoding was all about. And if that's a foreign term to you, it probably is for many: in target encoding, the target is our answer, the y, however you want to think of that, and encoding is sort of taking all your variables. I use geography a lot, so take a hypothetical one. I'm going to take the city, and if I present a machine learning model with 2,000 cities, you know, I've got a few choices. Providing this as text in general doesn't work for most things; you've got to convert it to a number.
So you can one-hot them, which means you create a column for every one of the 2,000 cities. Or you can do something different: you can rank them, you can do all sorts of different ways of providing it to the model as a number. But one of the cleanest ways to do it, and consistently a technique that just works a lot of the time, is target encoding. This is built into Driverless AI, and you can read about it in the documentation; there are good write-ups of a lot of the feature engineering that goes on in there. I always recommend reading that, just for regular machine learning knowledge. Target encoding is going to take each city and encode it with a number, which is the average of the target. Depending on your loss metric, you might want to keep it there. If your metric is about the median, maybe it's a median target. You can do multiple; it's very common: the max, min, average, standard deviation, you know, common kind of stat sets. But you've got to do it in a clean way. You can't compute it from the same rows the model is going to use; you've got to show it clean data.
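A minimal sketch of a "clean" target encoding follows. It uses a leave-one-out mean, which is only one of several possible strategies and is not claimed to be what Driverless AI does internally. The function name `loo_target_encode` and the toy data are assumptions for illustration; falling back to the global mean for single-row categories is also an assumption.

```python
from collections import defaultdict


def loo_target_encode(categories, targets):
    """Leave-one-out mean target encoding.

    Each row gets the mean target of the *other* rows in its category,
    so a row's own answer never leaks into its feature.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    prior = sum(targets) / len(targets)  # fallback for single-row categories
    encoded = []
    for c, y in zip(categories, targets):
        others = counts[c] - 1
        encoded.append((sums[c] - y) / others if others > 0 else prior)
    return encoded


# Hypothetical data: a city with 4 positives out of 5 rows, plus a singleton.
cities = ["Austin"] * 5 + ["Dallas"]
labels = [1, 1, 1, 1, 0, 1]
enc = loo_target_encode(cities, labels)
print(enc)
```

For a positive Austin row the encoding is 3 out of 4 rather than 4 out of 5, because that row's own target is removed from both the numerator and the denominator. At prediction time on new data there is no target to remove, so the plain category mean from the training set would be used instead.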
So if I'm going to solve the problem for a row in the Austin area, so Austin, Texas, and the average is 80% because it's four out of five for that city (it's pretty small), and my target is included in that four out of five, that's not good. So you've got to be clean about that and remove it, so maybe I make it three out of four instead. If that one's going to be a positive, subtract it from the numerator and the denominator. There are different strategies for this, and we have some of the best in Driverless AI. So that's what the strength of that is. Again, picking up that same routine: validation is everything. And so in Driverless AI, I really encourage you to bring in a train set, a validation set, and a test set. And we will use the validation set exclusively to do all the feature engineering validation: which features are working, picking the best model, which is a component of the best features, and that drives the genetic algorithm. So the validation set is key there, and you want a test set to make sure, even with a validation set, after we've hit it for multiple models, that you haven't overfit that one, so that it's still generalizing. So generalizing is everything. A few questions are popping in; I've been talking a lot. So let's see. Yeah, competing on Kaggle.
"Several years ago, I had to slow down as I couldn't sleep, overthinking a competition implementation. How do you compartmentalize Kaggle? Does it ever get to you?" Yeah, I guess I think it happens. I guess I can compartmentalize well enough that, you know, when you have to do work, or you're with your family, none of those things are there. But yeah, often it was in the back of my head, even during some of those kinds of things, a lot. I remember this time when my wife's parents were in town, and I just had to go on the back deck and go work on something for two hours. I had an idea and I just had to implement it. That was kind of the early days. Now, it just depends. You've got to be realistic about time; that kind of bleeds into some of the other questions on Kaggle. So I don't necessarily have good advice. I think it's good to get motivated by it. You know, I think if I'm lucky, work problems will do that, too. And so I guess I knew data science is interesting. Thinking about things is interesting.
So for a while it was Kaggle. Maybe the work problems were a little more boring, so I made a couple of job shifts, I guess. But you know, the stuff we do with Document AI now, which is what I'm doing right now, is pretty exciting. I do think that's as likely to come up as anything on Kaggle. But I guess, for me, I kind of have to mentally know, like, I'm in it for this one, like I've invested early on. Maybe it's one at a time now. I'll try one, and if I don't latch on to it, in whatever way that can mean, sometimes maybe I have to give up a little bit. So that helps a little bit. But yeah, it's tough. A lot of us are competition junkies a little bit. 98 competitions is, again, not all of those spending a lot of time on them. But on some of them, you know, for maybe three, four months, I'd spend quite a lot of time. People usually say they treat it almost like a second job. So, good question. Next question: existing or new techniques you'd recommend to create models that are more explainable to non-data scientists? Thank you. You're one of 10.
So, yeah, that can be tough. I mean, there's a spectrum of what explainability means, and so satisfying Chris Hogan saying that, but, first of all, we've put a lot of energy into that in Driverless AI. So machine learning interpretability, MLI, is a big component of that. You see that in one of the O'Reilly books, I think it is, written by Patrick; that's a big thing for H2O. And there's no one way to do it, too. So in Driverless, you'll notice that we have multiple techniques that are going to help you try to get to explainability. The spectrum sort of thing, I guess, is that it depends on who needs to interpret it and how precise they are. And so, for example, it was surprising to me a little bit, even in regulated industries, where they do require interpretability. You know, biases are a big deal right now. And the regulators for a long time have been trying to ensure that we're not using bias in our models, everywhere we can find it. That's an evolving capability; we'll all get better at AI as we learn more and more about where bias can creep into datasets and things like that.
But interpretability is one of those common first steps: understanding what the model does. I think for a data scientist, even if you're not trying to be explainable, understanding your models is a good idea. If I've put in a feature, am I ready for that feature? Do I think it's going to generalize? I mean, you want to have your techniques set up, training, validation, test, and to be honest I often roll them through multiple time windows, just to see the stability of that model. Like, get your architecture set up, and then just drop that in and add some data, move it along or expand it, either way. So me personally, we've tried to get something like that into some of the tools before; it's kind of tricky. I guess I just do it by hand a little bit. And it's not about proving the best model. That's not what I'm after; I'm just looking for the stability of my setup, I suppose.
So that's, I guess, explainable for me as a data scientist. But if you're trying to deliver something to somebody else, now you can take it all the way to simple decision trees, or linear models. People really like linear models; they've got nice coefficients. It's kind of tricky: sometimes you've got to work your features a little bit, to straighten out a shape into a straight line, things like that, to get optimal performance, at least out of a linear model. But when you've done that, you do have explainable features. Generally in Kaggle, the more features the better. That's one of the strengths of GBM and, you know, the GBM flavors. LightGBM is kind of the dominant one now, and XGBoost and CatBoost are both commonly used, or GBM; they're similar classes. And one of the things that really frees you up to think, like I said, so you focus on the data and not worry about the algorithm so much, is that it's robust to adding a lot of features, collinear features. That drives people crazy, and if you're in the linear world, it should drive you crazy. But the trick is to just keep firing things in.
So a common Kaggle technique will be to get hold of some good variables, study them, and figure out where they are. And then actually just try some things out: try interactions, try putting some features together. And what that sort of means is you're going from big features with lots of support, I suppose, lots of observations, and you're going to make them smaller and smaller and smaller. Every time you interact things, there are going to be fewer and fewer of those there. And if you compound this with target encoding, you get a whole bunch of collinear features, and it's okay; you just learn to be okay with that. But then when you turn around and try to deliver something explainable, maybe you're not okay with 1,000 features, 200 of which look almost identical to each other. So there are lots of different techniques I'd definitely recommend. Driverless is going to do that with a decision tree. Shapley values are a really common one to understand things like feature importance, to understand where your model's at, you know, what's driving your model. And sometimes you have to take a hit in accuracy a little bit to get that.
So if I have a model with 1,000 features, can I crunch that down to 20 and get within 1% of the answer? It sounds like a small percent, but it's actually very possible sometimes, because that last 1% is what happens in the rest of those features and a big statistical model that's going to just run thousands of trees. But maybe you're better off with those 20 features, or as small as you can reasonably get it, and kind of trade off the accuracy with the complexity. So I guess for me, because, like I said, I started as an analyst, I still think of myself as a problem solver that knows how to do machine learning well enough to get decent results. So I tend to over-analyze the features and over-correct. It's not the best way to do it, but that maybe serves me well when it comes time to go and explain things. Because to understand what's driving my models, I typically will poke at those in a lot of ways with some really simple techniques. Just EDA, actually, is another good thing: exploratory data analysis. There's no common recipe for doing that, but just understand that dataset as well as you can. You know, there are some datasets that are beyond my comprehension.
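The "crunch 1,000 features down to 20 and stay within 1%" idea can be sketched as a simple search over an importance-ranked feature list. Everything here is hypothetical: `smallest_feature_set`, the toy `gains` table, and `toy_score` stand in for a real importance ranking and a real validation score, and the sketch assumes a higher score is better.

```python
def smallest_feature_set(ranked, score_fn, tolerance=0.01):
    """Return the shortest prefix of `ranked` whose score is within a
    relative `tolerance` of the score obtained with every feature."""
    full_score = score_fn(ranked)
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        if score_fn(subset) >= full_score * (1 - tolerance):
            return subset
    return list(ranked)


# Toy stand-in for "retrain on this subset and score it on validation data":
# each feature contributes a fixed amount to the score.
gains = {"city": 0.60, "price": 0.30, "age": 0.095, "noise_a": 0.003, "noise_b": 0.002}


def toy_score(features):
    return sum(gains[f] for f in features)


ranked = sorted(gains, key=gains.get, reverse=True)  # most important first
kept = smallest_feature_set(ranked, toy_score)
print(kept)
```

In a real project `score_fn` would retrain and validate the model for each subset, which is expensive, so people often test only a few candidate sizes instead of every prefix.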
So, you know, there's no free lunch, right? There's no one way to do it. But going through things and exploring whatever those features are, and understanding what it is about that: I get it, okay, I expected that, because I knew that intuitively; that's how I would have solved the problem just guessing without data, essentially. And then just kind of bridging that understanding to where the model is. And so there are a lot of ways to do it. I've done this over a long time. And arguably, you could say it's even like 10 years prior, because that's a lot of what you're doing when creating reports for things, which can be simple business intelligence and stuff like that. But those are your common features; like, a top 10 is not a bad idea for understanding what your model is doing, things like that. Sure.
Here's one we might not have taken up front: what aspect do you learn more in H2O than Kaggle, and what aspect do you learn more in Kaggle? Sure, yeah. I think at H2O, fortunately, because of the situation we're in, we do get (it comes and goes, as with everything) to hear a lot of customer problems. For typical people, I guess in my prior roles, that had been a bit of a struggle. You know, all the data is out there, but you don't have time to go work out a project and really struggle at it. Thinking something can be done is one thing, but really trying to execute it and figure out how that works is another. Framing a business problem: there are a lot of things there that are good to pay attention to. There are probably a lot of answers to this question. Where do you learn more in H2O than Kaggle? I could probably say the business side of the problem. Kaggle has handled all that for you. Like I said, open your eyes and pay attention to what they've done, because that's what you need to know; you're on your own doing that when it's your turn to do it in the work world.
So Kaggle has prepared data. They've thought of a problem that, ideally, makes sense and is worth solving. They've picked a loss metric that helps guide them towards what good results are. In deep learning, some of these as well, the metrics are still interesting in how you optimize things. You'll see that some of the math or an optimization on some of these can be useful and gain you 10, 20, 100 spots. Some people are very good at that aspect. And that's good for winning Kaggle problems. It's relevant for the work world, and may not always be good for the work world too, because if you're exploiting a metric versus optimizing it, there can be a slight difference. On that: have I picked the right metric for the problem? A good example: when I was at Dell, we were doing a problem where I just ran GBM on something and started working on the problem, thinking it out. And after a couple of days I realized that it was surprisingly not better than a model that had been in production, which was barely a model; it was a very small statistical model. We assumed that since we had a lot more data, we could do a better job of fitting that. But it wasn't true until I handled the problem correctly, switching from RMSE as the loss metric to MAE.
And we wound up doing quantiles with that; we wanted to do a few. And so that was the right way to frame the problem. And that was something some Kaggle problems had been exploring: MAE. Some people get all upset about it because it's hard to optimize, and not all algorithms can handle MAE. The H2O ones can, but it's tricky for the calculus on that. But, you know, you can get upset about it, or you can learn from that and understand what it is. Why are they using that? Why does this make sense? When might I use that? And like I said, it took me a couple of days to realize that, and that was while setting stuff up. But until I did that, I wasn't going to make progress on the problem, because the incumbent solution took that into account correctly.
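To see why the choice between RMSE and MAE changes the answer, the sketch below compares the best constant prediction under each metric on a small skewed sample, and adds the quantile (pinball) loss that generalizes MAE. The numbers are made up and have nothing to do with the Dell problem; the function names are just for this illustration.

```python
import statistics


def rmse(y_true, y_pred):
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pinball(y_true, y_pred, q):
    """Quantile loss: under-predictions cost q, over-predictions cost 1 - q."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


# Made-up skewed target with one large value.
y = [1, 2, 3, 4, 100]
mean_pred = [statistics.mean(y)] * len(y)      # RMSE favors the mean (22)
median_pred = [statistics.median(y)] * len(y)  # MAE favors the median (3)

print("RMSE  mean vs median:", rmse(y, mean_pred), rmse(y, median_pred))
print("MAE   mean vs median:", mae(y, mean_pred), mae(y, median_pred))
print("Pinball q=0.5 at median:", pinball(y, median_pred, 0.5))
```

A model trained on squared error is pulled toward the mean and chases the outlier, while one trained on absolute error targets the median. Pinball loss at q = 0.5 is half the MAE, and other values of q target other quantiles, which is how a single framing extends to predicting several quantiles.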
So you know, we were able to beat it at some point. But framing the business problem, and everything about it, matters all the way through. We can treat this as if accuracy is everything, but the accuracy that matters is what your business problem needs, and that may not be the way you can ideally optimize your machine learning algorithm. And I see this a lot with the other Kaggle Grandmasters that we have on the team. Hearing them answer questions with customers, I learn as well, seeing their frame of mind rather than just the solution: hearing their thought process in the middle, and where they go to first. That's one bit I can learn at H2O, and have learned at H2O: just exploring the different personalities and how other people think. And in the repetition, you find that sometimes it surprises you. But really it's paying attention to what's going to make this business problem successful, which may not be the same thing as the metric. You want to align those as closely as possible. But sometimes you don't have a choice; you just can't solve the problem that way. It's a multitude of different things that all must work in harmony.
So for learning more on Kaggle than H2O, I think you can look at, to some extent, Hydrogen Torch. That's where we sort of sharpen our tools and see things used first. You know, occasionally we may innovate something in H2O, but a lot of times we pay attention to Kaggle and see what they're sharing. And Kaggle is wonderful. It's just gotten better and better over time, to where it's a different Kaggle than it used to be, I would say, because everyone can copy the good solutions now. And some people are probably fooling themselves thinking they can do that. And then you get out in the work world and you realize, whoa, I was able to immediately stand on the shoulders of the people who were solving that problem, and I just got a free pass through that section. In older Kaggle that wasn't the case; that sharing wasn't there.
So there's a little give and take from both sides. But where what we learn shows up is particularly in Hydrogen Torch. I know that team: when they see something innovative come out, they work to get it into the tool in the next release, or pretty much as soon as they can. You also get to see a lot of different problem attempts on Kaggle, what works and what doesn't. The papers being produced in deep learning are great now, and a lot of them do work well in Kaggle competitions, but not everything that comes out of the research world will work well on Kaggle. So it's a little bit of seeing where the rubber meets the road.
Next question: what is the recommended process for model retraining, specifically for figuring out whether to include older sample data or use only new data?
Yeah. I guess I hinted at this a little in an earlier answer: even if time isn't necessarily a consideration for your normal train/test split, think time first, and only then prove to yourself that you can do something else. Now, it doesn't always matter. If you're just taking images from the wild, images this year are not going to be largely different from images last year or the year before. But we're typically not deploying solutions against yesterday's data. Sometimes you do, but most often the model needs to work in the wild, in the future, and you want something robust. That's why time should always be your first thought for how to split something. That's not to say you approach it as a time series problem; just consider time. For me, rolling validations are really common: I'll chunk my data out into 12 months and then test on the 13th, or whatever makes sense for the window I'm worrying about. I want stability, and I know those models will not score as strongly. People don't like this in competitions, because a time-based split gives the test set a higher variance.
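As an illustration of the rolling validation described above (train on 12 months, test on the 13th, then slide forward), here is a minimal Python sketch. The function names `month_index` and `rolling_month_splits`, the window sizes, and the toy dates are all assumptions made for this example; the talk does not show any code.

```python
from datetime import date

def month_index(d):
    """Map a date to a running month counter (year * 12 + month - 1)."""
    return d.year * 12 + d.month - 1

def rolling_month_splits(dates, train_months=12, test_months=1):
    """Yield (train_idx, test_idx) index lists for a rolling, time-ordered validation.

    Each fold trains on `train_months` consecutive months and tests on the
    `test_months` that immediately follow, then slides forward by one month.
    """
    months = [month_index(d) for d in dates]
    first, last = min(months), max(months)
    start = first
    while start + train_months + test_months - 1 <= last:
        train_end = start + train_months      # exclusive upper bound
        test_end = train_end + test_months    # exclusive upper bound
        train_idx = [i for i, m in enumerate(months) if start <= m < train_end]
        test_idx = [i for i, m in enumerate(months) if train_end <= m < test_end]
        if train_idx and test_idx:
            yield train_idx, test_idx
        start += 1

# Toy data (hypothetical): one row on the 15th of each of 15 consecutive months.
dates = [date(2021 + (m // 12), (m % 12) + 1, 15) for m in range(15)]
folds = list(rolling_month_splits(dates))
for train_idx, test_idx in folds:
    print(len(train_idx), "train rows, test month", dates[test_idx[0]].strftime("%Y-%m"))
```

Every test row is strictly later than every training row in its fold, which is the property that matters here. Comparing the score across folds also gives a rough sense of how stable the model is from one period to the next.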
So if you have random cross-validation, not knowing anything else about the data, you're much more likely to find that your validation score (on Kaggle, your public leaderboard score) correlates really well with your private leaderboard score. The further you move away from random, the less likely that is, generally speaking, though there are plenty of ways of controlling for that correctly, and some things are out of your control. In a basic sense, say I'm doing something that's not necessarily time series, but I'm working with products. Companies change their products all the time. So if I know really well who's going to be interested in a product that was alive for the last five years, but we sunset it and bring in a new product, you would think all that learning is gone. Intuitively, that's what I always thought: I don't want my model grinding away and spending all of its capacity learning something I'm not going to have in the future. But if anything, on Kaggle I've found that it's not that simple. Sometimes the models can learn things from history that do serve you well in the future. So let the model and the scores tell the story there. Don't throw that data away just yet, even if intuition says to. Even to this day it's an awkward kind of thing for me, but it tends to be true. So it comes down to getting your infrastructure set up.
So for model retraining, ideally you'd have thought about retraining even before that point: let me think about how this thing is going to work. Now, not everyone has the luxury of enough history to have a big data set. That's tricky no matter what, and there's no good solution to it: you just need more data, which is always easy to say and hard to do. But if you have half a million rows across three months and you need to predict the next four months, it's hard to give up one of those months. Either way, there are trade-offs, and it can feel like you just can't give up any of that data. And in fact there are other competition platforms out there besides Kaggle: Analytics Vidhya in particular, and CrowdANALYTIX and Zindi are a couple of others. DrivenData is a really good one.
There are a lot of other platforms out there doing different things, and on one or two of them I've had to write to the organizers and say: you've got to split this more correctly. You're letting everyone get away with bad practices. They're doing random cross-validation, and the solution it's going to produce will be useless. You can overfit to certain things that there's no way you would know at the time you deploy.
So those are the things you want to think about as a data scientist. As for the process for model retraining, the more you can think about it ahead of time, the better. I do think about that a lot. I'm really curious how things are going to work in the future, and I study that in the features. One of the things I like to do, though I don't do it all the time, is to take a dataset and look at the one-dimensional accuracy of each feature, using the best way of encoding it into an average model, or a target-encoded model, really; a mean model, however you want to think of it. So I go through all my features. Let's use the city example. I split my data, by time if I can, but in any case I've got a training set, and I fit a model that takes the average of the target for every city and predicts that. If I'm scoring with RMSE, that's a reasonable way to do it. Then you test on the new data and see how well that one-dimensional model works: what's my AUC doing that? Maybe it's 0.87 already, which is pretty good. I just like to know that. It's always going to get better if you use a machine learning model; I've never seen that not win out. But with the simplicity, you get to feel how powerful those features are before you get to the final accuracy. Sometimes you get excited about a 0.87. You merge that with some other dimension, like the age of users, and all of a sudden you're at 0.91, and then the machine learning model is at 0.95 at the end. At least you know what those gaps are. Where this ties into the question is seeing how that moves over time as well.
So how does my city feature behave, if I have enough data and I train on city in the same way, as a single one-dimensional model? What I usually like to do is just loop through the features, let the computer do its thing, and observe a ranked order of my most predictive features directly on the problem I've got to solve: not correlation with the target, not some other statistic, but the actual metric. You can do that. Then it's seeing where that goes and how stable it is over time.
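The one-feature-at-a-time ranking described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`fit_mean_encoding`, `auc`, `rank_features`), the `device` column, the toy rows, and the choice of AUC as the metric are invented for the example, and unseen categories fall back to the global training mean.

```python
from collections import defaultdict

def fit_mean_encoding(values, targets):
    """Learn the average target per category, plus the global mean as a fallback."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    global_mean = sum(targets) / len(targets)
    return {v: sums[v] / counts[v] for v in sums}, global_mean

def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rank_features(train, test, features, target="y"):
    """Score each feature on its own: encode on train, predict the later test rows."""
    results = []
    for f in features:
        enc, fallback = fit_mean_encoding([r[f] for r in train], [r[target] for r in train])
        preds = [enc.get(r[f], fallback) for r in test]
        results.append((f, auc(preds, [r[target] for r in test])))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Hypothetical toy rows: `train` is the earlier period, `test` the later one.
train = [
    {"city": "A", "device": "x", "y": 1}, {"city": "A", "device": "y", "y": 1},
    {"city": "B", "device": "x", "y": 0}, {"city": "B", "device": "y", "y": 0},
    {"city": "C", "device": "x", "y": 1}, {"city": "C", "device": "y", "y": 0},
]
test = [
    {"city": "A", "device": "y", "y": 1}, {"city": "B", "device": "x", "y": 0},
    {"city": "C", "device": "x", "y": 1}, {"city": "D", "device": "y", "y": 0},
]
ranking = rank_features(train, test, ["city", "device"])
for name, score in ranking:
    print(name, round(score, 3))
```

Running the same loop on several consecutive time windows, and watching whether a feature's score holds or bounces around, is one way to look at the stability over time that the answer goes on to discuss.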
So you measure it over a window and watch it go from 0.87 in one period to maybe 0.82 in the next, like it's bouncing around a lot. That's another part to be aware of. It'll point you to where you may not have stability, and then, looking ahead at retraining or promoting that model, what do you have to do? Are you okay with that? So I guess that's one answer. That one definitely speaks to a lot of my preferred process for working through problems.
Next question: what things cannot be learned on Kaggle but are useful and should be learned by data scientists, from an industry point of view?
Yeah. All the way down to things like communication. There are lots of skills here, and lots of different company roles, too.
So there's just no one answer to anything. I've said that, what, three or four times now. Where I mean that here is that we work with some companies big enough that they have modelers who get to spend most of their day modeling. I think for some Kagglers that's what you want; that's your dream job sometimes. But in other places, you're at a small startup. That's exciting, and it's a new problem, but you're the data scientist and the data engineer and some other things. Or even if you have a team of data scientists, sometimes you wind up spending your time on other parts of the problem.
So a lot of that Kaggle is not necessarily going to teach you. Again, keep your eyes open and maybe you can learn a little bit about it, but the organizers aren't really going to tell you; you just have to kind of guess. That's what I find myself doing a lot: why would they do this? Sometimes it may not have been done for any good reason, but I'm still going to think it through. It's an interesting thought exercise, figuring out why it's there. I would find this in a lot of competitions: they write this nice write-up that seems interesting, and then you look at the problem and something's not quite as pristine as that. Even just finding the discrepancies is worthwhile, because that's what they're going to be asking of you in the work world.
So software engineering would be another side of it. With Kaggle competitions now, that pushes the level of difficulty up, and unfortunately it sort of spoils my old answer about that very first Kaggle submission being simple. Kaggle used to be just a CSV, and it isn't always anymore. That's for good reason, but it's unfortunate. It means you do have to be thinking at least slightly in software engineering terms to get a model executed. You've got notebooks and people helping you through this process, for sure. But that's not to say you can't get away with not doing good software engineering and still be good at Kaggle. I'm not a great software engineer myself.
But when efficiency matters, it's about knowing that and looking for it. I'm not that great at it. Kaggle didn't teach it to me; I picked it up by observing what other people do, but there are just so many people who can do a much better job than me at those things. There's a lot to the software engineering side. There's the efficient, elegant, good coding that you typically associate with software engineers, but it goes further than that: CI/CD pipelines. In the work world, this stuff matters. The previous question about model retraining was a good one: you've got a model live and you've got to retrain it, so what are your processes around that? There's the MLOps world, which has been a term for the last several years now: deploying, having a model repository, all of this. The engineering side is becoming more mature, and there are established ways of doing these things. We've got nouns that we didn't have five years ago.
So a lot of those are not relevant on Kaggle. That's nice, and it kind of pares everything down to problem solving and putting together the best algorithm. But to win a competition, you do have to get it running within a certain time. And sometimes you have big problems: you'll see half the field drop out on the first rescore in some of these competitions. I've seen that a couple of times, where people blow their memory limits because they didn't project how big the data would be, even though Kaggle was trying to tell them.
So some of these things you can learn there, but I'm not sure what the recommended path is or how you'd find them. The nice thing about joining Kaggle is that I know about the cases I just spoke about because I was part of them. I'm sure it has happened many, many times, but you know the ones you were part of, where it's like: hey, there were 1,000 people here, why are there only 400 left? 600 of those people had their notebook blow up.
So I guess a lot of that can feel like a bit of a slog when you're doing it in the work world. But that's one of the things to maybe gravitate towards. And then, as I started with: you could never say enough about communication in those roles that involve it, which is most but not all. Communication is always a good skill: understanding, trying to get at what the person you're working with, usually the business person who has the problem, really needs, and figuring out what's really driving it.
And if you solve that problem, will it be good enough? Sometimes you meet halfway, and that can be dangerous, because they've given in on what they really need, and you're trusting that their giving in was correct. I've seen this several times: you solve the problem you had agreed on, and it turns out that wasn't good enough. "Well, we did hit this accuracy." "Fine, you've got this at 90%."
"But we're still missing all this stuff." "Oh, but we agreed to move that out of scope." "Well, I can't have that." So things like that: typical project management and software engineering. Those are big, giant domains, with books and books written about them. But I guess I would say that for a data scientist, one or both of those two things are very important. And you can get away with not being good at those, I would say.
Next question: do you think anybody could achieve the Master or Grandmaster tier on Kaggle with enough time and effort? And can the creativity that Grandmasters have be learned?
It's tough to say. I think a lot of us like to think yes. I'm not sure about that; maybe I'll just speak for myself. On longer thought, yes, in the sense of: could you train somebody, could you have them working through the same things? But eventually, would that creativity come out of the training? Can you just let somebody loose to go after these problems? No. But with the right guidance, maybe. I suppose I think it's a set of things, and how brains learn is important in all of that. So it's not always about creativity.
So I think creativity is, I guess, my weak spot, in that I prefer it. Maybe I'll elaborate on that a little more. The way I like to solve these problems is to think them through mentally and look for creativity and elegance, look for that golden feature. But I lose sight of the fact that if I had spent some or all of that time working on 100 features, I would have found three or four that worked about as well as the one, plus sometimes I never find it. That happens a lot. So that's something I've learned a lot about from teams on Kaggle as well.
Actually, a lot. Brandon and SRK, my most common teammates, both have an aspect that I don't have, and I can't catch up to it. I know what they're doing, and I'm sure I've gotten better at it, but it's what they do. They're both great at thinking of things and being clever as well. But the thing they definitely do better than I do is this: when you have a good idea, don't just do it once. The engineering side of feature engineering is rolling it out to everything. Don't be scared of volume, even if it takes a little while to run. I typically work more one feature at a time.
So that's a bit small-minded of me, but I have things to learn, as everyone does, I'm sure. A few of the things I've talked about come up when I see people do it wrong, even if they invest a lot of time in it. They dismiss the tough lessons that are hard to generalize, and they make excuses: "I would have got that if just this or that." But you didn't. Facing the reality of a competition leaderboard is a good thing, if you can internalize it the right way. And there's no one right way, as I keep saying, but I think I learned a lot from that. Early on, when I was learning most aggressively on Kaggle (I don't do this as much anymore; occasionally I'll pull up a few solutions), every time a competition finished you would find write-ups from half of the people who beat you. Half of the top 10 would write one, or sometimes almost all of them.
And if you read them, it's often a mix of subtle things and some clever things, where you think: I would not have gotten that, but I can learn from it, and maybe I can try to get that next time. But sometimes you look at it and think: I really thought I would have had that, but I didn't. They got it because they went one step further, or just one step different. So I guess it's the right combination of trying to view Kaggle from an educational standpoint and also trying to do well on it, because the two reinforce each other. And in looking forward and always trying to learn, you find the same thing in our very strongest people, those who have been, you know, first or in the top three.
Among the Grandmasters we have people who have been number one, and that's a whole different thing, and even they will pick problems that they want to learn from. For Philipp and Dmitry, that was kind of their way of choosing new problems: ones that were doing something different for them. Sometimes they were intrigued to solve something different from what they had done before. Nobody knows everything, but even the people who seem to know more than others still seek to learn. If anything, my own progression has probably been cut off low because I haven't done that as much. So can anybody still do it? I guess it's a prediction. I'd say a large majority could, if you put the right time in and can land on some good ways of doing it, things that work for you, and keep making progress. I don't have that training regimen and can't give it here, but I think a lot of people probably could. Is there still something different, though? I guess there are some people who just don't take in data well. I've run into some of these people and maybe failed to acknowledge that for a while. For me, I just think naturally in terms of spreadsheets, practically: data tables, things like that. I'm looking at them all the time, and my mind is geared towards that a little bit.
So I'm sure that helps me in ways that other people have to overcome or gain, and there are similar things for me too. So I would certainly guess that over half of people could, as long as you are motivated to keep doing it. I suppose I would say the Grandmaster title can be earned: if you learn enough to join teams, you learn from teams, and teams help you get gold medals. There's no denying that. But you've got to do solos for the Grandmaster tier, less so for the Master tier, and the Master tier is worth exactly what it is. I know that when I got the Master tier, it was the only thing that existed. I know exactly where I was at the moment it turned over; it was so exciting. I had been so close in so many competitions before that: 12th, and a lot of 15ths. And then I finally got the third or the second one of these. Yeah, a challenge.
And yeah, I think it's just plugging away and learning and always trying to improve. I do think that probably anyone with enough motivation to stick with it can earn a Master tier as well. But you've got to not be closed off to learning, and always try to improve. If it's a little harder to get started, acknowledge that and just keep trying to define your own goals.
Next question: some techniques rank feature importance in relative order, but the features may not really be important compared to a continuous random or discrete random feature. Besides permutation feature importance, would you recommend other techniques for identifying features that are truly important?
Yeah, that's tricky.
Because, again, ranked feature importance is a good one, provided we're using GBM models or trees; maybe you like random forest better, or something else. For a lot of tabular problems (and to acknowledge this, at least: deep learning is coming, and at some point deep learning will take the tabular world over), as of now it hasn't. I don't think there's a better way to prove that than Kaggle, where you can still do well with GBM models. As transformers take off and these problems get framed in deep learning frameworks, the new architectures will eventually probably get to where deep learning can take these over, but it hasn't yet.
And so when you're using these tree models, feature importance still really helps. It is definitely a great tool for starting to see what's going on with my model. But you also have to be aware of collinear features. Like I said, we're not scared of collinear features, but that's not to say the tree models are going to agree with you in their feature importance, because collinearity means they can pick and choose either one for most of the splits. So take collinear features, and let's go back to the city example once again. I'm going to have a target encoding of the city, for all the cities in my dataset, as a feature, and I've seen that it's a good one.
So I'm going to just grind away at that. Now I'm going to interact city with age, like I said before: age, gender, date features, whatever is in your dataset, your common kinds of features. Interact those and keep digging deeper and deeper. You'll probably see these features' importances lift up, but it'll smooth the whole thing out a little, and for your previously good feature, the one actually doing the work may no longer be the city encoding itself but an interaction of it, or something else. So I think it's just going after it, and on the MLI side of it, understanding what's making up that model. With trees, you don't have to cut the least important features either, and I'm not a fan of that necessarily. A common one: if you have 1,000 features, take all the ones that score zero and throw them out. And it's amazing how often your model will get worse.
Now, in the work world, if explainability matters to you, almost even at all, then you're probably fine taking that tiny hit in the fifth decimal place. On Kaggle, we're competing on decimal places many times. That's why it's so impressive when you see the first-place person win by a huge margin, or the top few break away from the pack. That happens too. But wherever you are on the leaderboard, you're going for big gains, and chances are you're going to get small gains most of the time on your way there.
So in the end, on Kaggle that last decimal place can matter, and so it's maybe still worth having an intuitive idea of what your model is doing, where the features are, and which ideas you haven't explored. As for my process, there's not going to be a great answer to this one; it's just my own answer. But one of my favorite things when working with teammates is just listing ideas that you can try and slowly, over time, executing them. That's no different from normal project management, particularly product management: what do we want this product to do?
Let's make sure we get the basics into the first release, and then add the features we've always wanted. And sometimes you realize later on that you don't need a feature. I think the same thing is true of these datasets. So it's getting the basics in order and getting a model whose performance you like, where you think it's right. That's harder when you don't have Kaggle, with everyone else competing against you. They tell you what right and wrong is, to some extent. To a large extent, in the work world you don't have that other challenger model to push you. So you might have thought an F1 of 0.81 was great, and not realized that 0.91 was achievable. Some of those things can happen.
But back to that intuitive sense of the model. So it's getting the basics up and running, and maybe that's throwing all the features in there. My own approach differs a bit with that one-dimensional method we were talking about: I'll usually have both. I've got a GBM that's fit on probably all the features, to be honest, and then my little target encodings working alongside it. At some point I take the best of those, and that informs me what to focus on and how to get more value out of it. As for which features are important, with trees you're not too concerned about that.
So continuous random, discrete random: to be honest, I don't pay attention to that. The trees don't care. It's a little better to get features into an ordered shape if you can, but that's not what you focus on. Saying I don't care isn't really true; it's just that I don't care about those specific terms. I trust that my models are going to cut that distribution really well. I don't have to straighten out a U-shaped curve like I would in a linear model. They're going to cut it to pieces, and eventually they're going to make those straight lines. We can trust that XGBoost, LightGBM, GBM models in general will do that. Fairly reliably, 99.9-something percent of the time, they will straighten those features out if they're important. So I would trust the models to find them. That's the first part, I suppose.
But then, with feature engineering, there may be a better way to represent them, and that's really where the work is. Again, there's still no right answer for anything. But you can look at your residuals to find the rows where the model is performing worst and try to figure out what it is about them that you can address. Is there a better representation of the features that are driving this? If you can get your head around it and realize what it is, that sometimes comes pretty quickly. Sometimes you can't: you just look at the data and think, I have no idea; the model predicted 92%, that seemed like a good prediction, but the answer was false. It helps if you can get your head around it, or understand the way the features work, or throw a less powerful model at it. Again, working my way up (I don't always do this): what did my simpler model do on the same rows, and what's the difference? Is this model overfit or not?
Though my particular strategy is not necessarily to worry that it's overfitting in that way; it's probably something I just haven't shown the model correctly. This came up in one of the talks. All these GBM models are using decision trees underneath. You can use anything as the base learner, and occasionally you can set a GBM up with a linear model, but typically it's a decision tree, and what decision trees do is really simple. It's just left and right.
So if it's a categorical, there are different ways you can handle it, group-wise splits or not, but the tree is going to send a bunch of values left, or, in one common implementation, just one value goes left and everything else goes right. Numerics are nicer and easier to chop up: find the best split point, the one that optimizes sending my data left and right. We do this with validation sets all the time.
So it's cleaner than it may sound. A decision tree will do that and then it's done, but a GBM essentially just keeps doing it over and over, sampling the features and the rows. So you really have a very simple tool at the end of the day: it picks a value; everything below it goes left, and everything above or equal goes right, or wherever is most informative. It's a good recipe. But what it means is that sometimes your model cannot see things, and cannot necessarily put two features together. Yes, a decision tree can split on feature A and send the low part left and the high part right, and then within that low part split on feature B. What it can't do is see both of those at the same time. It only sees one huge mass of observations at a time. So sometimes it's nice to show it something in the data that it otherwise can't see. The bigger thing is that it can't see across rows.
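To make the "pick a value, everything below goes left, everything at or above goes right" step concrete, here is a simplified Python sketch of choosing one split on one numeric feature. It is an assumption-laden illustration: it minimizes squared error with each side predicting its own mean and tries every observed value as a threshold, whereas real GBM libraries work from gradient statistics and typically use histograms or midpoints. The function name and the toy `ages` and `spend` values are made up.

```python
def best_split(xs, ys):
    """Find the threshold on one numeric feature that minimizes squared error.

    Rows with x < threshold go left, rows with x >= threshold go right,
    and each side predicts the mean target of its own rows.
    """
    def sse(vals):
        # Sum of squared errors around the mean of `vals`.
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best_threshold, best_loss = None, float("inf")
    # Skip the smallest value: using it as a threshold would leave the left side empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        loss = sse(left) + sse(right)
        if loss < best_loss:
            best_threshold, best_loss = threshold, loss
    return best_threshold, best_loss

# Hypothetical toy data: spend jumps once age reaches the low thirties.
ages = [18, 22, 25, 31, 40, 52, 60]
spend = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0, 29.0]
threshold, loss = best_split(ages, spend)
print("split at age >=", threshold, "with squared error", round(loss, 3))
```

Note that the split looks at one column only and treats each row independently, which is why information that lives across rows, or in the combination of two columns at once, has to be handed to the model as an explicit feature.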
So if the relevant concept about a customer is the number of times that customer has shown up, the model does not know that. It'll try to memorize the customer number, if you've given it that information, and find the features that describe the customer. But what the model does not know is that this is a common customer. And again, going back to Driverless AI, that's a really common feature that we like to put in there. So it's a mix of everything, I guess, but thinking of the best way of shaping these variables, for me, is usually trying to think of what the model cannot see, because I'm mostly going to trust it: even if it's not perfect, it's close enough. In terms of bang for the buck, where I'm going to spend my time is not necessarily on constantly tweaking the way I'm showing a feature to it, unless I think it can't see it. And that's a vague concept.
But think in decision tree terms: what can it not see? How does this algorithm work? That's what it is. Which is why, again, it's things like counting: counting categoricals, getting these averages, getting things it can't see across the rows. Otherwise, its way of fitting is just this split, and everything goes left or right.
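A count feature of the kind described above ("this is a common customer") can be sketched as follows. The function name `add_count_feature`, the column names, and the toy rows are assumptions for illustration only. The sketch counts over all rows it is given; when time matters, as discussed earlier, you would likely want to count only over data available at prediction time.

```python
from collections import Counter

def add_count_feature(rows, key, new_name=None):
    """Attach to each row how many times its `key` value appears across all rows."""
    new_name = new_name or f"{key}_count"
    counts = Counter(r[key] for r in rows)
    # Return new dicts so the input rows are left unmodified.
    return [{**r, new_name: counts[r[key]]} for r in rows]

# Hypothetical toy data: customer c1 shows up three times, the others once.
visits = [
    {"customer_id": "c1", "amount": 20},
    {"customer_id": "c2", "amount": 35},
    {"customer_id": "c1", "amount": 15},
    {"customer_id": "c3", "amount": 50},
    {"customer_id": "c1", "amount": 25},
]
with_counts = add_count_feature(visits, "customer_id")
for row in with_counts:
    print(row)
```

A tree can now split on `customer_id_count` directly, for example separating frequent customers from one-time customers, which it could not infer from any single row on its own.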
So the more you can show it what you think of as predictive features, like these common customers, things we might describe in human terms, the better; it's about trying to get those in there. That doesn't really speak to feature importance, I guess, but that's where I spend my time. I do mostly trust the feature importance, and I will commonly look through it, mostly, again, for that same idea of testing something.
So I'm throwing something in there and seeing if it's useful, and if it isn't, why not? There are at least two answers to that. One is that I messed up my own feature, so I go back and check: did that work like it was supposed to? It hasn't helped the model. So that's a good sanity check. But also, maybe it's covered by my other features: the model has actually found it already through four or five features. And at this point, you know, we're getting close to time.
So some of these techniques will start to lead you there. These models are powerful, and you can overfit with those features. Sometimes I like to handcraft a nice blend of five features into one feature. That's super powerful, it's easier to explain, and it seems to make more sense. You can roll in all the stuff I've been talking about for a long time: the city, how many times we've seen that customer, common customers getting their own bucket with an average. That's pretty self-explanatory when you have it.
But the models generally perform better if you let them break it out. They'll usually learn faster if you show them big, heavy features, where all the thought has been baked into one feature; they're going to love that feature, probably. But at some point it's just one feature, and you've put some of your own assumptions into how it's built. The models are more robust if they can fit it themselves and see it more naturally.
So that's a tough pill to swallow sometimes, if you put all this work into something and it helps you for a while. But at some point, you almost have to back off and let the model find everything. I don't know, it's an art form probably, or it's probably science that I don't know. But I've just seen that time and time again. Usually the clearest way to see it is that you've let the model run and it's already found it with four or five features. I tell the tale as if eventually you're going to want to back off, but usually it's more that you have two competing models.
Why is my super feature not helping here? Why is that model not as accurate? I take the super feature away sometimes. So that's a little tricky. So sorry, I talked so much on some of these questions; I definitely haven't quite gotten to all of them. And we are at the end of time. So ideally, we can save some of these. You know, find me on LinkedIn or something if you want, and we'll maybe find a way to get some of these questions answered. We'll also do more of these, ideally.
So hopefully, I'll get some stuff set up in the new year. And you can ask some of these on the more tabular approach; you'll find a lot of people in the show right now are great at deep learning. Some of my good finishes are in deep learning too, though I have a long way to go to learn there. Kaggle helps me do it, and the other platforms, even Zindi and CrowdANALYTIX, are a fun place to get going without so much of the pressure, mentally.
That's still the truth of why I do that. Sometimes it's a little more fun to get around a problem on Zindi, just trying to do a good job with it and learn what you're doing, while you're still at kind of a beginner-to-intermediate level and leveling up. And then on Kaggle, people will gravitate towards a good solution. You can copy that, take that and branch that and optimize little bits, and you learn from both sides, I think.
With that, thanks for some great questions. Thanks for hanging on here. And hopefully we'll see you on