
Kaggle Grandmaster Panel


 

 

Wen Phan:

 

Anyway. Hey all, thanks for joining us today. This is an extremely intelligent panel, and I have the tremendous honor of moderating it. I really feel like I get to interview rock stars here. For those of you who don't know me, I'm Wen and I work at H2O; I'm a customer-facing engineer/data scientist. Today we get to talk to a number of Kaggle Grandmasters and Masters, including the number one, number three, and I think number five. And, oh, I'm sorry: one, two, and five. See, I already made my first mistake as a moderator. That's what happens when I have a few drinks. So, before we get started, let me introduce each of the members, maybe starting with Faron at the end, to introduce yourself and share a little about how you got into Kaggle.

 

Mathias Müller's Kaggle Journey

 

Mathias Müller:

 

My name is Mathias Müller. I come from Germany; I'm currently living in Dresden, born and raised in Berlin. My name on Kaggle is Faron, and I'm currently fifth in the world. I started Kaggling after I finished my studies, when I wanted to get out into the world and apply what I'd learned. And I pretty much failed at my first competition. It was pretty bad, and I thought: did I start the wrong thing? What went wrong? Well, I didn't know, actually. And that hooked me. I wanted to get better at machine learning and predictive modeling, and that's where my journey actually started. Doing that changed my life, obviously. I learned a lot of new things, and I was also happy to meet amazing people. It's a great community and a really lovely place. And probably, no, not probably, for sure the best place to learn state-of-the-art machine learning.

 

Wen Phan:

 

Great, thanks.

 

Gilberto Titericz’s Use of Kaggle to Enter the Data Science Industry

 

Gilberto Titericz:

 

Hello everybody. My name is Gilberto. I'm currently number one on Kaggle, and I graduated in electronics engineering in Brazil. I started my data science journey in 2012, like five years ago, competing on Kaggle. So Kaggle helped me start that data science journey. I did pretty well in my first two competitions, and that's why I continued learning more and more from Kaggle. And that's it. That's me. I'm still competing, not as much as before, but I'm still there at number one.

 

Marios Michailidis' Kaggle and Data Science Experience

 

Marios Michailidis:

 

Hi, I'm Marios Michailidis. My Kaggle nickname is KazAnova, which is based on software I had built. I really wanted to become a data scientist, and I tried to teach myself machine learning algorithms and how to code them. At some point I built this software, which included some basic statistics and was mainly focused on credit scoring. I named it after my mother's last name, which is Kazani, and ANOVA, the statistical term for the analysis of variance. So it encompasses the love I have for family and statistics. Then I wanted the next challenge. I felt I had upskilled myself enough, and I wanted to test how well I would do against the world's leaders.

 

And I found Kaggle. Although it was a bit demoralizing in the beginning, because everybody was so much better, I learned a lot, and it is very addictive. You start one competition, then one brings another; you really want to become better, and you learn a lot. This is pretty much how I started my Kaggle journey. I'm currently number two. I used to be number one. And I will keep competing, because I totally love learning and I want to get better. That's about me.

 

How Dmitry Larko Got Started in Kaggle

 

Dmitry Larko:

 

Okay, so, hi, my name is Dmitry Larko. I live here in the Bay Area. I was a pretty happy data warehouse architect until my father actually showed me Kaggle, about five years ago. After that I was basically just trying to do as well as I could in Kaggle competitions. I'm currently, I would say, in the top 100, so way worse compared to you guys, which has actually made me more desperate. Okay. All right.

 

Mark Landry's Experience with H2O, R, and Kaggle

 

Mark Landry:

 

Hi, I'm Mark Landry. I've been at H2O for a while; I actually arrived here because of Kaggle, really. I started competing because I was going to some R meetups, and one of the people there at the time was ranked fifth. He was doing a talk, or actually he wasn't, he interjected on somebody else's talk, and he said people don't do Random Forest anymore, they do GBM. That got me looking at it. People had asked me before about Kaggle, because I'd started doing this data science thing, and said, oh, can you do that? And I was like, I don't know how to do that stuff, I'm just learning what I'm doing. But it's so easy to compete that I looked into it.

 

And I was like, well, this actually could be interesting. So my first Kaggle submission was a CSV created from a database. It had like eight different values, but you get like halfway in there, and it's like, well, that was easy; I'm sure I can do better than that. And five years later, here I am. So it is addictive. I did pretty well in the second one, so it was easy to get hooked into all of the gaming that goes on. But there's a lot to learn. Something I used to tell people a lot is that the law of diminishing returns feels like it waits a while to set in with Kaggle. Your next competition is going to be a different data set, with different tricks there. So you're learning something new for quite a while.

 

I'm not as active as I once was, but it's really fun and really addictive, and if you open your eyes, I think there's always something to see in Kaggle. So I'm here because I reached out to Arno, our current CTO, who was doing Kaggle just to get H2O awareness out there. I was playing with H2O for that reason, and I reached out to him to team up and started doing that with Arno. Then I eventually moved here, although I moved back. Currently, I think my best ranking was 33rd. I'm even ranked below Dmitry, so I guess we're nearly going the wrong way here, but it's really fun. It's time consuming, but it's really fun.

 

Question and Answer Session with the Panelists

 

Wen Phan:

 

It's actually going the right way, because I'm pretty sure my Kaggle score is the lowest. Anyway, we can use Slido. I have a few questions, but this really is a fireside chat, so if you guys are not into using electronics to ask questions, just go ahead and raise your hand. It's a small enough group now that I'm more than happy to just call on you. So here are some of our questions, and all things being equal, maybe we'll just start with the first one: how do you decide what competitions to compete in?

 

Deciding What Kaggle Competitions to Compete in

 

Mathias Müller:

 

Well, it basically depends on the type of competition. I personally prefer, let's say, the old-school tabular data competitions, like binary classification or regression. So far I've skipped the image classification ones a bit, because in the past I didn't have the GPU power to really be competitive there, but I think that will change in the future and I will probably focus more on them. But basically, in the past it was the tabular data competitions. Other than that, there's not much you can say in advance, because you actually have to look at the data to say whether it's worthwhile to compete. But potentially every competition is interesting. So, yeah.

 

Gilberto Titericz:

 

Well, actually I join all the competitions, but I only dedicate my work and my time to the ones where I feel more comfortable with the data set. In my opinion, there are two kinds of competitions right now: the deep learning ones, like image classification, and the structured data set ones. They are really different from each other. I got pretty used to competing on the structured data sets and was very comfortable doing those competitions, but two years ago I started doing some deep learning and image classification competitions, because I want to have more knowledge in that area also. So, all the competitions.

 

Wen Phan:

 

I also read an article once that your wife gets mad at you every time you touch a computer. Go ahead, KazAnova.

 

Marios Michailidis:

 

So I basically see where Gilberto goes and I try to compete against him. So I have to play in every competition; he doesn't give me a break. No, I'm kidding. But it's true that I also like to try everything. I think the urge is: if a competition goes by and I don't enter, I feel there's knowledge I might miss. I know there is always new stuff that comes out, and it has happened to me; I missed one or two competitions and then felt it was tough to go back in, because there's progress and that progress is very quick. My mentality with choosing competitions is similar to Mathias'. I'm actually not very good at image classification, so it's not that I don't enter, but I don't invest that much time. I probably spend more time with tabular data. Obviously I like stacking a lot, so tabular data competitions with a clear metric, like regression or binary classification, are ideal for me. But in principle, I try them all.

 

Dmitry Larko:

 

Well, I just try to enter a competition when I really understand what actually needs to be done. That's one of the main things, because for some competitions it's really hard to understand what metric you're trying to optimize and what data set you have. So it's more about understanding. And because I'm lazy, I never compete in a lot of competitions at once; usually I cherry-pick one, or two at the max.

 

Mark Landry:

 

Early on it was definitely only one for me. I used to try to get a bunch of people to do it together, and we would actually try to pick ones we weren't as familiar with. People who wanted to learn NLP took a vote: okay, we'll do that one. And we moved around to some black-hole weird stuff, so it's fun to move around at first, and then you sort of get to picking the ones you're a little better at. The ones I prefer most are the ones that have a lot of relational data, where you have to think about feature processing, things like that. That's what I'm best at. I don't get much out of the hyperparameter tuning war, but you can usually still find interesting things in a lot of them. So I guess ones that look more like business data sets tend to be interesting to me, where I can think about what's going on. But sometimes you can spot one that has, maybe, small rows, and you just know everyone's going to go with a giant KazAnova-style stacked model, and that's not me. So I try to stay away from those.

 

Best Feature Engineering Tips to Help Improve Kaggle Score

 

Wen Phan:

 

Awesome. So it looks like we got two likes for this one. Pro tip, any feature engineering tips you guys want to share?

 

Dmitry Larko:

 

Well, let me plug myself, right? I gave one webinar and two talks at our meetups on what features you need to build and what tips I would like to share. You can find them on our website. I'll give a sales speech for myself.

 

Mark Landry:

 

And you put really good features in Driverless AI as well.

 

Dmitry Larko:

 

I put all of them actually there, except the most important one.

 

Marios Michailidis:

 

I think the features that Dmitry has are pretty much what I use as well. It's valid to say that every type of competition needs its own features. Time series, for example, needs a different type of feature, like lots of lags; NLP needs different features, like TF-IDF and maybe some SVD on top of that. But I would say the features I like most come from the fact that I started a lot with linear models and logistic regression. I like binning numerical variables and creating weight of evidence, which is a type of target encoding suited for binary classification. I put a lot of effort into this type of feature because I like it; it's the first type of feature engineering I learned, so I do it quite often.
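
Marios' weight-of-evidence feature can be sketched roughly as follows (a minimal, illustrative pandas version; the smoothing constant and function name are my own, not from the panel):

```python
import numpy as np
import pandas as pd

def woe_encode(cat, y, smoothing=0.5):
    """Weight of evidence per category for a binary target:
    WoE(c) = ln( P(c | y=1) / P(c | y=0) ), with additive smoothing
    so rare categories don't produce infinities."""
    df = pd.DataFrame({"cat": cat, "y": y})
    pos_total = df["y"].sum()
    neg_total = len(df) - pos_total
    grouped = df.groupby("cat")["y"].agg(["sum", "count"])
    pos = grouped["sum"] + smoothing                      # positives per category
    neg = grouped["count"] - grouped["sum"] + smoothing   # negatives per category
    woe = np.log((pos / pos_total) / (neg / neg_total))
    return cat.map(woe)

cats = pd.Series(["a", "a", "b", "b", "b", "c"])
y = pd.Series([1, 1, 0, 0, 1, 0])
encoded = woe_encode(cats, y)  # "a" gets a positive WoE, "b" and "c" negative
```

Since weight of evidence uses the target, it should, like any target encoding, be fit on training folds only to avoid leakage.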

 

Gilberto Titericz:

 

Okay, in my opinion the best feature engineering definitely depends on the data set. But one feature that stayed hidden for a long time, with just a few people using it for feature engineering, is target encoding. I think target encoding was the key to many of my previous Kaggle victories. But it's not only about target encoding; it's about how you process the categorical features in general. And right now, one kind of feature engineering that is very helpful for all the models I'm testing is categorical embedding, using TensorFlow or some deep neural networks. So what I would say is: try all the feature engineering you know on every data set, and keep only what works better for your data set.
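
Target encoding as Gilberto describes is usually computed out-of-fold, so a row never sees its own target. A minimal sketch (scikit-learn/pandas; the smoothing scheme and synthetic data are illustrative assumptions, not Gilberto's exact recipe):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(cat, y, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold mean target encoding: each row is encoded with the
    category's target mean computed on the *other* folds, limiting leakage."""
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    prior = y.mean()
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=seed).split(cat):
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["mean", "count"])
        # shrink rare categories toward the global prior
        smoothed = (stats["mean"] * stats["count"] + prior * smoothing) / (stats["count"] + smoothing)
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smoothed).to_numpy()
    return encoded.fillna(prior)  # categories unseen in a fold fall back to the prior

# Synthetic example: category "a" is strongly associated with y = 1.
rng = np.random.default_rng(0)
cat = pd.Series(rng.choice(list("abc"), size=300))
y = pd.Series(np.where(cat == "a", 1, 0) ^ (rng.random(300) < 0.1).astype(int))
enc = target_encode_oof(cat, y)
```

The encoded column can then be fed to any model in place of the raw category.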

 

Mathias Müller:

 

I would like to give a more general tip: think about the strengths and weaknesses of the different models. That's always a good starting point. Okay, how does a tree work? It's good with numerical features, but pretty bad with high-cardinality categorical features, so target encoding comes to mind. And if you have time series data, you know tree-based models are not able to extrapolate. So what do you do about it? That's the question you have to ask yourself. Then you go ahead and search the Kaggle forums, read the winning solutions, and so on, and you get an idea about what actually helps. You need to try a lot of things, and like Gilberto said, what works and what doesn't is always really data dependent. But you get the right hints by asking yourself the right questions and knowing the strengths and weaknesses of every model.
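
Mathias' point that tree-based models can't extrapolate is easy to see in a small experiment (scikit-learn; the toy data is mine, not from the panel):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Train both models on a perfectly linear trend over x in [0, 10],
# then predict beyond the training range at x = 15.
x_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * x_train.ravel()          # y = 2x, so the true value at x = 15 is 30
x_test = np.array([[15.0]])

tree_pred = DecisionTreeRegressor(random_state=0).fit(x_train, y_train).predict(x_test)[0]
line_pred = LinearRegression().fit(x_train, y_train).predict(x_test)[0]
# The tree is capped at the largest target it saw (about 20),
# while the linear model extrapolates the trend to about 30.
```

This is exactly why time series competitions often need an explicit trend feature or a linear component alongside the trees.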

 

What Are the Best Tools to Use in Kaggle and Data Science

 

Wen Phan:

 

One thing that I've learned from these discussions with Mark and Dmitry, which is not necessarily a feature engineering thing, is what Mark would call a validation strategy: thinking really hard about how you do your cross-validation, how you make your hold-out samples, and what's actually getting trained on. That's at least a tip I picked up from these guys. I guess we'll go to the next question by rank: what are your favorite tools to use?

 

Mark Landry:

 

I'm definitely liking Driverless AI now, but we'll keep the pitch out of it, although that is part of why I was in first place a few days ago. I use data.table most. I've been an R user coming from SQL, so I use it for everything. It's interesting because it's enough of an ecosystem that we can find people who know it and use it for the projects we work on. That's where I do almost all my feature engineering, and just exploring the data set and everything like that. So data.table in R, I guess, is going to be similar to your pandas DataFrame and things like that, just faster.

 

Dmitry Larko:

 

I am mostly a Python user, so that immediately takes data.table out of the equation, which is sad. I think I've been asking Matt to create a Python version for the last two years, no, year and a half I'd say. But in general, because I'm a Python guy, scikit-learn is one of the most frequent tools I use, because it has a variety of machine learning algorithms; it's my weapon of choice. If I have to work with a bigger data set, of course I have to pick up something else, but it's basically scikit-learn plus my own scripts, and maybe Driverless AI, but I'm not quite sure.

 

Mark Landry:

 

And I should refine mine, as most of us will say, we all pretty much use some flavor of GBM these days, and I use GBM for almost everything, even when it may not be the best tool for the job.

 

Marios Michailidis:

 

I have to admit I use a lot of tools, and I think if you want to do well on Kaggle, you have to know the tools generally. There are different competitions, or different problems, where some tools work better than others. But in principle, I also use Python. Personally, I like Java as a language, and I started creating my own things over time, but I think I have fully switched to Python for a lot of things at least. I mostly do pre-processing with pandas and sklearn. In terms of algorithms, I really like LightGBM for boosted trees; for deep learning, I really like H2O and Keras; for linear models, I like the LIBLINEAR implementation; and I like libFM for factorization machines, and libFFM. I don't know what else. That's pretty much it; these are the tools I use most of the time.

 

Gilberto Titericz:

 

I have a lot of tools that I like. I started machine learning programming in MATLAB; I won my first two competitions, sorry, my first competition on Kaggle using the MATLAB Neural Network Toolbox. Then I started to learn R, and I won a lot of competitions on Kaggle just using R. In R I also like data.table very much; I use it a lot. Then I found scikit-learn and started to learn Python too, and consequently I needed to learn pandas to process the tables. I also found a lot of external, command-line tools; I used a lot of command-line algorithms in the past, like Vowpal Wabbit and libFM. Well, I could list a lot of things, but those are the top ones for me.

 

Mathias Müller:

 

For me it's pretty similar, except for R; I'm not an R user. I mostly use Python for Kaggle competitions, plus libraries that can be interfaced with Python. For models, it's the usual suspects like GBM, Keras, and scikit-learn, and if you have those in your tool set, you can get quite far. Then for some competitions you need, maybe, factorization machines, and you go more to the command line. I don't really like that and I try to avoid it, but sometimes it's not avoidable. Basically, that's the overview.

 

Utilizing Multilayer Stacking in Data Science Applications

 

Wen Phan:

 

I'm going to ask the first one that was starred three times about KazAnova's crazy stacking methods.

 

Marios Michailidis:

 

KazAnova with K, right?

 

Wen Phan:

 

I know I didn't write that. But I think they're giving you a compliment.

 

Marios Michailidis:

 

Because that one refers to the lover. I think I'm not crazy, but I like to do multilayer stacking. Imagine you combine some models with another model, and you build so many of these that you then need to combine the combiners, models that use models as input. Maybe that didn't make sense, but the idea is this: I know that it's very difficult to get the structure right, so I tend to create many hundreds of different models, and I try to use different permutations of the data each time. For example, I'll make one data set where I transform the categorical variables with one-hot encoding and build an XGBoost model on it. Then I'll build another data set where I try a completely different transformation. And I keep doing this until I've generated a very large pool of models which are semi-random.

 

Not completely random, but semi-random. Why? Because I don't want to get stuck in some local minima. Then, once I have created this large pool, I use other models to select which of these models are useful. And because I have started with such a big pool, I need to do this on many levels, and keep filtering, until I reach a point where I can no longer improve my score, if that makes sense. I will talk more about it tomorrow, so that's another reason to come to my talk. But this is the basic idea: generate a lot, because it's not easy to know beforehand which structure or which models are best, and then filter down to what is really most useful.
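
A stripped-down version of the idea, a pool of level-0 models whose out-of-fold predictions feed a level-1 combiner, might look like this (scikit-learn; the model pool and data are illustrative, far smaller than the hundreds of models Marios describes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Level 0: a small, diverse pool of base models (Marios would use hundreds,
# each trained on a differently transformed copy of the data).
pool = [
    RandomForestClassifier(n_estimators=50, random_state=1),
    GradientBoostingClassifier(random_state=2),
    LogisticRegression(max_iter=1000),
]

# Out-of-fold predicted probabilities become the next level's features,
# so the combiner never sees a base model's in-fold (leaky) outputs.
meta_features = np.column_stack(
    [cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in pool]
)

# Level 1: the model that filters and weights the pool.
meta_auc = cross_val_score(LogisticRegression(), meta_features, y, cv=5, scoring="roc_auc").mean()
```

Adding further levels just repeats the pattern: the level-1 out-of-fold predictions become features for a level-2 model.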

 

How Multilayer Stacking Works

 

Wen Phan:

 

Marios and I had a conversation Friday night with a client about this. For those of you who are not familiar with stacking, one of its fundamental assumptions is that no single model or learner is perfect. They all have some assumptions baked into them and will do well on certain occasions. So what you can do is leverage all of them, and what you build is a model that decides which one of them to use. I call it a manager: you have all these little workers who are good at different parts of the problem, and then you have a manager that says, okay, I'm going to pick this model for that region, or whatnot. And he's basically built an entire hierarchy, if you think of the manager analogy. I think you're talking about a manager of managers, et cetera, right? Yeah.

 

Marios Michailidis:

 

Manager, then you have to put in the CEO. So, as you said, you have to build this hierarchy.

 

The Best Resources to Learn Feature Engineering

 

Wen Phan:

 

Great. Okay. This is, I guess, the next question by rank order, which I think is very similar to the other one. But do you guys have any recommended resources for learning the art of feature engineering?

 

Dmitry Larko:

 

As I mentioned before, we have some links on our website. I did a webinar about feature engineering, and I did a couple of talks at our meetups; you can also find the slides on our GitHub. There are also a couple of very good presentations on SlideShare; you can just google "feature engineering" and "SlideShare" to find them. But there is actually no single place dedicated to feature engineering; the resources are spread across the internet.

 

Marios Michailidis:

 

I would like to add, apart from Dmitry's feature engineering presentation, which is great, and to take this opportunity to advertise our Coursera course. We made it with other Kaggle Grandmasters, and it focuses a lot on feature engineering. I don't remember the exact name; it's something like "win data science competitions". But we have capitalized a lot on feature engineering and on what you need to use in order to win machine learning competitions. So I advise you to try it. Lots of code, lots of notebooks. It is based on Python, obviously, but it should be very comprehensive.

 

Gilberto Titericz:

 

Actually, the question is very good. Feature engineering is an art, and like every other art, the only way to master it is to train. You need to put your hands on some models; you need to try it yourself. Most of the automated machine learning tools right now are trying to do that automatically, and it is something that is very hard to do automatically; that's why it's an art. So what I recommend to learn feature engineering is doing Kaggle. Put your hands on some problem, some challenge, and build a good solution. Try different features from your mind and see if they're good or not. Every data set is a way to practice feature engineering.

 

Mathias Müller:

 

I can probably add one more resource: a talk from another Kaggler, I think a Grandmaster, who had a really good presentation giving a fairly complete overview of feature engineering in general. And then I can just second that you actually have to apply it. You have to train, and you have to think about what could be useful on a given data set. Maybe I perform some EDA and I see something; how can I exploit that? For instance, maybe a little backstory: if you compete on Kaggle, people often find something in the data set and post it in the forum, but they don't think a step further and don't try to exploit it.

 

And I take such information happily and say, okay, let's look at this. Someone found it for me and shared it. Often that leads to something, some data insight which you can use to create a useful feature. And that gives you an edge over the people who are not trying to make something out of it and are just saying, okay, there's something strange here, but I don't want to think about why it's there. Most of the time there's a reason. So get your hands dirty and train. It's really learning by doing.

 

Mark Landry:

 

And a bit different of a take on this would be to think, at a pretty bare-bones level, about what these algorithms are doing. XGBoost is just decision trees at the end of the day, and decision trees have their limitations. So you're thinking about how you can overcome them: how can you represent a feature? If you put yourself in a binary world, how can I get a question answered where the left goes left and the right goes right? That's ultimately what you're trying to do. Now, you can go overboard with that; sometimes I'll team up with somebody who comes at me with a very generic solution, and I've overthought it. But I still think it helps overall to be trying to think of how you can extract a feature.

 

Because there are encoding schemes, ways of showing the features to a model; that's really one genre of feature engineering. Then there's feature crafting, which is more of what Faron's talking about: how can you take an insight and represent it best to the model? There you're going almost deeper than normal. The algorithm couldn't see that before, so you're trying to unearth things that the algorithm couldn't see. And there's a lot of debate. You're going to hear target encoding; you've already heard it. That's a different encoding scheme than one-hot, a different way of doing it, and it matters. So, I don't know, for me that helps. I tend to think in flow charts and decision trees; that's really all they are.

 

Tips and Advice for Kaggle Newcomers

 

Wen Phan:

 

Great. Okay, next question. It says if you could go back in time, to when you first started, what advice would you give yourself?

 

Mark Landry:

 

I don't really have any regrets, so I don't know. I wouldn't change much. I think I like the path that I got on. I think you have to be, we're all interested in doing it. We keep doing it. So I don't really have a good answer.

 

Dmitry Larko:

 

Yeah, I agree with Mark, but I do have advice for myself, for younger me, basically. I thought about it carefully; it should be something very short, very practical, and very easy to implement. So I would just say to my younger self two words: buy Bitcoin. It was five years ago, right? So it was the right time to buy it.

 

Marios Michailidis:

 

I actually don't have any regrets, to be honest. The only thing I would say, something that came to my mind that I'd like to convey because I don't want people to feel it as well: you may get a bit demoralized in the beginning, because it might be a bit daunting. You may try, you may not do very well; just don't be demoralized. You'll get there. You put in the time, you will struggle, maybe the first or second time. But people are very willing to share, they're very helpful, and the forums are always busy. You'll get there. Just don't feel sad if you don't see yourself winning your first competition, right? You'll get there. This is what I would say to myself.

 

Gilberto Titericz:

 

I would say: spend some time analyzing the data sets, and build a very good validation strategy, I mean, a cross-validation. This is the basis for good model building, having a good validation strategy. Most people fail at building that good validation strategy, and that's why they fail. So build a very good validation strategy and trust your offline metrics.
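
Gilberto's advice boils down to fixing one validation scheme and scoring every experiment with it. A minimal sketch (scikit-learn with synthetic data; the fold count and metric are illustrative choices, not his prescription):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=600, n_features=10, random_state=42)

# One fixed, stratified split scheme, reused for every model and feature set,
# so that offline scores are comparable across experiments.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
offline_estimate = fold_scores.mean()  # the number to trust over leaderboard noise
```

The fold definitions should mirror how the test set differs from training (by time, by group, and so on); a plain stratified split is only the simplest case.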

 

Mathias Müller:

 

I would say to myself: don't spend so much time on model tuning. In my first competition, I spent weeks just tuning models.

 

Marios Michailidis:

 

No, don't say that. I mean, this is his biggest strength. And this is why you got it. No. Pick something else.

 

Mathias Müller:

 

No, no, I'm serious about it. The problem is, what I see is that many newcomers spend a lot of time tuning the models, trying to find the perfect set of parameters. At the beginning you get something out of it, but then the time you put in doesn't pay back. So you need to find a trade-off, because there are much better ways to improve your final results. What I observe really often with newcomers, and I did it as well, is spending time tuning models for a fifth-digit improvement. But sometimes you need step changes. We call them step changes: when you really find something new, some new idea that makes a leap in your score. Model tuning won't bring you there.

 

Nowadays, I spend really little time on model tuning. It was a strength of mine in the past, because in some competitions it really comes down to the fifth digit, and that can be like 15 places on the leaderboard, so it might help. But in general, I don't like to do it any longer; I have some scripts doing it for me. In the past, I did it manually: I really looked for hours at my screen, stopped and changed parameters, restarted and watched again, watched the traces and so on. You can do that, but you can spend your time much better.

 

What is the Value of Kaggle Outside of Formal Competitions?

 

Wen Phan:

 

I want to interject with a question. A lot of what's consistent across what all of you have said is: get your hands dirty. What do you guys think the value of Kaggle is for those who don't want to compete? I think there's a piece we're missing: the community itself, maybe the discussions and the kernels and whatnot.

 

Mark Landry:

 

Well, I would almost take it from that bottom question. I think there's a lot to learn about real-world data sets if you keep your eyes open. I've met people here in the Bay Area who are just not that interested if they can't get behind whatever it takes to win a competition, and sometimes that happens; then they just don't. But that's fine; take out of it what you want. If you want to learn something, I think it's useful to get your hands dirty enough that you can see when you're wrong. I think that's the biggest thing that comes out of this. It's like homework in school, except on a grander scale, maybe, because you have these month-long, two-month-long competitions. A lot of times, when you see the final solutions, they're not that much different from the ideas you had, but you must face the reality that you didn't have that idea.

 

And there's something to be learned from that. You don't have to be competing for the top 10 to see these sorts of things; you can look at the leading models. Sometimes they're simple, sometimes they're not. But if you at least try a little bit, you can take a lot more out of it. These are non-curated, real-world problems; most of the problems are real-world. I've worked with a lot of companies at H2O, and you can usually go in there and immediately improve a model that someone's been spending some real time on. That's because we've practiced 5,000 of these things.

 

Wen Phan:

 

Any additional thoughts on the value of Kaggle besides competitions? Of course, I'm talking to the most competitive group, probably.

 

Mathias Müller:

 

Well, I think about the recent development of Kaggle: in the past it was all about competitions, right? But nowadays you can contribute in many different ways, like the discussions, all those kernels, the free data sets. It's the best learning resource out there, simply put, and you don't need to be competitive. If you want to learn something, go to Kaggle. If you want to compete, go to Kaggle. If you don't want to compete, still go to Kaggle. It's simply the best resource. And with the new features like kernels, and all those free data sets around, there's so much to explore. Even if Kaggle had no competitions at all from now on, it would still be a place I would visit several times a day.

 

Gilberto Titericz:

 

I think the best thing about Kaggle is that you can learn from the best competition solutions. This is the most valuable thing in Kaggle: learning from the top players in past competitions. You can find everything on the forums. And another thing about Kaggle is that you can learn from the open scripts there, the kernels. There are a lot of things on Kaggle that you can learn without competing.

 

Marios Michailidis:

 

I agree. I think Kaggle is the home of data science and I think it lives up to its name. There are so many things you can do. Anytime I have a question, I just post it and get an answer fairly quickly about anything. Like, how can I do this visualization or where can I find an implementation of this technique? It has a news feed and if there's something new, people there will try it. I've learned image classification, text classification, sound classification. There's just so many things you can do. You might actually lose yourself trying to do everything.

And it has so many datasets. For example, I remember once, I wanted to find a dataset about what crimes were in a specific area and I was just able to find it. So it has so much. You can get value in so many different ways aside from the competitions. It's even tough to list it. I think people should just have a look and see all the options that are available in order to get a better idea.

 

Why Is There So Much Hype Around Deep Learning?

 

Wen Phan:

 

All right, so we've been very pragmatic. Maybe we'll move on to the other end of the spectrum and pick your brains a little. How much hype do you think there is around deep learning? A question asked recently: a great approximation for Tyvek was found without deep learning, and gcForest is a faster option for representation learning.

 

Dmitry Larko:

 

Well, there is a lot of hype around deep learning from my point of view. It's the shiny new thing, and if you have a hammer in your hands, everything looks like a nail. That's exactly what's happening with deep learning right now. Which is not a bad thing, because either way you'd like to try this new approach on a lot of problems. Some of those problems can be solved with deep learning and some of them can't, but that's how you learn. I disagree about gcForest, though. To my knowledge it was tested against the MNIST data set, which is a very simple data set. You can do a very good job on it with just a Random Forest baseline, so I wouldn't say it's a good benchmark for testing anything against neural nets.
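To illustrate the point about MNIST-style benchmarks, here is a minimal sketch using scikit-learn. It uses the small built-in 8x8 digits set as a stand-in for MNIST (to keep the example self-contained and download-free), and shows that a plain Random Forest already scores very well on simple digit images:

```python
# A plain Random Forest baseline on MNIST-style digit images.
# The 8x8 digits set stands in for MNIST; the point is the same:
# a simple forest already does very well on this kind of data.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
print(f"Random Forest accuracy: {acc:.3f}")
```

Since such a simple baseline already lands in the high 90s on digit data, beating it there says little about how a new method compares to deep nets on harder tasks, which is the objection raised about the gcForest benchmarks.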

 

Mark Landry:

 

I think with CNNs, if you're not doing CNNs, you're doing it wrong on images right now. And you can see in one of the Kaggle competitions, Michael Jahrer, who's one of the Netflix Prize winners, named his team "I don't use CNNs," and he knows what he's doing, and he landed in about 200th place trying to make a point. And he made the wrong point. I don't know what he was going after, but he didn't make that point very well. Outside of that, I'm not sure. I'll let these guys talk about that, but it still seems like CNNs are definitely the way to go, and we're seeing new architectures keep coming up.

 

Marios Michailidis:

 

It's another algorithmic family you need to know if you're in this field. I mean, there have been some areas where deep learning is undoubtedly really good, like image classification. But I would say that if you want to be generally good all around, you need to know them all. I do agree that there is some hype for something that has been around for many years. It's obviously the advent of computing power, especially GPUs, that has made it resurface and come out again. But I would still say that it is definitely something you need to learn and study, and new things will come out, but it won't be the answer to everything.

 

Gilberto Titericz:

 

Deep learning is really interesting. Up to now it has been used mostly for image classification problems, but there are other uses for it on tabular and structured data as well. The downside, I think, is that some models are very hard to implement, and it's very highly customizable, so you need a very good understanding of what you are doing to work with deep learning. It's not as out-of-the-box as other models, like decision tree models. Also, every day there are new papers about deep learning, new techniques, new architectures. So there's a lot of potential, and we can see that it's going to be very good in the future.

 

Mathias Müller:

 

I would say there's definitely some type of hype about deep learning. Of course, it produces state-of-the-art results in many domains, but it's not a magic tool you can apply to every data set and have it work out of the box. It's really hard to tune; it's much more difficult to tune deep learning models than to tune boosting models. Getting the architecture right is almost its own art, like feature engineering, I think. And even the mastermind behind deep learning is probably heading in a different direction now, saying okay, maybe that was a huge step for machine learning in general, but he already sees some limitations with the current approach to deep learning and wants to find new ways to mimic what our brains are actually doing, how our brains work, how our brains learn.

 

And I mean, there's still this open question of how you bring knowledge from different domains into the current problem. Because obviously, when we learn, we use information from the past and we use different information; a learning problem is not something you can take in a vacuum. Like, if you see a picture, you already have knowledge about structures to see the patterns. And currently, machine learning works more like a closed system, and I think the next big step would be to break this barrier and make transfer learning work.

 

Wen Phan:

 

All right, folks, we're officially out of time. I think the Grandmasters and Masters are probably going to hang out for a bit if you want to just talk to 'em personally. But let's formally give 'em a round of applause for their time and all their great knowledge.