
Kaggle Grandmaster Panel - H2O AI World London 2018

 

This video was recorded in London on October 30th, 2018.

 

Speakers:

  • Mikel Bober-Irizar, Machine Learning Engineer, ForecomAI

  • Darragh Hanley, Senior AI Researcher, DoubleYard

  • Mathias Muller, Data Scientist, H2O.ai

  • Branden Murray, Principal Data Scientist, H2O.ai

  • Marios Michailidis, Data Scientist, H2O.ai

  • Dmitry Larko, Data Scientist, H2O.ai

  • Jean-Francois Puget, Distinguished Engineer, Machine Learning, NVIDIA

  • Sudalai Rajkumar, Head of AI & ML, Growfin 

Full Transcript

 

Moderator:

 

What a treat. All Kaggle Grandmasters. Eight of them. I think there's two more in the audience. Maybe they left already, but if you're around somewhere, you're welcome to join the stage. This is the largest concentration of Grandmasters in the world, so applause to those guys. Let's start with the youngest. Who can guess his age? Very good, 17. Please introduce yourself.

 

Mikel Bober-Irizar:

 

I'm Mikel. I'm the youngest Kaggle Grandmaster. That's my gimmick. I'm 17. I started Kaggle about three years ago, and since then, I've sort of just been self-learning, doing competitions, and getting better at it.

 

Darragh Hanley:

 

Hey, I'm Darragh, and I've been on Kaggle, I'd say, about three or four years. I'm an AI engineer in the healthcare space with Optum, doing some interesting work there and also trying to learn as much as I can on Kaggle.

 

Mathias Muller:

 

Hello, I'm Mathias. I'm a Data Scientist here at H2O, working primarily on developing Driverless AI. My name on Kaggle is Faron. It's just an awesome platform, and if you're not competing already, just get started.

 

Branden Murray:

 

Hi, I'm Branden. I'm a Customer Data Scientist at H2O. I joined Kaggle five years ago, but I only knew Excel at the time, so I spent another year and a half or so learning R. I've been on Kaggle competing actively for about three and a half years.

 

Marios Michailidis:

 

Hello, I'm Marios. My Kaggle name is KazAnova. I've been working for H2O for almost a year and a half now as a competitive Data Scientist, mostly on Driverless AI. I was formerly ranked first in Kaggle competitions, and I think I have participated in more than 120.

 

Dmitry Larko:

 

Hi, my name is Dmitry Larko. I am a Kaggle Grandmaster. I participate in a lot of different Kaggle competitions. I opened my Kaggle account about seven years ago, and I'm an H2O Data Scientist.

 

Jean-Francois Puget:

 

I'm Jean-Francois Puget, known as CPMP on Kaggle. I'm a double Grandmaster, and I'm better at speaking than competing because I'm ranked first on the forum. I work for IBM doing machine learning. Machine learning is a long-term interest; I got my Ph.D. in the previous millennium. As you can see, I'm not the youngest here, but machine learning has really been my interest for many years.

 

Sudalai Rajkumar:

 

Hi, I'm Sudalai Rajkumar, called SRK on Kaggle. I'm working as a Data Scientist at H2O.ai. I've been Kaggling for about five years now, and I'm a Grandmaster in both the competitions and the kernels sections.

 

Advice to Data Scientists

 

Moderator:

 

All right. Let's get going with questions. If you have questions, please submit them to Slido. What is the number one piece of advice you would give to an average data scientist, something you already know they need to hear? Something they just have to listen to.

 

Mikel Bober-Irizar:

 

I think what I would say is that there's a big difference between learning something in theory and actually trying to do machine learning in practice. Because, like you've seen the rest of the day, it's all about feature engineering. It's all about figuring out what's going to be good and sort of just getting your own ideas and implementing them. Actually knowing the equations is not that important, because someone's done the equations for you; it's already programmed for you. It's just about getting that intuition, so my advice is to get stuck in, do Kaggle competitions, and get that intuition.

 

Darragh Hanley:

 

I totally agree with that. One thing I find myself struggling with a lot is, when you develop a pipeline or something, not getting stuck in it for too long. Try to move on to something new, because I think we're all in it for the learning, for what we can learn from new methods, new feature engineering, new models. Constantly push yourself to move on to the next thing instead of spending a lot of time just tweaking one pipeline. Generally, you get more out of it during the competition, and you also learn more that you can apply in the future.

 

Mathias Muller:

 

One important thing for myself: after I finished my studies, I started Kaggling, and I was quite confident, actually, because I said, oh, I have such a good educational background now. Then I got into my first competition, I got grounded a bit, and then all the fun and experience kicked in. I really wanted to learn what I was missing, and I figured out that I missed a lot. Academia doesn't teach you the things that are really necessary out there. You have to figure it out on your own. I mean, that's always been the case at a university; a lot of the things you really need, you have to teach yourself. Kaggle is really the best platform to get started if you want to get better at all the things which are part of a data science pipeline.

 

Branden Murray:

 

I don't know if I have too much to add compared to what all these guys just said, but I agree with Darragh about just trying different things. I feel like sometimes, in a competition, you get stuck doing one thing and trying to perfect it, and then you realize after the competition that someone tried something else that you were thinking about trying, but you just never had the time to do. I think taking more of a shotgun approach towards things rather than, I guess, focusing on one single thing.

 

Marios Michailidis:

 

I'm not sure what the question was.

 

Moderator:

 

The single best advice you would give to young data scientists.

 

Marios Michailidis:

 

Now your answers make sense. I think you need to get your hands dirty. Don't get demotivated, because the work in data science might be a bit tough sometimes. You see all the piles of notation people use, and it might feel a bit intimidating to enter the field, but it shouldn't. Obviously, it takes a certain amount of time, but I've been able to do it from an accounting background. I know a lot of people who have been able to do this with a not very strong background. I think it's a matter of putting in the time and trying to learn. It is a very open community. The data science community is one of the best, most generous communities out there, dominated by open source work. People will help you. Don't be afraid to ask. Generally, that's my message: get your hands dirty, and don't be intimidated.

 

Dmitry Larko:

 

We have had a lot of advice like this already, so I can make it a sort of summary. You're supposed to know your algorithms, but that's something everybody knows. You are also supposed to know the metric you're trying to optimize, and to know the data. Especially, keep in mind the difference between the train and test data sets in a Kaggle competition, because in some competitions that can give you a lot of insight.
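
For readers who want to see what checking that train/test difference can look like in practice, here is a minimal sketch of adversarial validation, a common Kaggle technique that Dmitry does not name explicitly here. The file names and preprocessing are placeholders, not anything from the talk.

```python
# A minimal sketch of "adversarial validation": train a classifier to tell
# train rows from test rows. File names and preprocessing are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv").drop(columns=["target"])
test = pd.read_csv("test.csv")

data = pd.concat([train, test], ignore_index=True)
is_test = [0] * len(train) + [1] * len(test)  # label each row by its origin

X = data.select_dtypes("number").fillna(-999)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
auc = cross_val_score(clf, X, is_test, scoring="roc_auc", cv=5).mean()

# AUC near 0.5: train and test look alike. AUC near 1.0: they are easy to
# tell apart, so a plain random split may be a misleading validation scheme.
print(f"adversarial AUC: {auc:.3f}")
```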

 

Jean-Francois Puget:

 

I would say the hardest skill to get is to properly evaluate model performance. It's very easy to download data, install open source tools, train a model, and get 99% accuracy or whatever other metric, while your model is just learning the training data and not generalizing. Entering a Kaggle competition is a wake-up call if you don't do it right. You believe you are doing well throughout the competition, and on the last day, when the private data set is used, you discover that your model was awful. Learn how to evaluate your model properly. It's hard because you must not overfit the training data. Practice, Kaggle, is one way to learn that. To me, this is the most important skill.

 

Sudalai Rajkumar:

 

In my case, as people have mentioned, being hands-on is one thing which has helped me a lot in growing my career. In the case of Kaggle competitions, I try to participate in multiple competitions, so I try to solve different types of problems. In that way, I improve my data science skills.

 

How Many Hours Do You Spend a Day?

 

Moderator:

 

I wonder how many hours you spend a day? Is it 0.1, or is it five? Over the last year or two, your average?

 

Mikel Bober-Irizar:

 

If you're asking the average, unfortunately probably over an hour a day for me. But it really depends: if I'm doing a competition seriously and it's coming up to the last week of the competition, then I think it's fair to say that we spend way too long doing it, sort of missing sleep and all that. Then again, there are times when you're not working on a competition, or you're just doing one for fun, and then it's really whenever you feel like it. I think overall, if you're trying to be a prize winner or in the top 10, it's not like, oh, we just press a button and it works. You really spend a lot of time doing it.

 

Darragh Hanley:

 

The answer is a bit embarrassing sometimes, how long you spend on it. I find that if I'm in a competition and I start to do well early on, I end up spending a lot of time on it, maybe two or three hours a day, maybe a bit more at the weekend. That'll be on top of work a lot of the time, but I enjoy it. I do it mainly to learn new techniques, because I find it interesting. Instead of watching Netflix, do Kaggle.

 

Moderator:

 

It takes 10,000 hours to be a grandmaster at anything. They're probably underestimating the hours even.

 

Mathias Muller:

 

I cannot tell how many hours I spend on average. I don't know. But definitely, if I'm really into a competition, it's a bit like an addiction. I have so many ideas in mind, I want to try them out, and it's really painful to have to wait. I have to implement the stuff, then I have to try it, and then I see, oh, it's not working, damn it, next one. The list gets longer and longer, so sometimes it's like a motorway. I hardly find sleep and can't get out of it if I'm really diving into a competition and actually have a chance to compete for the top 10 or better. Then, personally, I use every free minute I get for it.

 

Branden Murray:

 

For me, it could be sometimes 10 hours a day. I get to do it for work sometimes, so I'm doing research and things like that. One of the things we're working on at H2O is reading documents, trying to figure out where the invoice date is. I was using Fast R-CNN for that, which happened to be what I used. Right now, there's the RSNA competition, drawing bounding boxes around where pneumonia might be. For that competition, I probably spent eight hours a day just looking through the code, trying to make changes, trying to figure out how it worked and what changes I would need to make it work for our process. For something like that, I might spend 8 to 10 hours a day. But for the Home Credit competition, where I was on a large team of 12 and we got second place, I didn't do a whole lot. I wrote some code in a loop and made 20,000 features or something ridiculous like that. That's pretty much it. Maybe an hour, two hours a week on that one. Not too much.
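
As an illustration of the kind of brute-force feature loop Branden mentions (this is not his actual Home Credit code; the column handling is a placeholder), pairwise combinations of numeric columns alone reach that scale very quickly:

```python
# Not the actual competition code; just a sketch of a brute-force loop that
# generates ratio and difference features for every pair of numeric columns.
import itertools
import numpy as np
import pandas as pd

def make_pairwise_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    num_cols = df.select_dtypes("number").columns
    for a, b in itertools.combinations(num_cols, 2):
        out[f"{a}_div_{b}"] = df[a] / df[b].replace(0, np.nan)
        out[f"{a}_minus_{b}"] = df[a] - df[b]
    return out

# With roughly 140 numeric columns, this already produces on the order of
# 20,000 new features, the scale mentioned above.
```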

 

Marios Michailidis:

 

Probably more time than most. I should not embarrass myself further, but I should say, in my defense, there is a good overlap with my work.

 

Dmitry Larko:

 

In the last competition, I was able to end up 6th out of 2,000-something. It was my latest solo gold medal. As for how many hours I actually spent: I spent four hours a day for three months. That's a lot.

 

Jean-Francois Puget:

 

It depends if you include dreaming or not. I dreamed of my current competition last night; it was not a nightmare, so that's good. I would say two, three, four hours. I have a day job. There is some overlap, but maybe not as much as for you, Marios. I spend a bit of time every morning preparing runs, watch them during the day, and finalize in the evening.

 

Sudalai Rajkumar:

 

I spend around three to five hours on average, but my system used to spend around 15 to 20 hours.

 

How Do You Work?

 

Moderator:

 

What's your process? Do you just do random attacks, trying all kinds of stuff? Do you first set up a validation scheme or think about the metric? Maybe you reframe the target column first? Or how do you do it? Does it depend on the problem? Of course, it always depends, but is there something like a systematic way, a better process? How is it that you made it to this stage, let's say?

 

Mikel Bober-Irizar:

 

I think that in an ideal world, when you start a competition, the first thing is setting up validation. That is really important because you want to be able to work out locally how good your model is, so you don't have to keep uploading to Kaggle. Then also, visualizing the data and trying to find interesting bits of information that you can use, feature engineering, things like that. In practice, for myself, what I tend to do is download the data, stick it into XGBoost, upload it, and then sit happily for a couple of hours while I'm at the top of the leaderboard because no one else has downloaded the data yet. That's my approach. Once I am there, I'll go back and start looking at the data more, but I really should do more of what I preach.
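
A rough sketch of that "download the data, stick it into XGBoost" first pass might look like the following; the file names, the id and target columns, and the parameters are placeholders rather than anything from the talk.

```python
# Quick first-pass baseline: read the data, fit XGBoost on numeric columns,
# score it locally, and write a submission. Everything here is a placeholder.
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X = train.drop(columns=["id", "target"]).select_dtypes("number")
y = train["target"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6)
model.fit(X_tr, y_tr)

# Local check before uploading anything to Kaggle.
print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

sub = pd.DataFrame({"id": test["id"],
                    "target": model.predict_proba(test[X.columns])[:, 1]})
sub.to_csv("submission.csv", index=False)
```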

 

Darragh Hanley:

 

I think the pipeline's pretty similar. At the start, I try to get an idea of how well people are doing on kernels and then set up a validation set. I think that's something that's very important and often overlooked: having a good validation set. Run a simple model yourself locally to make sure you can get close to whatever the scores on the leaderboard are. Then just spend a lot of time trying to understand the data. A lot of time goes into that part.

 

Mathias Muller:

 

It's getting repetitive, but the first step is to set up a good foundation for cross-validation. I always spend my first couple of submissions training a model, knowing that it's not good, and improving it a bit, because I want to see a relationship between my validation score and the leaderboard. If it's not there, then I have to think again: okay, what's going on? Why is the validation not working? That's the first step of the whole process. I don't start with deeper analysis or more modeling before I have a solid validation framework established, because that's the bread and butter of everything else that follows. If your validation is bad, everything that comes after is not working.
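
A minimal sketch of such a cross-validation foundation might look like this; the model and metric are placeholders, the point being a single out-of-fold score you can track against leaderboard feedback.

```python
# Out-of-fold cross-validation: one local number per experiment that you can
# compare against public leaderboard feedback. Model and metric are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_score(X: np.ndarray, y: np.ndarray, n_splits: int = 5, seed: int = 0) -> float:
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, va_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[tr_idx], y[tr_idx])
        oof[va_idx] = model.predict_proba(X[va_idx])[:, 1]
    return roc_auc_score(y, oof)

# If this score and the public leaderboard move together across submissions,
# the validation scheme can be trusted; if not, rethink the split.
```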

 

Branden Murray:

 

Pretty much the same as Faron. I'm not going to bore you with repeating it.

 

Marios Michailidis:

 

I think, on top of what the guys have said, I will add what determines the winner. Obviously, the time you put in: with more time you can search for different things, different patterns in the data. Making good partnerships, because data science is a team sport: being able to find people who can complement your skill set, have experience with different techniques, or can see the data from different angles can lead to very diverse solutions and normally better results. Having access to the right tools and good hardware support: you can run more experiments at a larger scale and cover more ground. Some level of automation, I think, is quite important. You don't have the power to search for everything, so you need to kick in automated things while you focus on the things that really matter, where you can extract more information. A solid individual understanding of the data, trying to get a bit more context and getting specific ideas about how to solve specific problems, can really help. Then a little element of luck, I think, is also a good boost.

 

Jean-Francois Puget:

 

They said it all already. I would just add: try to see whether other people have worked on the same problem and published papers. Especially if you use deep learning, it's evolving so fast. I see in the deep learning competitions that the ones who win are the ones using the latest papers.

 

Why Are There No Women Grandmasters?

 

Moderator:

 

Alright, let's now move on to the Slido questions. Thanks again, Grandmasters, for answering all the questions so far. Let's pick the first one. Why aren't there any women Kaggle Grandmasters on the stage? Are there any in the audience? Does anybody know anyone? We would love to encourage participation. I mean, it seems like it's male dominated, and there's possibly a bias in this data. Anything we can do?

 

Mikel Bober-Irizar:

 

I think it's important that people don't think that you need to do a degree or whatever to get into data science or into Kaggle. I think it's almost a softer science than people think at first, because it's really about experience and all that. I think encouraging people who maybe don't have a background in machine learning to also give it a go and see if it interests them, and opening it up to more people, would be quite good.

 

Moderator:

 

The best software engineers were women in the early days. The Apollo space mission was all programmed by women. Attention to detail is arguably better in women; I think that's what the research says. Men might just be more, what is it called, the shotgun approach. Maybe there's some kind of brain chemistry that makes them try everything.

 

Marios Michailidis:

 

To be fair, there is improvement, and there are female grandmasters. I don't see any serious obstacles right now to participating on the platform that would be based on gender, for example. It's just a matter of feeling comfortable. As long as you don't feel bad if you don't get it on the first try, that's the only thing that can hold you back, in my opinion, on a platform like Kaggle.

 

Driverless AI in Kaggle

 

Moderator:

 

What's the point now that you have Driverless AI? Why even compete in Kaggle? Or, actually, I would rephrase it: is there any use of Driverless AI in Kaggle?

 

Dmitry Larko:

 

We use Driverless AI to get insight into the data during Kaggle competitions. We're not allowed to use it for final submissions, that's for sure, because it's commercial software, but that doesn't stop us from using it to get insight into which features actually interact the most. I think Serge, in his presentation, compared it to an autopilot. It is basically an autopilot, but you still have to take off and land the plane yourself. There are additional things you need to do. You still have to think about carefully designing your experiment and your validation schema. You still need to try things by yourself. Like a pilot learning to fly: even if they do have an autopilot available, they still need to know how to land and how to operate the airplane. It's pretty much the same. Driverless AI is an extremely useful, sophisticated automated tool, but it doesn't have the insights about the data that you have. It doesn't have any domain knowledge; it performs a huge, relentless search. It helps you speed up your search, but it doesn't replace your insights and your looking into the data.

 

Marios Michailidis:

 

It is also meant to be production ready, which means it may not always be suitable to win a Kaggle competition. What I mean is, for example, in a Kaggle contest you're allowed to look at the test data. Obviously, you don't know the answer, what you're predicting, but you can see the distribution of the test data, and you can use that information to enhance and enrich your training data. This is something we quite often do. You probably wouldn't do it in a production environment, but you're allowed to do it in this competitive context because the test data is available. So you see, there is still room for manual intervention. In any case, an automated process like this raises the bar, I agree, but if everybody has access to it, we can just push the boundaries even further with individual effort.

 

Sudalai Rajkumar:

 

In general, when I start Kaggle competitions, what I used to do is take the data set and run an XGBoost model or LightGBM models. Now what I'm doing is just running Driverless AI, getting a baseline, and also getting the important features out of it. Now that the baseline is high, I need to work harder to beat it. That's how I'm improving myself by using Driverless AI.
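
For comparison, the manual version of that workflow, a quick LightGBM baseline plus its feature importances, might look roughly like this; the data, columns, and parameters are placeholders.

```python
# Quick LightGBM baseline with feature importances as a starting point for
# further feature engineering. All names and parameters are placeholders.
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")
X = train.drop(columns=["target"]).select_dtypes("number")
y = train["target"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(50, verbose=False)])

importance = (pd.Series(model.feature_importances_, index=X.columns)
              .sort_values(ascending=False))
print(importance.head(20))  # the features worth digging into first
```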

 

Mathias Muller:

 

Another thing, maybe: tools like Driverless AI can actually take away the boring part, the things a data scientist has to do but doesn't want to do all the time, like dealing with how to deploy models and stuff like that. I'm okay with an automated system doing that for me, because I don't want to deal with it. For me, it's boring. I want to get more insight out of the data; that's what humans excel at, and that's what I want to spend time on. Tools like Driverless AI are just a help to find more time for the really cool things. I think it's a little bit like chess. Nowadays, no grandmaster in the world can beat the best chess engines, but people are still playing chess. We could ask the question, why are people still playing chess now that the computers are so much better at it? But they do use those computers to improve their own games, to find more insights, combining the power of artificial intelligence and the human brain. I think that is also an interesting point of view on the whole thing because, currently, it's still really separated. We don't have that deep an understanding of the human brain yet, and artificial intelligence is still kind of shallow, but that will change, and that is more how I would look at this.

 

Moderator:

 

There was a presentation earlier today that showed that chess is enhanced by humans, that the robots plus the humans together always beat the robots alone. 58 to zero or something. The combination of humans with machines is always better. I think the same is true here. Your Kaggle insights will help your performance as a data scientist, and it doesn't hurt that the machine is there as well.

 

What Tools Do You Use?

 

Moderator:

 

Do you use R or Python? Can you give a very quick, short overview of what tools you're using? Or do you use Scala, or what is it? Do you have a supercomputer at home or just a laptop?

 

Mikel Bober-Irizar:

 

I'm using Python mainly, pretty much for everything. I think we may have a couple of R users here who are still sort of stuck in the dark ages, but never mind. In terms of computing power, having access to a lot of it is helpful, and I think we all have access to quite a lot. But there's a lot you can do with just a normal desktop or a laptop. It's really just that extra edge, which I will admit is there, from having a lot more computing power. In terms of favorite tools, I'm still quite a big fan of XGBoost. It sort of seems to work on everything. I know everyone else has switched to LightGBM, so there's me in the dark ages, but I still seem to do better with it. I prefer it. I like XGBoost.

 

Darragh Hanley:

 

For myself, I'm an R user, and I've found data.table very good. It was great to see the presentations here today. I find R very good for munging data, and it does very good visualizations. You can quickly iterate over ideas, and it's quite flexible for managing ideas, seeing the results, and seeing how they would look within a model. At the moment, I'm spending a lot of time with Keras and TensorFlow because I think there are a lot of levels to what you can do with deep learning. I'm spending a lot of time with that, and I enjoy it. I was thinking it's sort of a love-hate relationship: I like it, but it hates me.

 

Mathias Muller:

 

For me, it's mostly Python, coupled with C++ if it's time-critical code. Basically, all the Python libraries like XGBoost, LightGBM, et cetera. The common stack, which is now used by almost everyone, I would say. That's what I'm using as well.

 

Branden Murray:

 

I'm an R user, mostly. I do deep learning stuff in Python, but in R, I use a lot of data.table. Shout out to Matt, who does a lot of work on that. For deep learning, I've been using PyTorch lately; I used to use Keras, but now PyTorch a lot more. Obviously, XGBoost or LightGBM for other machine learning stuff.

 

Marios Michailidis:

 

XGBoost and LightGBM for gradient boosting, PySpark for Random Forest implementations, LIBSVM for linear models and support vector machines, follow-the-regularized-leader (FTRL), normally in a Python implementation, and TensorFlow with Keras for deep learning.

 

Dmitry Larko:

 

It's interesting, nobody mentioned H2O, actually, so I'm going to do that: H2O. I love PyTorch, but usually I mostly use Keras. I don't know why; it's kind of a strange relationship. Obviously, you guys mentioned everything else. On Kaggle, you always have a more or less classical set of algos you're always going to use.

 

Jean-Francois Puget:

 

For me it's Python as well, but I spend time writing efficient Python code. It's very easy to write very slow code in Python, and if you have limited time to compete, that'll be bad.

 

Sudalai Rajkumar:

 

Like Marios was saying, I am also a Python user, so I use what Marios mentioned most of the time. In addition, recently I also started using Driverless AI to get the feature interactions, which features are useful, and so on, so that I can use them in my models.

 

Does Your Design Match Real World Validation Sets?

 

Moderator:

 

That's the second question: does your validation set design match real-world validation sets? Or, more generally, is what you do in Kaggle useful for real-world production use? Or are you really just tweaking some kind of leakage problem and playing Sherlock Holmes in a local CSV file? Or is it actually useful information that you're learning?

 

Mikel Bober-Irizar:

 

Of course, there are leakage problems, and that's happened a lot lately on Kaggle, where there's accidentally been additional information in the data which helps you on the leaderboard but which perhaps you wouldn't have in real life. But I think, for the most part, it is applicable to the real world. What tends to happen is, when you do a Kaggle competition, the top solutions will be sort of monstrosities of ensembles that, when you rerun them, give different results. Then the organizers can take the insights from the top solutions and make a kind of diet model with just the useful stuff, which is maybe a little bit worse. I think it's still useful for the organizers. In terms of whether the validation set matches, once again that's up to the organizers: if you want to be able to predict the future, then you should give us a test set that is in the future. I think, for the most part, Kaggle is quite useful for companies. I mean, that's why they keep coming back, but they will have to do some tweaking on those solutions.

 

Darragh Hanley:

 

It's a good question. I think there's more overlap than is sometimes recognized. Thinking about how to represent data for a problem is quite important. Getting comfortable with tools makes you better able to think about things outside of the tool, because you're not as worried about whether you're using the tool correctly or whatever. I'd say that from those two perspectives. I've seen a lot of use cases professionally that draw on ideas taken from Kaggle.

 

Mathias Muller:

 

I would also say it's really useful, because the problems on Kaggle are actually real-world data sets, so sometimes they are not properly set up, and leakage can make a huge impact, of course. If we, as competitors, find a way to exploit leakage, we do it, because it makes our score better. That's how the game works. But consider the same problem in your day job, where you are not able to identify that leakage. I don't want to know how many models out there in production are really bad because they are trained on leaky data. That is a really important skill to develop for yourself: to identify whether the model you're building is actually useful, and whether there is any leakage you have to think about, because most of the time those leakage issues arise at the very beginning of the pipeline, at the data collection or data storage stage, and not necessarily only at the modeling stage itself. I think that makes it even more useful, because you acquire skills there that you really need in your day-to-day job if you're a data scientist and you have to build such a data science pipeline.

 

Marios Michailidis:

 

I have very little to add here; it was pretty much covered. But obviously, there is a lot more going on from the point where you collect the data and set up the problem to how you take the results, productionize them, and utilize them. Sometimes it is very difficult, and people are critical about Kaggle having only one data set, but you need to make it so that it can be used in a competitive context, with a common measure to test everyone. Quite often there are pitfalls: people introduce leakage, introduce stuff that shouldn't be there. Having that experience, you definitely learn a lot about the best way to represent your data to avoid such problems. Generally, this is an obviously important matter which needs to be analyzed very carefully.

 

Jean-Francois Puget:

 

Have you hacked the Kaggle scoring? What people hack is not the scoring, which is documented. What people hack is the target of the test data. This is leakage. With all this discussion, it's amazing how good the Kaggle community is at finding leaks, finding signals that the organizer thought they had removed. It's amazing. If there is a leak, it'll be found. This happens in reality as well. A leak is when you believe you have new data with nothing about the target, but in fact you included the target somehow in preparing the data. This happens in reality as well. It's a big danger, and learning how to detect it is also useful. Even if it looks like a drawback, it's actually very useful.

 

Predict The Future?

 

Marios Michailidis:

 

Maybe I'll pick another question: do the nine of you grandmasters ever think of yourselves like Ocean's Eleven and want to predict the next lottery win? We wish we could. It's more difficult. I think most of us are engaged in some form of activity to predict something which may yield some return. I know some people do Bitcoin predictions over there. Mikel? I also try to optimize my own. I mean, maybe that's the problem. If we worked together, we could make more.

 

Dmitry Larko:

 

I would also like to add that it's not very efficient to predict the next lottery win because the pool is too small. We have to wait for the pool to increase, basically. As soon as the jackpot becomes bigger, that would be a more effective way to spend your time, so the next lottery win is useless to predict.

 

Are the Methods Science-Based?

 

Moderator:

 

We also know that half the money is gone already. It's not a very good way to make money. The expected return is pretty low. Is it a science, or is it all just random hackery, trying to make gold out of a rock?

 

Dmitry Larko:

 

It's both, to be honest. It does have actual science inside, because you still have to create hypotheses and test them. You still have to be rational and not emotional about your beliefs about the data and the predictions of your model. You still have to criticize your approaches over and over again and build the right validation. It's a scientific approach to data science, basically, in a competition. But it does require some hackery as well, in deep learning for example: there are tons of hacks that have not actually been proven, there's no theoretical proof at all, but they work, so we just use them.

 

Jean-Francois Puget:

 

I think it's a science by its method. You make a hypothesis, you design an experiment to test it, and depending on the result, your hypothesis is confirmed or not. The basic hypothesis is: this feature is going to help my model. It's a simple hypothesis, but you have to test it in a scientific way to make sure whether it's true or not. If you've trained in a scientific domain, you can do well in data science. For instance, physicists usually become great data scientists. It's not by chance. Then, of course, there is some black magic as well, but there is some science.

 

Marios Michailidis:

 

Have you hacked the Kaggle scoring algo? What was that answer? Oh, sorry. In any case, I mean, this is how we made it that high: we found a way to secretly increase our scores. No, but occasionally people do find faults, either in the scoring system in general or in a specific competition. Generally, it is quite an honest community, and these faults are identified and resolved.

 

Sudalai Rajkumar:

 

In one of the competitions, what actually happened was that the metric was AUC. At some point, one participant's plan was to take his actual predictions and then submit one minus those predictions. That gave a score of 0.0-something from the start. Finally, on the very last day, he just reversed the scores, and that way he hacked the system to get into first place. No one else knew that he was actually improving his score. That's one way people hack.
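
The trick works because AUC depends only on the ranking of the predictions, so reversing them gives exactly one minus the original score. A tiny illustration with made-up data:

```python
# AUC(y, 1 - p) == 1 - AUC(y, p): a strong model can be disguised as a
# terrible one and flipped back on the last day. The data here is synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
p = np.clip(y * 0.7 + rng.normal(0.0, 0.3, size=1000), 0.0, 1.0)

print(roc_auc_score(y, p))      # a strong score, e.g. around 0.95
print(roc_auc_score(y, 1 - p))  # exactly one minus the score above
```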

 

Mikel Bober-Irizar:

 

I want to point out a question. Someone asked whether they should fear joining Kaggle, and that's something I hear quite a lot, but it sort of doesn't make sense to me, because it's not like you get kicked out of the secret club if you're at the bottom. When I joined Kaggle, I barely knew how to write Python. I had no idea what I was doing. I was going to Kaggle kernels, taking code that people had posted, changing the parameters, and just clicking run over and over again. Back then, I wasn't doing very well, but it's sort of doing that that led me to actually figure out what was going on. It's not a case of first learning how to do machine learning and then, okay, now I'm ready for Kaggle. That's actually part of the journey. Even if you have no idea what you're doing, it's still good to have a go.

 

Any Tricks to Help Your Score?

 

Moderator:

 

In German, there's a saying: the path is the goal, or something. Keep doing what you like doing. Of course, I guess you have to like it; you can't force somebody to do this. But if you have a little bit of this addictive personality, you will definitely enjoy it. I think it's good for your career. It's good to learn this stuff. Even though there are tools out there like Driverless AI, you don't need to stop thinking about the depth of the trees; it's better if you understand what that is. But you shouldn't have to do everything by hand all the time. Are there any last quick hacks that you can share? Like some kind of secret trick? I know Dmitry has a secret trick that he applies at the end of Kaggle competitions. I'm not sure if he's willing to share it, but are there any short summary tricks where you say, just do this? Or give us a hint of what you did to make your score go up.

 

Dmitry Larko:

 

Sleep less, try more.

 

Moderator:

 

No sharing.

 

Mikel Bober-Irizar:

 

We want to keep our positions, right? Can't give it away.

 

Moderator:

 

Great.

 

Jean-Francois Puget:

 

Never give up until the last minute. You can make progress.

 

Moderator:

 

How about taking other people's submissions and averaging it in submitting that? I mean, everybody is doing that now.

 

Jean-Francois Puget:

 

Beware of whom you team up with. There are some rules on Kaggle, and if you break them and you get caught, you are removed. It happened to me, but I was not the guilty one.

 

Marios Michailidis:

 

Look for last-minute kernels that have high scores. If you're quick, you might just click it and get a good score.

 

Jean-Francois Puget:

 

Keep a submission till the last minute.

 

Marios Michailidis:

 

The last minute. Keep following the results.

 

Dmitry Larko:

 

That is actually good advice. Do not start a competition too early, because somebody might find a leakage later and you will basically have wasted your time. Just wait a couple of weeks. People will write kernels, people will find the leakage, and you'll just use their findings.

 

Two Submissions in Kaggle

 

Moderator:

 

What do you think about the fact that you can make two submissions? It's not quite fair. You should have one shot, not two shots. You can always have some kind of a backup, and in real life, you might not have two shots at going to production.

 

Mikel Bober-Irizar:

 

Kaggle has a thing where, for the private leaderboard, you choose any two submissions, and the best score from those two is the one that actually counts. I think it's not that indicative of production, but it's nice because it allows you to try two different ideas when you don't know which one might be better. It gives you that peace of mind and sort of lets you sleep before the competition ends.

 

Marios Michailidis:

 

You need a good strategy, I guess, to select which these two submissions should be, because you might have thousands of models you could select from. Something that I have done that kind of helps: I normally pick the one which works best internally, in my internal validation framework, and I also select the one that works best on the public part of the leaderboard. I select these two.

 

Moderator:

 

That was a tip. Thanks again for coming. Thank you.