There is more to competitive Data Science than simply applying algorithms to get the best possible model. The main takeaway from participating in these competitions is that they provide an excellent opportunity for learning and skill-building, and those learnings can then be applied in one's academic or professional life. Kaggle is one of the most well-known platforms for Data Science competitions, offering a chance to compete on some of the most intriguing machine learning problems. People with a wide range of experience take part in these competitions. Some do extremely well and go on to earn the title of Kaggle Grandmaster. In this series, I bring to light the amazing stories of Kaggle Grandmasters.
In this interview, I'll be sharing my interaction with Yauhen Babakhin, a Kaggle Competitions Grandmaster and a Data Scientist at H2O.ai. Yauhen holds a master's degree in Applied Data Analysis and has more than five years of experience in the Data Science domain. He happens to be the first Kaggle Competitions Grandmaster in Belarus, having secured gold medals in both classic Machine Learning and Deep Learning competitions.
Here is an excerpt from my conversation with Yauhen:
Yauhen: I started my journey in Data Science with an online course on Edx. The course included a Kaggle competition, and that was my first introduction to Kaggle. I finished in 450th place (with a top-50 position on the Public Leaderboard), but it was a great learning lesson and motivation to continue learning Data Science. About six months later, I took part in my second Kaggle competition and won a silver medal.
Currently, I'm the first and only Kaggle Grandmaster in Belarus, but there are a good number of Kaggle Masters, and the Data Science community is growing really fast.
Yauhen: I would say that all competitions are challenging, but if I had to choose, I'd say Deep Learning competitions are comparatively harder. They entail long training times, so it isn't possible to iterate fast. This makes it hard to evaluate all of one's hypotheses, and it leaves much less room for error.
Yauhen: I usually wait a week or two after the launch of a competition. If there are problems with the competition, such as data leaks or incorrect data, they are usually discovered in the first few weeks, and a lot of time is saved.
Then, just like everyone else, I begin with exploratory data analysis, followed by hypothesis generation, establishing local validation, and trying different ideas and approaches. Usually, I create a document where I store all the thoughts, assumptions, papers, and resources that I think could work for a particular competition. Sometimes such a document grows to as many as 20 pages of text.
In addition to the above, I make it a point to follow all the forum and kernel discussions during the competition to get different perspectives on the same problem.
Yauhen: It's hard to follow every possible resource, as progress in Machine Learning and Data Science is happening at a very rapid pace. So, I try to limit myself to the specific domain or problem that I'm working on at the moment.
First of all, I would name the Open Data Science community (ods.ai). It's a mostly Russian-speaking Slack community with about 40 thousand members and channels on almost every topic in Data Science. One can get information on any Data Science concept there in a matter of seconds. However, if I need a deeper understanding of some material, I go directly to blog posts, papers, YouTube videos, etc.
Speaking of programming languages, I started my journey in Data Science using the R language, but now I mostly use Python.
Yauhen: I'm currently working on AutoML models in the Computer Vision domain. H2O.ai's Driverless AI is already equipped to work with tabular, time series, and text data. Now we're moving forward to make Driverless AI work with image data for solving problems like classification, segmentation, object detection, etc.
The idea is that once the user provides a CSV file with paths to the images and the image labels, Driverless AI should automatically build the best model from this data within a given time constraint. This entails automating the hyperparameter search (learning rate, optimizer, neural net architecture, etc.) as well as the training process (number of epochs, augmentation selection, etc.). Moreover, this process should be efficient in terms of time and memory used. We are working on this and getting some positive results.
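The core loop Yauhen describes, trying model configurations under a fixed time budget and keeping the best one, can be sketched roughly as follows. This is a minimal illustration of time-budgeted random search, not Driverless AI's actual implementation; the search space and the toy scoring function are made-up stand-ins for real training and validation.

```python
import random
import time

# Hypothetical search space, loosely mirroring the knobs mentioned above.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "optimizer": ["adam", "sgd"],
    "architecture": ["resnet34", "resnet50", "efficientnet_b0"],
    "epochs": [5, 10, 20],
}

def sample_config(space):
    """Draw one random configuration from the search space."""
    return {name: random.choice(options) for name, options in space.items()}

def random_search(score_fn, space, time_budget_s):
    """Try random configs until the time budget runs out; keep the best."""
    best_config, best_score = None, float("-inf")
    deadline = time.monotonic() + time_budget_s
    while time.monotonic() < deadline:
        config = sample_config(space)
        score = score_fn(config)  # stands in for "train + validate a model"
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Toy scoring function: in a real system this would train a network on the
# images listed in the user's CSV and return a validation metric.
def toy_score(config):
    return -abs(config["learning_rate"] - 1e-3) - 0.01 * config["epochs"]

best, score = random_search(toy_score, SEARCH_SPACE, time_budget_s=0.1)
print(best)
```

In a production system the scoring call dominates the budget, so smarter strategies (early stopping of bad configs, Bayesian search) matter; but the contract is the same: data plus a time limit in, best found model out.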
Yauhen: I've participated in several Kaggle competitions with image data. This gives a nice background knowledge of which ideas work best in practice and which of them could be automated. I also keep reading the winners' solutions, even for competitions I didn't take part in.
Such practical tricks are generally not described in any book or online course. That's why Kaggle allows one to stay on the cutting edge of the field and to continuously improve our AutoML pipeline for Computer Vision problems.
Yauhen (3rd from right) with other "Makers" during the H2O Prague Meetup
Yauhen: I have substantial experience in a lot of domains, like classic Machine Learning, NLP, and Computer Vision. However, I have never worked with audio data. So, it would be interesting to apply some techniques to problems such as classification of audio recordings or natural language understanding, to name a few.
Yauhen: In my opinion, one should bear in mind that the Data Science journey should ideally be a combination of theoretical knowledge and practical experience.
Just reading books and blog posts or skimming through online courses will not give you any hands-on experience; you will only obtain theoretical understanding that you won't be able to apply in practice. On the other hand, jumping straight to application would become a monkey job: running simple fit()-predict() and blindly copying public Kaggle kernels, without any understanding of what's going on under the hood, will also lead you nowhere.
So, I would suggest picking a good online course and completing all its exercises. Additionally, participate in Kaggle competitions and apply the theoretical methods that you've just learned from papers, books, blog posts, etc. In this way, you'll have a firm grasp of the fundamentals and will understand their practical utility as well.
Yauhen (7th from left) with other H2O.ai Kaggle Grandmasters during H2O World New York
This interview is another great example of how people like Yauhen work hard in a systematic manner to achieve their goals: they define a clear path and work towards it. Another great takeaway from this conversation is that theoretical knowledge is only useful up to a point; the real test of your understanding comes when you put your learnings into practice.