September 16th, 2019
From Academia to Kaggle and H2O.ai: How a Physicist found love in Data Science
Category: H2O Driverless AI, Machine Learning, Makers
By: Parul Pandey
Learning and taking inspiration from others is always helpful. It makes even more sense in the Data Science realm, which is continuously being bombarded with new courses, MOOCs, and recommendations with every passing day. So many choices can be not only overwhelming but also perplexing at times. With this thought in mind, we bring to light the stories of established Data Scientists and Kaggle Grandmasters at H2O.ai, who share their journey, inspirations, and accomplishments with us. These interviews are intended to motivate and encourage others who want to understand what it takes to be a Kaggle Grandmaster.
This time I had the chance to interact with Bojan Tunguz, a Double Kaggle Grandmaster and a physicist. Bojan was born in Sarajevo, Bosnia & Herzegovina, and migrated to the United States in 1993. He holds a Ph.D. in Physics from the University of Illinois and a Master’s in Applied Physics from Stanford University. He has also taught various Physics courses at reputed institutions such as Stanford, DePauw, Rhodes College, and the University of Illinois.
Bojan is just a single solo gold away from being a Triple Grandmaster, and we are hoping he achieves the feat soon.
Here is an excerpt from my conversation with Bojan:
You have a background in theoretical Physics. How did the transition from academia to industry happen?
Bojan: To paraphrase Hemingway, first suddenly, then gradually. Academic jobs are tough to find, especially in oversaturated fields such as theoretical Physics. They are ten times harder when you are an immigrant. At one point, I decided that settling down and starting a family was much more important than chasing ever-elusive academic appointments. So I was forced to look into alternative options. Initially, I dabbled in tech writing and small-scale desktop manufacturing, but once I discovered data science, I was hooked. Data science provides the right balance between science and technology that I’ve always looked for.
How hard is it to become a double Kaggle Grandmaster? What initially attracted you to Kaggle, and when did the first win come your way?
Bojan: I am currently a Grandmaster in Kernels and Discussions and one solo gold away from being a Grandmaster in competitions as well. Each one of those categories requires a different set of skills and sensibilities. Competitions are by far the most demanding in terms of technical skills. For discussion and kernels, other skills such as communication, social skills, writing, and data visualization are essential. Equally important is understanding the dynamics of Kaggle competitions: what to post and when; and what content other Kagglers find valuable/interesting/amusing.
Ever since I decided to pursue Data Science, I’ve heard about Kaggle. It is almost impossible to get serious about Data Science and not know about it. However, I was initially hesitant to join since I assumed that it would require a high skill level to get anywhere. However, once I joined, I realized that there are a lot of valuable resources available on this platform. Discussions, kernels, past competition solutions, online blog posts, etc. can help beginners get up to speed quickly. Once I became reasonably good, I also started teaming up more frequently, and I’ve learned so much from each one of my team-ups so far.
I won my first gold about a year and a half ago in the IEEE camera sensor recognition competition. I also secured first place in the Home Credit default risk competition a year ago. It has been by far the most memorable and amazing Kaggle experience for me.
How do you decide which competitions to participate in?
Bojan: That’s an easy one: I enter all of them! 😃 However, I usually pick one or two at a time. I look into many different factors when choosing a competition: how much of the modeling pipeline I already have in place, how much effort is required to get just a good baseline, how stable the predictions are between the local score and the leaderboard, and, very importantly for me, how much of a boost I can get from ensembling.
How do you typically approach a Kaggle problem? Any favorite ML resources that you would like to share?
Bojan: I usually first start small and then gradually build upon that. I don’t try to get the “best” solution right away.
My toolkit is pretty standard for Kaggle: pandas, numpy, sklearn, XGBoost, LightGBM, Keras. I work in Jupyter notebooks, and I am a big fan of Jupyter Lab.
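To make the "start small" approach concrete, here is a minimal sketch using scikit-learn, one of the libraries mentioned above. It is an illustration only and not Bojan's actual code: the synthetic dataset, the choice of ROC AUC as the metric, and the helper name `local_cv_auc` are all assumptions. It scores a simple baseline with cross-validation first, then checks whether a somewhat stronger model improves on it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a competition's training data (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Fixed folds so every model is scored on exactly the same splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def local_cv_auc(model):
    """Hypothetical helper: mean ROC AUC across the folds (the 'local score')."""
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()


# Step 1: a small, fast baseline.
baseline_auc = local_cv_auc(LogisticRegression(max_iter=1000))

# Step 2: build on it with a stronger model and compare on the same folds.
boosted_auc = local_cv_auc(GradientBoostingClassifier(random_state=0))

print(f"baseline AUC: {baseline_auc:.3f}")
print(f"boosted AUC:  {boosted_auc:.3f}")
```

In a real competition, the same local score would then be compared against the public leaderboard to judge how stable the validation setup is.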
As a Data Scientist at H2O.ai, what are your roles and in which specific areas do you work?
Bojan: Most of the time, I am focused on the outward-facing role(s) where I help our clients with the data science and ML problems that they are working on. I also contribute to Driverless AI, where I focus on various feature transformation “recipes” and unsupervised learning.
If you were to team up with grandmasters at H2O.ai, who would they be and why?
Bojan: All of them! I was already able to team up with Olivier Grellier even before I joined H2O, and now I am in a competition where I’ve teamed up with Dmitry Larko. Many of the people at H2O.ai are the ones whom I’ve looked up to for years. It is such an incredible honor to be able to call them colleagues now. It is my dream to team up with as many of them as possible.
What are some of the best things that you have learned via Kaggle that you apply in your professional work at H2O.ai?
Bojan: One of the most important “meta” skills that you learn from participating in Kaggle competitions is to be able to adapt and try many different things quickly. Furthermore, since most of your “great” ideas have a likelihood of failing, you learn always to have at least one contingency plan, and often half a dozen of them.
The Data Science domain is rapidly evolving. How do you manage to keep up with all the latest developments?
Bojan: This is one of the main reasons why I keep participating in Kaggle competitions. For each new competition, I can apply at least one new technique, library, or framework. Kaggle competitions are probably the single best way of keeping your applied ML skills sharp and up to date.
Are there any specific areas or problems where you would want to apply your expertise in ML?
Bojan: I enjoy NLP and wish I could spend more time on those kinds of problems. In some ways, those sorts of problems are closest to what “classical” AI was supposed to tackle.
I also enjoy working on FinTech problems, but in my experience, FinTech is still much more Fin than Tech. I feel that we are still only scratching the surface of what can be accomplished with ML in that domain.
One area where I have been fortunate to do well on Kaggle lately is the “pure” science problems. These problems range from protein classification to predicting scalar coupling constants using just ML techniques. “Pure” science is yet another area where advanced ML skills can potentially have a huge impact.
A word of advice for the Data Science aspirants who have just started or wish to start their Data Science journey?
Bojan: There hasn’t been a better time than this to start working in the Data Science domain. Data Science, in my view, is the most open professional field today. Whether you are just beginning or are a “seasoned” Data Science veteran, there are tons of resources out there that can help you on your journey: informative textbooks, blogs, and webinars; Kaggle; MOOCs; online forums; open-source software; accessible practitioners. Familiarize yourself with all those resources. Come up with a reasonable plan that works for you. It’s essential to build your technical skills, but invest time to work on your “soft” skills as well: writing, communication, networking, etc. Be patient with yourself and allow yourself time to grow and develop. Don’t be afraid to fail. Make a point of learning from your failures. And, most importantly, try to have fun and enjoy your journey in its own right as much as possible.
Becoming a Kaggle Grandmaster is no mean feat and requires a lot of hard work, consistency, and focus. Having said that, there is also no denying the fact that nothing is impossible. We just need to channel our efforts in the right direction and with the right tools.