Are you a baseball fan? If so, you may have noticed that things are heating up right now, with the Major League Baseball (MLB) World Series between the Houston Astros and the Atlanta Braves tied at 1-1.
This also reminded me of the MLB Player Digital Engagement Forecasting competition, in which my colleagues and Kaggle Grandmasters, Branden Murray, John Miller, Mark Landry, and Marios Michailidis, earned a second-place finish earlier this year. So here is a blog post about their baseball adventure. You will find an overview of the challenge, their solution, and some interesting facts about each team member.
“In this competition, you’ll predict how fans engage with MLB players’ digital content on a daily basis for a future date range. You’ll have access to player performance data, social media data, and team factors like market size. Successful models will provide new insights into what signals most strongly correlate with and influence engagement.” – challenge description from Kaggle.
In short, the participants were asked to predict four digital engagement targets based on player/team performance as well as social media data. To prevent information leakage, the sources of the four numerical targets in the training set were never disclosed to the participants. John did a post-competition analysis, and his hypothesis was that the targets represent typical digital engagement measures such as shares/retweets, likes, mentions, and comments. Yet no one really knows what the targets mean except the competition organizers.
If you are interested in a data deep dive, you can download the data and check out some of the exploratory data analysis notebooks that won the explainability prizes. Below are visualizations from those notebooks.
“Our final model was a blend of five LightGBM models with different settings and one XGBoost
model for each target (24 models in total). The most important features were aggregates
of historical targets for each player, features related to the date, and the number of Twitter
followers. Actual “baseball game” related features that were important were the daily
maximum Win Probability Added around the league (a proxy for if a player somewhere
in the league had a really good game), a player’s current ranking in the home run race,
and whether there was a walk-off somewhere in the league.” – competition summary by Branden.
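The blend Branden describes (five LightGBM models plus one XGBoost model per target, 24 models in total) comes down to averaging the per-target predictions of several models. The helper below is a minimal, hypothetical sketch of that blending step, not the team's actual code; the function name and the equal-weights default are my own assumptions.

```python
import numpy as np

def blend_predictions(model_preds, weights=None):
    """Blend predictions from several models for one target.

    model_preds: list of 1-D arrays, one array of predictions per model
                 (e.g. five LightGBM models and one XGBoost model).
    weights: optional blend weights; defaults to a simple average.
    """
    stacked = np.vstack(model_preds)          # shape: (n_models, n_rows)
    if weights is None:
        weights = np.full(len(model_preds), 1.0 / len(model_preds))
    return np.average(stacked, axis=0, weights=weights)

# Toy usage: two models' predictions for one engagement target.
blended = blend_predictions([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
# blended -> array([2., 3.])
```

In practice the same blend would be repeated for each of the four targets, each with its own set of fitted models.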
In short, the team generated some highly predictive features based on their years of Kaggle experience and blended different models together for robust predictions. The outputs are four digital engagement predictions for each player on a specific date. Although the team had no information about the meaning of each target, the predictions are still useful to the competition organizers, who can decode the numbers back into the real digital engagement measures.
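Branden called out aggregates of historical targets per player as the most important features. One common way to build such features (a hedged illustration, not the team's actual pipeline; the column names are made up) is a lagged rolling mean per player, shifted by one day so a row never sees its own target value:

```python
import pandas as pd

# Toy frame: one row per player per date with one engagement target.
df = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2021-05-01", "2021-05-02", "2021-05-03"] * 2),
    "target1": [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
}).sort_values(["player_id", "date"])

# Rolling mean of each player's past target values. The shift(1)
# excludes the current day, which prevents target leakage.
df["target1_hist_mean"] = (
    df.groupby("player_id")["target1"]
      .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)
# For player 1 the feature is [NaN, 10.0, 15.0]: nothing before day 1,
# then the mean of the previous one or two days.
```

The same pattern extends to other windows (7-day, 30-day, all-time) and other statistics (max, median), giving each player a compact summary of their recent engagement history.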
For those of you who are interested in technical details, here is a summary of the models and features. You can check out the end-to-end process in this notebook. Of course, you can also skip this section and scroll down for the fun facts about each team member.
objective="regression_l2" – targets scaled by double square root
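The "double square root" scaling means training on the fourth root of each target and raising predictions back to the fourth power afterwards, which compresses heavy-tailed engagement counts before fitting an L2 (squared-error) objective. Here is a small sketch of the transform pair; the function names are my own, and the model-fitting step is omitted:

```python
import numpy as np

def to_double_sqrt(y):
    # Double square root = fourth root; tames heavy-tailed targets
    # before training with a squared-error (regression_l2) objective.
    return np.sqrt(np.sqrt(y))

def from_double_sqrt(y_scaled):
    # Inverse transform, applied to model predictions.
    return y_scaled ** 4

y = np.array([0.0, 1.0, 16.0, 81.0])
y_scaled = to_double_sqrt(y)        # -> array([0., 1., 2., 3.])
y_back = from_double_sqrt(y_scaled) # round-trips to the original values
```

The model is then trained on `y_scaled`, and its raw predictions are passed through the inverse transform before scoring.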
Now let’s find out more about team AutoMLB:
High-quality feature engineering and robust model tuning are the keys to getting the best predictive models. The process of identifying the optimal feature engineering steps and model hyperparameters can be very repetitive and time-consuming. It may also take years of practice to get it right.
The good news is that the process can be automated for many common machine learning use cases. With H2O Driverless AI (part of H2O AI Cloud ), you can leverage our Kaggle Grandmasters’ battle-tested machine learning tricks and build feature engineering and modeling pipelines automatically with ease. Sign up today and give it a try.
Many thanks to my colleagues mentioned above. Here are their Kaggle profiles:
This was not the first time we did something about MLB. A few years ago, I did a Moneyball project which led to a real, multi-million-dollar contract. (Shameless plug, I know.)