H2O Demo: Airlines
H2O data scientist Amy Wang walks us through predicting flight delays using H2O's UI, Flow.
Amy Wang, Data Scientist, H2O.ai
Read the Full Transcript
The following is a demonstration of predicting potential flight delays using a publicly available airline dataset on H2O. For this example, I've spun up a 16-node H2O cluster on EC2 instances, and we will use the entire 26 years' worth of flight information available on RITA. The data set itself is about 152 million rows long and about 14.5 gigabytes uncompressed on disk. In the browser, I have pulled up a Flow notebook, which is a collection of executable cells. Flow is H2O’s web-based interface. It’s a great way for new users to get started and learn all the available features and algorithms that H2O has to offer.
How Machine Learning can Help Predict Flight Delays
In Flow, you can alternate between text cells and executable cells, where the user can type, or have H2O generate, CoffeeScript that can be run programmatically and shared between different users. There are obvious benefits to predicting potential delays and logistic issues. For a business, it helps the user make contingency plans and corrections to avoid undesirable outcomes. For example, recommendation engines can forewarn flyers of possible delays and rank flight options accordingly. Other businesses might pay more for a flight to ensure certain shipments arrive on time, and airline carriers can use the information to improve their flight plans. The goal is to have the machine learn and take in all the possible factors that might affect a flight and return a simple probability of how likely it is that a flight will be canceled or delayed.
Importing and Exploring a Data Set
To start, we're going to import the data set from S3, where the data is actually sitting. All you have to do to execute a cell is hit Control+Return. On the parse setup page, you get to choose the column names as well as the column types. So here for year, I want to choose Enum (an enumerator) instead of numeric. Day of week, as well, gets treated as an Enum, or factor. During the model build process, these columns will get expanded out automatically into dummy variables. Going down: flight arrival delay. Flight number, as well as the route number, is Enum. Tail number is Enum. Perfect. And then you parse.
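To make the dummy-variable expansion concrete, here is a minimal sketch in plain Python. It is an illustration only, not H2O's internal encoding (which may, for example, handle a reference level differently); the function name `expand_enum` and the sample values are hypothetical.

```python
def expand_enum(values):
    """Expand a categorical (Enum) column into 0/1 dummy columns, one per level."""
    levels = sorted(set(values))  # fixed, reproducible level order
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical sample of a day-of-week column, stored as strings so that it is
# treated as a category rather than as a number.
day_of_week = ["1", "5", "7", "5"]

levels, dummies = expand_enum(day_of_week)
print(levels)
for row in dummies:
    print(row)
```

Each row ends up with exactly one 1, in the position of its own level, which is why treating day of week as an Enum lets the model learn a separate effect per day instead of a single slope.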
So the beauty of something like Flow is that you can point and click through an entire workflow, but what's generated as you're pointing and clicking through the interface is CoffeeScript. That is automatically generated on your end. Or, if you prefer to write more programmatic scripts, you can write the CoffeeScript yourself, save the notebook, and share it between users. Okay. Frame. And now we have a summary of the frame right here, where we can also convert column types to numerics or Enums if necessary. If you want to visualize or explore a specific column, let's say origin, let's look at a tally of all the different factor levels in the column and the count of each of those factor levels. So it gives you an idea of how the features are distributed. So in this case, it does look like our particular data set has a lot of flights from a few origins.
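The tally of factor levels described above amounts to a simple count per distinct value. A minimal sketch in plain Python, assuming a made-up handful of origin codes rather than the real column:

```python
from collections import Counter

# Hypothetical sample of origin airport codes; the real column is far larger.
origin = ["ORD", "ATL", "ORD", "DAL", "ATL", "ORD"]

# Count how many rows fall into each factor level.
level_counts = Counter(origin)

# List the levels from most to least frequent.
for level, count in level_counts.most_common():
    print(f"{level}\t{count}")
```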
Building a Predictive Model from a Data Set
This is Origin, so there are a lot of flights coming out of O'Hare in Chicago, Atlanta, or Dallas Love Field in Dallas. Once you're done exploring the data and you actually want to build a predictive model that will get put into production, you can click Build Model, and it will pull up a list of the algorithms that we currently have exposed to the front end. We have a whole host of algorithms that we are currently working on, and we will slowly add them to the list of models that you can build. For this demo, we're going to build a generalized linear model, specifically a logistic regression model. When you choose to build a GLM, you just click GLM and it will populate a point-and-click form. To really track the speed and scale of something like H2O, you have to track the performance of the model build, which you can do by going to Admin and then Water Meter, which will pull up perf bars, or performance bars. Here we have 16 blue boxes. Each of the lines in a blue box represents a core on the machine. So there are 16 nodes in the cluster and about eight cores on each of those nodes. You can choose the airlines data set here and then point and click and say ignore all. We're going to ignore all the columns except year, month, day of month, day of week, the carrier, flight number, origin, destination, and distance of the flight. Choose binomial.
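Choosing the binomial family means the model is a logistic regression: it passes a weighted sum of the inputs through the logistic function to get a probability between 0 and 1. A minimal sketch of that prediction step follows; the coefficient values and the feature names are made up for illustration and are not taken from the model in this demo.

```python
import math

def predict_delay_probability(intercept, coefficients, features):
    """Logistic regression prediction: logistic function of the linear predictor."""
    linear = intercept + sum(coefficients[name] * value
                             for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-linear))

# Made-up coefficients for illustration only; a real model learns these.
# Enum columns appear as 0/1 dummy variables, one per level.
coefficients = {"Distance": 0.0002, "Origin.ORD": 0.4, "DayOfWeek.5": 0.3}

# One hypothetical flight: 1000 miles, departing ORD on day 5 of the week.
flight = {"Distance": 1000.0, "Origin.ORD": 1.0, "DayOfWeek.5": 1.0}

p = predict_delay_probability(-1.0, coefficients, flight)
print(f"probability of delay: {p:.3f}")
```

Building the model is the process of finding the intercept and coefficients that make these probabilities match the observed outcomes as closely as possible.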
Analyzing a GLM Model
And the response column is departure delay, and then just run it. I'm going to pull this aside here. When I build a GLM, what you see on the side in green is user time, or computation time. When I execute the build, it will essentially spark up all 16 machines, with parts of the data, or chunks, on each of those machines being used to build the GLM. So the process and the computation are completely distributed and parallelized. Once your model is done building, the model output includes everything from ROC curves for binary classification problems to the coefficients for the GLM. In the case of tree-based models and deep learning models, you can get variable importance in a similar bar chart fashion. Scrolling down, you have the scoring history. For GLM, you have the objective value over iterations; for many of the other algorithms, it is the scoring history of decreasing mean squared error over an increasing number of trees or increasing epochs of the model build.
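The idea of each machine computing on its own chunks of the data can be sketched as a split-and-combine pattern: every chunk produces partial sums, and the partial sums are added together. The example below applies that pattern to one gradient evaluation for a one-feature logistic regression. It is an assumption-laden illustration, not H2O's actual solver, and it runs the chunks sequentially where a cluster would run them in parallel; all names and data are hypothetical.

```python
import math

def chunk_gradient(chunk, weight, bias):
    """Per-chunk step: logistic-loss gradient sums over one chunk of (x, y) rows."""
    grad_w = grad_b = 0.0
    for x, y in chunk:
        p = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
        grad_w += (p - y) * x
        grad_b += (p - y)
    return grad_w, grad_b, len(chunk)

def combine(partials):
    """Combine step: add the per-chunk sums and average over all rows."""
    total_w = sum(part[0] for part in partials)
    total_b = sum(part[1] for part in partials)
    n = sum(part[2] for part in partials)
    return total_w / n, total_b / n

# Toy rows of (feature, delayed?) split into chunks, as if held on separate nodes.
rows = [(0.5, 0), (1.5, 1), (2.0, 1), (0.2, 0), (1.0, 0), (2.5, 1)]
chunks = [rows[0:2], rows[2:4], rows[4:6]]

# Each chunk is processed independently; only small partial sums are combined.
partials = [chunk_gradient(chunk, weight=0.0, bias=0.0) for chunk in chunks]
grad_w, grad_b = combine(partials)
print(grad_w, grad_b)
```

Because the partial sums are simply added, the result matches what a single machine would compute over all the rows, while the per-chunk work stays next to the data it needs.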
Applications of a Predictive Model
And if you specify a validation set, it will give you all those same summaries, except that instead of on the training set, you'll get them on the validation set. And finally, what we have here is the plain old Java object, or POJO, which you can download with a simple curl command, compile against the h2o-genmodel.jar file, and put into production in anything that can read Java code. Essentially, this could be taken into a Storm bolt. It could be read into Spark Streaming, so you can do real-time streaming predictions, or it could be written into a Hive UDF project. All of this is up to the user's use case and environment, with very low cost on the part of the data scientist or the developer who would normally need to take an algorithm that a data scientist has built and translate it into something that can be taken into production.