April 12th, 2013

Predicting Airline Data using a Generalized Linear Model (GLM)


Just recently I created a wiki post on the H2O GitHub page with step-by-step directions on how to predict whether a flight’s arrival will be delayed. I uploaded airline data from the American Statistical Association to H2O and used GLM (generalized linear model), here with a binomial family and logit link, which is better known as logistic regression or logit regression, to predict “IsArrDelayed”. I have pasted snippets below; you can view the full source on GitHub here.
Hacking Airline Data Set with H2O
Here is your data: Airline Dataset
The data comes originally from RITA, where it is described in detail. You can download the data there, or from the bzipped csv files listed below. These files have derivable variables removed, are packaged in yearly chunks, and have been more heavily compressed than the originals.
You can play around with the following data sets. You can also download the data to your local machine, EC2, Rackspace, HDFS, or S3.
Airlines all years 1987-2008:
https://s3.amazonaws.com/h2o-airlines-unpacked/allyears.csv (12 GB)
Or if you want something bigger here is the 10x version:
https://s3.amazonaws.com/h2o-airlines-unpacked/allyears_10.csv (120 GB)
Or if you want to play with something smaller, here is a collection of 2000 rows from the all-years airline data set. This is the data set we will use in this example:
https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv (4.5 MB)
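Once the small file is downloaded, a quick sanity check of the header and row count can catch a truncated download before you upload it to H2O. The sketch below uses only Python's standard library. The function name `summarize_csv` and the inline sample rows are made up for illustration; they are not part of H2O or of the real data set.

```python
import csv
import io


def summarize_csv(text_stream):
    """Return (header, number_of_data_rows) for a CSV text stream."""
    reader = csv.reader(text_stream)
    header = next(reader)                 # first line holds the column names
    row_count = sum(1 for _ in reader)    # remaining lines are data rows
    return header, row_count


# Hypothetical sample standing in for the downloaded file.
sample = io.StringIO(
    "Year,Month,Distance,IsArrDelayed\n"
    "1987,10,447,YES\n"
    "1987,10,447,NO\n"
)
header, rows = summarize_csv(sample)
print(header, rows)

# With the real file (assumed to be saved locally as allyears2k.csv):
# with open("allyears2k.csv", newline="") as f:
#     header, rows = summarize_csv(f)
```

Check that “IsArrDelayed” appears in the header, since that is the column the model will predict.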
GLM Model
Now that you have your data, let’s use H2O to run GLM on it and make some predictions! In this example we will use H2O to put together a model that predicts whether a flight will arrive on time. Look at the “IsArrDelayed” column/category: we will use the rest of the data to predict this column using GLM (Generalized Linear Model).
First off, what is GLM? It is a family of models. The variant we use here, for a yes/no response, is what some call logistic regression and others call logit regression. GLM works best when you know something about the distribution of the variable you are predicting, because that tells you which family and link to pick. By definition, “In statistics, the generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.” (Wikipedia > GLM) In short, it’s a good algorithm for predicting stuff. You can also get more info on the H2O GLM and its arguments here.
Let’s start H2O by following the simplified instructions below to get a model on our data in less than 5 minutes that predicts whether your flight’s arrival will be delayed. If you want more detail on options such as starting an H2O cluster or launching with HDFS or S3, check out this wiki link: Start H2O from the command line
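To make the “link function” in that definition concrete for a 0/1 response such as “IsArrDelayed”: the logit link maps a probability to the real line, and its inverse maps the linear model's output back to a probability. The sketch below is a plain-Python illustration, not H2O code. The intercept and coefficient values in it are invented for the example.

```python
import math


def logit(p):
    """Link function: probability in (0, 1) -> real number (log-odds)."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Inverse link: real number (log-odds) -> probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict_probability(intercept, coefficients, features):
    """Linear predictor followed by the inverse link."""
    eta = intercept + sum(coefficients[name] * value
                          for name, value in features.items())
    return inverse_logit(eta)


# Hypothetical coefficients, chosen only to show the mechanics.
coefficients = {"Distance": 0.0001, "Dest.SFO": 0.28}
p_sfo = predict_probability(-1.5, coefficients, {"Distance": 500, "Dest.SFO": 1})
p_other = predict_probability(-1.5, coefficients, {"Distance": 500, "Dest.SFO": 0})
print(p_sfo, p_other)
```

A positive coefficient raises the log-odds, so the predicted probability of a delay is higher when that feature is present.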

  1. Open your command line and run the H2O jar: java -jar h2o-*.jar
  2. Paste the H2O URL in your browser and you’re on the H2O home page. It should look something like:
  3. Upload your data: Once H2O is running in your browser, click on Data -> Upload to upload your data set.
  4. Parse your data: Once your data is uploaded you will get a look at a sample of your data. Verify that your data has headers, then submit it to be parsed.
  5. Once your data is parsed you will see the .hex file in a spreadsheet view, which shows its size along with the number of rows and columns. Review your data to decide which category/column you would like to predict and which ones to leave out of the model. Here is a screenshot of what your parsed data looks like…
  6. Choose your algorithm and build models: At the top of the page you will have 4 choices for modeling. For this case pick GLM and begin at the H2O request GLM page.
  7. Y - What you want to predict: First pick your “Y”, which is the category/column that you want to predict. In this case we are using “IsArrDelayed” as our “Y” to predict whether the arrival of a flight will be delayed.
  8. X - What you want to ignore: Next pick the categories/columns you want H2O to ignore in its analysis, “X”. In this case we only want to keep what a traveler could know more than 24 hours in advance, when buying a ticket, so they can ask “is my flight going to be delayed?”. We want to ignore “ActualElapsedTime”, “ArrDelay”, “DepDelay”, “Canceled”, “Diverted”, “IsDepDelayed”. Why? Well, “ActualElapsedTime”, “Canceled”, and “Diverted” are things we would not know in advance. And “ArrDelay”, “DepDelay”, and “IsDepDelayed” give the answer away: with the data in these categories H2O can make a nearly exact prediction (AUC = 0.99999). Math is smart! “CancelationCode” is ignored by H2O by default since it has a lot of NAs.
  9. Pick your family: In this case we are picking Binomial since our answer is 0 or 1, delayed or on time.
  10. Link: In this example we pick logit (logit regression), the usual link for the binomial family, since our answer is a boolean 0 or 1.
  11. Alpha and Lambda: H2O defaults to a small Alpha (0.5) and a large Lambda. This is so that the coefficients of bad predictors go to zero and you can find the subset of columns that are your best predictors. Basically, it makes your model generalize better.
  12. Folds: How many folds of cross validation? Run 10, since it’s not a big data set and the performance is reasonable. For a small data set you can go as far as setting the number of folds to the number of rows to get more cross validation; for large data sets, going beyond 10 or so typically isn’t useful.
  13. Run GLM by clicking Submit. Below is a screenshot of what the output looks like. Review your equation and your coefficients.
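The choices in steps 7, 8, and 11 can be sketched in a few lines of plain Python. This is an illustration only, not H2O's API: the function names are made up, and the penalty formula is the common elastic-net form, lambda * (alpha * L1 + (1 - alpha) / 2 * L2), which is assumed here rather than taken from the post. Column names are spelled as they appear in the steps above.

```python
RESPONSE = "IsArrDelayed"
IGNORED = ["ActualElapsedTime", "ArrDelay", "DepDelay",
           "Canceled", "Diverted", "IsDepDelayed"]


def select_predictors(all_columns, response, ignored):
    """Keep every column that is neither the response nor ignored."""
    excluded = set(ignored) | {response}
    return [column for column in all_columns if column not in excluded]


def elastic_net_penalty(coefficients, alpha, lambda_):
    """Assumed elastic-net form: alpha mixes the L1 and L2 terms,
    lambda_ sets the overall strength of the penalty."""
    l1 = sum(abs(b) for b in coefficients)
    l2 = sum(b * b for b in coefficients)
    return lambda_ * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)


# Hypothetical subset of the header, for illustration.
columns = ["Year", "Month", "Distance", "ArrDelay", "DepDelay", "IsArrDelayed"]
print(select_predictors(columns, RESPONSE, IGNORED))
print(elastic_net_penalty([1.0, -2.0], alpha=0.5, lambda_=0.1))
```

The L1 part of the penalty is what pushes the coefficients of weak predictors to exactly zero, while a larger lambda applies more of that pressure.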

Scroll down to review your AUC, confusion matrix, and your 10-fold cross validation.
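If you want to see where those two numbers come from, here is a minimal sketch that computes a confusion matrix and an AUC by hand. It assumes a 0.5 decision threshold and uses the pairwise definition of AUC (the share of delayed/on-time pairs in which the delayed flight gets the higher score, with ties counted as half). H2O's own implementation may differ, and the labels and scores below are invented.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Count true/false positives and negatives at a given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["tn" if actual == 0 else "fn"] += 1
    return counts


def auc(y_true, y_prob):
    """Pairwise AUC: chance a random positive outranks a random negative."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = delayed) and predicted probabilities.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
print(confusion_matrix(labels, scores))
print(auc(labels, scores))
```

An AUC of 0.5 means the model ranks flights no better than chance, and 1.0 means every delayed flight is ranked above every on-time one.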

Review Your Model
You can see how long it took H2O to run your model, your coefficients, and your AUC. Once you study the coefficients you can see which columns drive the prediction. To judge which ones matter, look at the “normalized coefficients”, which are scaled to the same level. Otherwise columns with large values (such as 5000 miles traveled) look “more important” than columns with small values (such as 0 or 1 for SFO or not-SFO). In this case, for example, Dest.SFO has a coefficient of ~0.282930, which is quite high and means that when your destination is SFO your chances of a delayed arrival are higher. If you live in the SF Bay Area you are aware of this, as I am sure you have experienced multiple weather delays due to fog while landing at SFO. The same goes for Dest.SEA (~0.38255) and Origin.ATL (~0.5666). I personally was delayed multiple times flying out of Atlanta (ATL).
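One way to see why scaling matters: a raw coefficient is expressed per unit of its column, so a per-mile coefficient and a coefficient on a 0/1 flag are not directly comparable. A common way to put them on one scale is to multiply each raw coefficient by its column's standard deviation. The sketch below assumes that convention; H2O's exact normalization may differ, and all numbers in it are invented.

```python
import statistics


def standardized_coefficient(raw_coefficient, column_values):
    """Effect of a one-standard-deviation change instead of a one-unit change."""
    return raw_coefficient * statistics.pstdev(column_values)


# Hypothetical columns and raw coefficients, for illustration only.
distance_miles = [100, 500, 1000, 2500, 5000]
is_dest_sfo = [0, 0, 1, 0, 1]

raw_distance = 0.0002   # per mile
raw_sfo = 0.3           # per 0 -> 1 switch

std_distance = standardized_coefficient(raw_distance, distance_miles)
std_sfo = standardized_coefficient(raw_sfo, is_dest_sfo)
print(std_distance, std_sfo)
```

With these made-up numbers the raw coefficients and the scaled ones rank the two columns differently, which is why the comparison should be made on the scaled values.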
