Using Machine Learning to Predict Employee Turnover
In this talk, we discuss how we implemented H2O and LIME to predict and explain employee turnover on the IBM Watson HR Employee Attrition dataset. We use H2O’s new automated machine learning algorithm to improve on the accuracy of IBM Watson. We use LIME to produce feature importance and ultimately explain the black-box model produced by H2O.
- What is Business Science
- Structured and Systematic Approach With Customers
- Systematic Process Applied to Any Problem
- How Business Science Helps Data Scientists
- HR Analytics
- Cost of Employee Turnover
- Modeling with H2O
- Business Implications
- Running Lime
- Real World: Solves Real Problems
- Adaptive Solution
- Business Science University
- Q/A with Matt
Matt Dancho, Founder, Business Science
Read the Full Transcript
What is Business Science
My name's Matt Dancho. I am the founder of Business Science, and here's a little bit of information about me. We'll actually have that at the end of the presentation, so you can take a picture of it then. So what is Business Science? Besides being the company that I created, it primarily does three different things. First off, we are a business consultancy. What we take pride in, and what we really help organizations do, is turn their data into insights by utilizing algorithms and presenting them to, we'll say, non-technical people in a very user-friendly way, so that they can take advantage of data science and machine learning without having to know all of the ins and outs of it. We primarily do that with web applications that we build and codify to help them make decisions.
The second thing, we are educators. This really applies more to the data scientists. We really take a lot of pride in educating. We have a blog on our website that is very highly trafficked. There's a lot of valuable information, and I'll talk more about that. We also have open source software and some courses I'll talk about. And then, the last thing is we are community driven. We're powered by R. We're powered by data science. We also are getting into Python as well and various other data science tools. But what we really enjoy doing is giving back to the community in the form of software, in the form of education. Okay, to explain this slide a little bit, let me give you some insights into some of the stuff that we see when we consult with organizations.
Structured and Systematic Approach With Customers
There's really three things. People are good decision makers, that's the first thing. Second thing, people are good decision makers when they have data. The third thing, people are good decision makers when they are unbiased. What am I talking about there with bias? Well, the human brain, as you can imagine, is very skilled when it has the information available to it. But in certain circumstances, people can default to their emotions, their biases. They may think that they know a problem, but really, they're just going off of their gut instinct and not necessarily the truth, which is the facts that data can provide. What we've found is that when you provide data, information, and insights to the decision makers within the organizations, you can really get great decision making. We do this through a structured and systematic approach.
It starts with this cycle here. The first thing that we do is we learn their system, and that's really just examining the business processes and understanding what they're going through within their data. The second thing is we model that data. Once we feel that we've gotten to a good model, we then convert that into something that's useful to the end user, meaning the decision maker. Just to give you an idea, that could be like a web application, that could be something as simple as just, you know, Hey, this is true false. This is the answer that you're looking for. Whatever method or mechanism, it's typically like a web based application. That's how we provide those insights to the decision makers. Now, the reason it's a cycle is because you're never really done. And this is when we think of AI in terms of working with business processes and business problems. What we're really talking about is this cycle of continual learning.
Systematic Process Applied to Any Problem
Whenever you have a problem that you apply machine learning to, things can happen, things change. You have to have an adaptive solution and really be able to have a robust process with a feedback loop that is able to adjust when things change. What we found is that this systematic process can be applied to any problem. It doesn't matter if it's an HR problem, which is what we'll talk about today. It doesn't matter if it's a manufacturing problem. Supply chain, logistics, fraud detection, any of these types of problems can be solved through the robust set of tools that we have out there with supervised or unsupervised learning. It's really just about being able to apply machine learning to these processes.
How Business Science Helps Data Scientists
Next slide. That was more on how we work with businesses. I'd imagine that most of you in this room today are probably data scientists. What I want to share with you is how we help data scientists as well. And what I'm talking about is open source software and also education. First, open source software. We have four different packages. These are R programming packages that we've developed, and what they really allow us to do is to give back to the community in the form of something that's useful. The first one is tidyquant. tidyquant is a financial package that's used around the world by companies all over in finance. And what we found is it's really being adopted heavily, especially with the adoption of the Tidyverse, which is something that is very popular in the R programming language. The second one, timetk, that one's for time series machine learning. The third one, sweep, is for tidying the forecast workflow. And the fourth one is our newest package, which is tibbletime. If you're familiar with the dplyr package in the Tidyverse, it's like dplyr for time series. The general theme is time series and finance. But we get into a lot more than that too.
All right, courses. I'm going to save this till the end. I've actually got a big reveal that I'm very excited about. We'll talk about that a little bit more. Third one. If you take away anything from this talk, go to our website. You can learn from our blog. One of the biggest ways that we give back to data scientists is through the blog on our website. There's a lot of good posts, good information out there. Anywhere from beginner to intermediate to expert, it doesn't matter. There's all sorts of useful information.
That's Business Science. What we're here to talk about today is HR analytics: using machine learning to predict employee turnover, which is a huge problem. And we're going to show how we solve that with two very cool, very cutting edge programming packages. The first one is H2O with their automated machine learning. And then the second one is LIME, which stands for local interpretable model agnostic. Can't remember what the E stands for. Explanations, there we go. Next one.
Employee attrition. Three reasons that you should listen to this talk. The first one, employee attrition is a huge problem. Think about it: for a company, an organization, employees are its biggest asset. Second reason, there are new techniques out there now, and as I had mentioned, H2O for predicting, LIME for explaining, very cutting edge, very useful. I'll get into more of that in a minute. Third reason, we've got a framework for machine learning in business applications. The cool thing here is I'm actually going to be telling you what I do with consulting clients and giving you the secret sauce.
A fourth one is not necessarily that our article is popular, but more that the article we have out there on our blog is valued. We can tell this because it's been shared. This is actually an old photo. It says 755 LinkedIn shares. It's actually upwards of like 1,400. But really, I want to give back in the form of code that you guys can follow along with. The next one, all you need to do is just Google "Predict Employee Turnover." That will pop up this article, and you'll be able to follow along right with this talk and actually see all of the code that we use.
Getting into HR, one of the things that we need to understand is that employees are a huge resource. In fact, they're the most important resource that a company has. I think this quote from Bill Gates really says it all. He says that "you take away our top 20 employees and overnight we become a mediocre company". What is he saying there? Well, first thing, when you dissect it, the top 20 employees. There is a distribution of employees that are so radically important to that company that if you lose them, you lose your competitive advantage. Another way to think about it is that when people leave a company, there's a huge cost. Next slide.
Cost of Employee Turnover
What are the costs of turnover? Companies face huge costs when employees turn over. Some of these are tangible, but the most important ones are intangible. Those consist of things like new product ideas, customer relationships, project management, engineering talent. Things like that, which your organization really needs to survive and to prosper, are lost when a productive employee quits. The good news, we've got two new techniques that are out there as machine learning is evolving. We've got H2O, and Aaron just gave a tremendous talk on the automated machine learning. That's what I'm going to be talking about too, how we actually used it for predicting employee turnover. Just a few points about it: you can get insanely accurate models just by doing very quick and easy machine learning, without doing a whole lot of feature engineering and all the dirty work that goes into a lot of data science. It's really cool, and I think it's going to be the future. And I think we're going to see a lot of transition into this type of approach because it really helps business people concentrate on making decisions rather than worrying about all the features and all the data science behind it.
Not that the data science isn't important, but decision making is a very important aspect. The second one is LIME. The great thing about H2O is it gives us a great model that's highly accurate, and this is not necessarily specific to H2O, but the downside with models like deep learning and stacked ensembles is that they're what's called black box. You need special tools to be able to explain what's going on under the hood, and that's what LIME is used for. A little bit about this problem, and there's actually two slides. One is that this data set came from IBM. You can imagine that most companies are not just going to hand out their HR data. Usually, they consider that proprietary. It's very difficult to go through and get that data, so we did the next best thing.
We actually got a data set from IBM, which is a consulting firm as well. The nice thing about what they're doing is they're using their experience to create this data set. And it's a good data set, very representative of real world data, both in our experience and I'm sure in IBM's experience. It was so good, they even used it for a case study where they got about 85% accuracy on the data. I just wanted to let you know that the data set we use is publicly available, but it is artificially generated, and IBM analyzed it. They got 85.6% accuracy on it. A little bit more about that feature set. The data consists of 35 features. The first one, which is our target, is attrition. It's just a binary classification problem, yes or no, is that employee churning. There's also things in there like age, business travel, what their daily rate, their wage, is, education level, and so on. There's also 1,470 observations. Each observation is an employee, so 1,470 employees in this data set.
Modeling with H2O
All right, the fun stuff, modeling with H2O. Let me just say that this thing was insanely accurate right out of the gate. This is literally all of the code that we had to put together to be able to get highly accurate results. I really want to focus on the bottom there. All the code at the top is doing is just splitting up the data set. We took the raw data and split it into training, test, and validation sets. And then at the bottom, we use this really neat function, h2o.automl. What that does is it runs all sorts of different models, deep learning, ensembles, GBMs, and under the hood, it even does a lot of other stuff that you don't even think about, like pre-processing steps. It's doing a lot of that data munging that we would normally have to go through. Literally, this is all of the code to make the predictions. 88% accuracy, that's what we got, literally from that code on the previous slide, 88%. Our goals were, number one, to be competitive. We wanted to beat IBM, so we accomplished that. Important for our goal: 87.6% accuracy, highly accurate, 2 percentage points above IBM's. More important is recall, 62% recall from this model, which is really important for the business case, and I'll talk about that more in a minute. The last one we'll mostly skip over. It's the null error rate, 79%. Just so you guys understand, it gives you a feel for the data. If you just pick no, you would get 79% accuracy. We're getting 9 percentage points better accuracy from our model.
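The slide code is not reproduced in this transcript, and the talk's own code is written in R. As a rough, library-free Python sketch of the splitting step described here, the following cuts 1,470 row indices into training, validation, and test sets. The 70/15/15 proportions, the seed, and the function name `three_way_split` are assumptions for illustration, not the settings used in the talk.

```python
import random

def three_way_split(rows, train_pct=70, valid_pct=15, seed=1234):
    """Shuffle rows, then cut them into train / validation / test lists.

    Percentages are integers so the cut points use exact integer arithmetic.
    Whatever is left after train and validation becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# 1,470 employees, identified here only by row index.
employee_ids = range(1470)
train, valid, test = three_way_split(employee_ids)
print(len(train), len(valid), len(test))
```

In a real workflow the training and validation sets would go to the automated machine learning step, and the test set would be held back for the final accuracy and recall numbers.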
Business implications. What this means is that an organization that implements a model like this can really gain some valuable insights into their employees that can help them proactively implement steps to prevent turnover. Recall is what we want to focus in on, and what recall is, is our ability to classify those that are at risk of turnover. We'd rather lean on the side of classifying them than missing them, at the expense of accuracy. For recall of 62%, just think of it this way: 62 out of 100 times, an employee that is going to quit would be accurately identified as quitting. Precision is the other one, and I'm not going to belabor it just for the sake of time. You don't want to see precision drop down to zero, but it's not nearly as important for the business case as recall is.
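To make the terms above concrete, here is a minimal Python sketch that computes accuracy, recall, precision, and the accuracy of always answering "no" from the four cells of a confusion matrix. The counts are invented purely for illustration and are not the talk's results; the function name is hypothetical as well.

```python
def classification_metrics(tp, fp, fn, tn):
    """Summarise a binary confusion matrix where 'positive' means 'will quit'."""
    total = tp + fp + fn + tn
    return {
        # Share of all employees classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of actual leavers the model catches.
        "recall": tp / (tp + fn),
        # Share of flagged employees who actually leave.
        "precision": tp / (tp + fp),
        # Accuracy of a "model" that answers "no" for everyone.
        "always_no_accuracy": (fp + tn) / total,
    }

# Hypothetical counts, invented for illustration (not the talk's results).
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts, the model beats the always-no baseline on accuracy, and recall shows how many of the 50 actual leavers it catches, which is the number the business case cares about most.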
H2O AutoML, amazing, great model. It's highly accurate, but as you probably know from the business case, if you've ever dealt with a company, the accuracy is probably the least of their focus. What they want to know is what they can do, what levers they have to adjust things in order to prevent turnover. We've got a great model, so how do we prevent it? And that's what LIME is for. Local interpretable model-agnostic explanations. Basically, what LIME is doing underneath the hood is running a lot of permutations of linear models. The theory behind it is you've got a black box model. It might be a deep learning model or a stacked ensemble, and, you know, your model is like this, but on the local level, it can be approximated by linear models, which are more explainable. It does that thousands and thousands of times; by default, 5,000. The beauty of this is that on a local level, we can actually understand why that model, that deep learning model, is saying yes or no for a specific observation, whether that employee's going to turn over. We'll see a little bit more about that, but for the business implication, it couldn't be a better situation.
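The local-approximation idea described above can be sketched in plain Python. This is not the lime package's implementation, only a minimal illustration under stated assumptions: the black box is a made-up two-feature function, perturbations are Gaussian, the proximity weight is a simple exponential kernel, and the surrogate is a weighted least squares line. All names here (`black_box`, `local_linear_fit`) are hypothetical; the 5,000 samples echo the default permutation count mentioned in the talk.

```python
import math
import random

def black_box(x1, x2):
    # Stand-in for an opaque model: nonlinear in x1, linear in x2.
    return x1 ** 2 - 2.0 * x2

def local_linear_fit(model, point, n_samples=5000, kernel_width=0.75, seed=42):
    """Fit a proximity-weighted linear surrogate to `model` around `point`."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        d1, d2 = rng.gauss(0, 1), rng.gauss(0, 1)  # perturb each feature
        y = model(point[0] + d1, point[1] + d2)
        # Samples closer to the point of interest get larger weights.
        w = math.exp(-(d1 * d1 + d2 * d2) / kernel_width ** 2)
        rows.append((d1, d2, y, w))

    total_w = sum(w for _, _, _, w in rows)
    m1 = sum(w * d1 for d1, _, _, w in rows) / total_w
    m2 = sum(w * d2 for _, d2, _, w in rows) / total_w
    my = sum(w * y for _, _, y, w in rows) / total_w

    # Weighted sums of squares and cross-products around the weighted means.
    s11 = sum(w * (d1 - m1) ** 2 for d1, _, _, w in rows)
    s22 = sum(w * (d2 - m2) ** 2 for _, d2, _, w in rows)
    s12 = sum(w * (d1 - m1) * (d2 - m2) for d1, d2, _, w in rows)
    s1y = sum(w * (d1 - m1) * (y - my) for d1, _, y, w in rows)
    s2y = sum(w * (d2 - m2) * (y - my) for _, d2, y, w in rows)

    # Solve the 2x2 normal equations for the two local slopes.
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2

b1, b2 = local_linear_fit(black_box, point=(3.0, 1.0))
print(f"local weight for x1: {b1:.2f}, for x2: {b2:.2f}")
```

Around the point (3, 1) the true slopes of this toy function are 6 and -2, so the fitted weights should land close to those values. The sign and size of each weight is what an explanation reports for that one observation.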
One more thing about LIME. It's now integrated with H2O. I worked with Thomas Pedersen, Thomas Lin Pedersen, who ported the LIME concept over from Python into R, so he's the maintainer of the package. And I worked with him to get H2O integrated. Now, if you actually go through the article online, you'll see a bunch of extra steps that we had to take, but we were really able to cut out a bunch of those extra functions and just run LIME seamlessly. Here's how you use LIME. The first thing you do is you create an explainer object. And basically, all this is doing is creating a recipe for the next step.
The next step here is really where you use that explainer. That explainer from the previous slide, all it did was take your training set and your model and create that recipe. In this step, what we're doing is actually taking a subset of our observations that we're interested in, and then we're running it with the explainer object. A few arguments I want to point out. One is n_features = 4. That's going to produce our top four features for whatever cases we provide to it. And then the last one is the kernel_width. You adjust that in order to tune your LIME response to make sure you're getting a good fit from your model. That produces an explanation. Then there's this cool function called plot_features to visualize the explanation.
You can actually just go through and see from the visualization, we've got four cases here. Each case has the top four features listed for that model, explaining why it picked yes or no for attrition. And then, the last thing that we can do is understand it globally, and compare it to what's going on in our model. What we saw when we did this process was that two features jumped out. We looked at 10 observations, and consistently the features that showed up in those observations, overtime and job role, tended to be high predictors of attrition. Once you start to dive into this, into those features that tend to be more important, you get to see, okay, this is why things are happening within that model. It becomes much more explainable, and that's great for the business case because that's exactly what they need to know to be able to make their decisions.
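The step from per-case explanations to a global picture can be illustrated with a short Python sketch: keep the top four features per case by absolute weight, then count how often each feature shows up across cases. The weights below are invented for illustration and do not come from the talk, the column-style feature names are assumed spellings, and `top_features` is a hypothetical helper rather than part of any package.

```python
from collections import Counter

def top_features(weights, n_features=4):
    """Names of the n_features largest weights by absolute value."""
    ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
    return ranked[:n_features]

# Hypothetical per-case explanation weights (invented, not from the talk).
cases = [
    {"OverTime": 0.31, "JobRole": 0.22, "Age": -0.05, "DailyRate": 0.02, "Education": 0.01},
    {"OverTime": -0.28, "JobRole": -0.19, "BusinessTravel": 0.07, "DailyRate": 0.03, "Age": -0.01},
    {"JobRole": 0.25, "OverTime": 0.21, "Education": -0.06, "BusinessTravel": 0.04, "Age": 0.02},
]

# Count how many cases each feature appears in among the top four.
counts = Counter(name for case in cases for name in top_features(case))
print(counts.most_common())
```

Features that keep reappearing across cases, as overtime and job role do in this toy data, are the ones worth comparing against the model's global behavior.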
Real World: Solves Real Problems
We talked a lot about a hypothetical example. In the real world, though, does this work? A Fortune 500 firm came to us wanting to predict executive potential. We implemented a slightly different, more sophisticated process, and using a similar approach we were able to identify 16 employees that the model predicted as having executive potential but that weren't currently being targeted.
So it does work. Conclusions: high accuracy from H2O, high explainability from LIME. And you now have a framework that you can actually apply with clients for high accuracy and explainability. We're done, right? No. Couple things here. How do we know that the model is right? We haven't back tested it. There's a time aspect of it that we need to consider. It's a very cross-sectional analysis, and the only thing that's certain is that things will change over time. That's where the adaptive solution comes in. We build AI into the systems. These are learning solutions with that feedback loop, so that we can continuously adapt as things change.
What about communication? This is a picture of an app. We actually have some updated apps on our website. These are really examples of what we build for clients to be able to identify whether or not, you know, people are going to turn over or customers are going to be lost. It gives the clients a scorecard so they can actually interface with the machine learning without having to know all about data science.
Business Science University
Quickly, we're kind of running out of time. I want to save a little bit of time for question and answer, but if you want any of my information, take a picture now because we're about to roll a video, and I'm super excited about this. This is the first time publicly that I'm talking about our next phase in Business Science. We're getting ready to roll out a new educational platform. It's called Business Science University, and we have a video that I want to show you. It only takes a minute and a half, but I think you'll understand what the benefits are for people that are new coming into data science that want to learn how to apply data science for business.
university.business-science.io. I want you guys to understand that we are committed to educating data scientists as well. We aren't just all about, you know, driving our business on the client end. We really want to drive it on the data science end as well. We've been giving out a lot of good information on our blog, and we're about to take that to the next level. What we're rolling out now is university.business-science.io. That's our Business Science University. In early 2018, we're going to be rolling out two new courses with H2O involved. The first course is going to be how to use HR analytics to prevent employee turnover, and we're going to take it to the next level by adding additional features in there on how to build a recommender system. And then, the second course is going to be an extension. It's going to take the model that you build in the first course and actually turn it into a Shiny web app to distribute to a non-technical person, so they can have the advantage of machine learning without the technical details. Thank you very much.
Q/A with Matt
The first question: what if real data, instead of simulated data, are highly sparse? The automated machine learning algorithm, and Aaron's probably the best one to explain what it's all doing underneath the hood, but from what I understand, it really takes care of a lot of the typical or common data science issues that you might run into. Sparse data could be an issue. I think it has things to deal with that. However, it's going to depend on the data, so we'd have to take a look at it and really try it out.
Next one. Is there a way to use LIME on data with a lot of missing values without imputation? Missing values, you have to have a way to deal with them. Even the automated machine learning isn't going to work with missing values. You have to either fill them in with, like, negative 99 or impute them somehow. You have to have a way to deal with it. LIME, the same thing. LIME is based on your model, so you have to have a functional model in order to be able to run it. In my opinion, I don't think LIME would work well without imputed values.
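The two options mentioned in this answer, filling gaps with a sentinel such as -99 or imputing them, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented as `None`, the median is used as the imputation rule (the talk does not name one), and the column values and the `impute` helper are hypothetical.

```python
from statistics import median

def impute(values, strategy="median", sentinel=-99):
    """Replace None entries with a sentinel or the median of observed values."""
    observed = [v for v in values if v is not None]
    fill = sentinel if strategy == "sentinel" else median(observed)
    return [fill if v is None else v for v in values]

ages = [41, None, 37, 33, None, 27]  # hypothetical column with gaps
print(impute(ages, strategy="sentinel"))
print(impute(ages, strategy="median"))
```

Either way, the same rule has to be applied to the rows handed to the explainer as to the rows the model was trained on, so that both see the same representation of a gap.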
What kind of main features are most useful for your employee turnover prediction? In the employee turnover prediction model that we saw here, the most important variables, from the feature importance that LIME generated, were whether or not they were working overtime and what their job role was. In the interest of time, I kind of had to go through those slides pretty quickly. But with overtime, what we found was that the population of employees that were not working overtime were much, much less likely to turn over. If you think about it, the natural logical conclusion out of that is to try and help them out, have that manager help reduce the overtime or help with the work-life balance. That was the one issue. The other one was the sales representatives. The job role was very key, so sales representatives were turning over at like 40%, whereas managers and some of the other roles were only turning over at like 4%. It ended up being highly dependent on the job role. Which, if you think about it, is like a cohort within your data. It's a way to kind of group different people together, and certain groups turned over at much higher rates.
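The cohort comparison described in this answer amounts to a grouped attrition rate. A minimal Python sketch follows; the handful of records is invented for illustration and does not reproduce the data set or the 40% and 4% figures quoted above, and the field names and the `attrition_rate_by` helper are assumptions.

```python
from collections import defaultdict

def attrition_rate_by(records, key):
    """Share of employees with Attrition == 'Yes' within each group of `key`."""
    totals, leavers = defaultdict(int), defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        if row["Attrition"] == "Yes":
            leavers[row[key]] += 1
    return {group: leavers[group] / totals[group] for group in totals}

# Toy records, invented for illustration (not the IBM data set).
records = (
    [{"JobRole": "Sales Representative", "Attrition": "Yes"}] * 2
    + [{"JobRole": "Sales Representative", "Attrition": "No"}] * 3
    + [{"JobRole": "Manager", "Attrition": "Yes"}] * 1
    + [{"JobRole": "Manager", "Attrition": "No"}] * 9
)

rates = attrition_rate_by(records, "JobRole")
for role, rate in rates.items():
    print(f"{role}: {rate:.0%}")
```

Swapping the key for an overtime column would give the same kind of comparison for the other feature singled out in the answer.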
Any other questions? What is the real life accuracy for your employee prediction model? The accuracy of this model on the test data set, so this was on unseen data, was 87%, or excuse me, 88%. I would consider that the real life accuracy because it's on a holdout set, data unseen by the model. That's what I would consider the real life accuracy. Okay. All right. Thank you very much.