Ankit Sinha, Experian - Ascend Analytical Sandbox - #H2OWorld
This session was recorded in NYC on October 22nd, 2019.
Slides from the session can be viewed here: https://www.slideshare.net/0xdata/ankit-sinha-experian-ascend-analytical-sandbox-h2oworld
How businesses can recession proof themselves by using the power of the Ascend Analytical Sandbox; and how Experian is leveraging its vast data to make sure every borrower is presented in the best light in front of the lenders.
Bio: Ankit is the Product & Innovation Expert at Experian, leading the overall roadmap for the Ascend Analytical Sandbox, a one-stop shop for insights, model development, and results measurement.
Hi, good afternoon, everyone. Thanks to H2O for having me here. Just a brief background: I’m a product manager at Experian, and I manage the overall roadmap for the Ascend Analytical Sandbox. We’ll be talking about that in a bit more detail today.
I am really glad about this opportunity, because before Experian I was working for PNC Bank in the sales analytics team, creating wallet share, market share, win rate, and penetration rate analyses for retail lending assets, and Sandbox was one of the tools that we used. So I can really speak to what it felt like to use Sandbox.
And before working for PNC, I was working for another financial services institution in the credit risk strategy team, building line increase, account management, account initiation, and acquisition strategies. I did not use Sandbox for that, so I can offer a bit of contrast on how much easier my life would have been if I had used Sandbox, for example.
So going to the next slide, let’s talk about how we can build a more resilient and robust model. Shi, in the morning session, touched upon the AI techniques that H2O has. But one of the key elements that H2O doesn’t have is the data behind them. In terms of building a resilient model, we can have interpretability and machine learning capabilities, but the key element is data. How good the data is essentially determines how robust or resilient the model is going to be.
So being at Experian, one of the things I can really take pride in is data quality, and I’ll touch upon that in later slides. Using robust, resilient, well-rounded data, how can we build more resilient models?
And to touch upon that, the resiliency of a model, or of any strategy for that matter, is really tested in times of stress. Stress could be due to an upcoming recession, or any scenario in which the resiliency of the model will be tested. In this case, it could be a consumer losing their job and how that impacts their credit. In that scenario, we want a resilient model that is predictive enough to speak to how long it will take for consumers to get back to paying their debt on time, or to get back on track essentially.
Speaking of that, to touch upon the recent articles I’ve been reading: depending on which journal you read, there is a 25 to 35% probability of recession in the next 12 months. So, as I said before, it’s really important for us to have resilient models built on data which is solid and has the capacity to sustain these shocks, and which, even in times of stress, can help us predict a good outcome.
So with that being said, I would like to ask this question: “Are we well equipped for the downturn?” This is a question almost everyone has right now, no matter what kind of business we are in. It could be a financial institution, a marketing company, a retail company, whatnot. We are always worried about the downturn.
So now getting into the specifics of what I was going to talk about today, the Analytical Sandbox. What it really is, is a tool which gives people access to about 18 to 20 years’ worth of monthly snapshots of data. It contains the full file of about 220-plus million consumers in the United States, and it has trade line level information. It has the customized attributes that Experian offers, tailored depending on which industry we’re talking about: they could be related to your car debt, or to your mortgage. We can call them features or attributes, depending on what language we’re using.
The other thing I would like to touch upon, beyond the fact that we have trade line information so we can see exactly how many trade lines a person has, number of mortgages, number of auto loans and so on, and the other thing I really take pride in mentioning, is Clarity. Clarity was one of the bureaus that Experian bought, and it has information on alternative trades. By alternative trades, I mean payday loans, title loans, anything which can be a red flag, so to speak. It currently has about 60 million consumers on that file, of which about five to six million are a purely incremental population; but there are also consumers who have a prime score and who still appear in the Clarity bureau.
So my point being, this is a golden opportunity for swap-outs, so to speak. You can have a 740-plus customer who has a Clarity trade, and that could be a red flag. Maybe that person recently lost their job and has a trade line for a payday loan; I don’t know, but I would be a little cautious. Also, on the other hand, we have access to Experian Boost, which takes into account utility payments, rent payments, anything which is not captured by a regular credit bureau. All that data is also part of the Sandbox.
Now, the critical factor there is that for any trade line, the customer essentially agrees to share that data. All ACH transactions related to utility payments or rent payments are captured by Boost, and it feeds into Vantage or FICO. So there is not a separate score that consumers or institutions need to buy; it all feeds into Vantage or FICO. And again, all the data inside is depersonalized, with no PII, to support FCRA compliance.
Now on top of that, the critical piece, which we have been discussing since this morning, is the tools. We have the data; now what tools can we use to identify or analyze it? We have tools like Python, R, SAS, SAS Viya, and H2O.
One of the key things I will say is that SAS Viya is a new tool that we offer in the Experian Sandbox. The reason it’s worth mentioning is that regular SAS uses a single core or a single node to process the data, whereas SAS Viya runs a multi-threaded process in the backend. Based on some of the test cases that we ran, it can be about 20 times faster than regular SAS, especially when you’re talking about massive chunks of data, worth millions and millions of rows. So that’s a critical point I would like to mention here.
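The speedup from a multi-threaded engine comes from fanning the same per-chunk work out across workers instead of running chunks one after another. As a minimal sketch, in plain Python with hypothetical chunked delinquency flags (not Experian’s or SAS’s actual engine), the idea looks like this:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_bad_rate(rows):
    # Per-chunk work: fraction of records flagged delinquent (1 = bad).
    return sum(rows) / len(rows)

def bad_rate_serial(chunks):
    # Single-worker baseline: chunks are processed one after another.
    rates = [chunk_bad_rate(c) for c in chunks]
    return sum(rates) / len(rates)

def bad_rate_parallel(chunks, workers=4):
    # Fan the same per-chunk work out across a pool of workers; a real
    # multi-threaded engine would run these on separate cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rates = list(pool.map(chunk_bad_rate, chunks))
    return sum(rates) / len(rates)
```

Both functions return the same answer; the parallel version simply overlaps the per-chunk work, which is where a large speedup on million-row files would come from.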
And last but not least, the support. We at Experian have a very dedicated support team which is highly trained in understanding the data the client is using and their use case scenarios, and in how we can help them scale up, adopt tools they may not be using yet, or build scalable models.
So one of the success stories that we have is from OneMain Financial. They recently started using the Experian Sandbox for their model development, and this is one of the quotes from the head of their model development. One of their use cases was to analyze loan loss performance. They did some reject inferencing; in fact, we started calling it bureau inferencing, because it’s not just for the rejected or declined population, so to speak. It could be for your entire customer base: you can see what other trade lines they have.
So even before you approve or decline, even before that, if you are sending a direct marketing letter to them, or you’re marketing to them, you know beforehand what kind of trades they have. Could they be in the market for an auto loan? Depending on their behavior, depending on your propensity model, and so on.
So that’s something that is very interesting for us. And also we are glad that we can help our clients achieve that.
This is just an overview of what I just talked about. One thing I would like to highlight on this slide is the visual; I think that’s one of its most important aspects. Basically, the Ascend Analytical Sandbox is an interface, and I’ll show what it looks like in just a second. Through it, you can access all the Experian data that is out there, whether for credit, commercial, or business. And you can load your own data into the Analytical Sandbox. Again, it will be depersonalized. You can link it and you’ll have access to 220-plus million consumers and all their trade lines, again going back 18 to 20 years, and you can identify your own trade lines within that.
So I don’t need to say what the power of that can be in terms of building models or strategies, or even just for marketing purposes. And on top of that, you have access to these tools: SAS, SAS Viya, JupyterHub, H2O, Tableau, and all that good stuff.
So just a summary slide of key advantages. The one thing I would like to highlight out of all the points on the slide is speed to market. As I said before, I used to work for a financial institution in the credit risk strategy team, and one of the key things we liked to look at was reject inferencing: analyzing the population that we did not approve, and what is happening to that population. Between the time it took to get approval from executive management, to get the budget, to go to the bureau to actually place an order, and for the bureau to work on the file and give it back to us, we’re talking about 50 to 60 days right there. But if you’re talking about Sandbox, all that data is already there.

So right after the meeting, you can go to your computer and you have access to the data. Plus, the other important thing is that depending on which bureau you go to, the data is going to be slightly different. But your analysts already have access to this data and are using it on a day-to-day basis, so they’re very familiar with it. Even from that perspective, the lead time reduces significantly.
So before we wrap up here, I would like to quickly jump into how the interface actually looks. Basically, you just go to Sandbox on Experian.com. It’s two-factor authentication: I log in, it sends a push to my cell phone, I say, “Yes,” and that’s it.
We are using Citrix to log on, which is very familiar; it’s just like any other Windows machine. And you have access to all the tools I was mentioning, right here.
One thing I just wanted to show was one of the models that one of our data scientists at Experian built on JupyterHub, using Python with PySpark. If I can open that. This is just an interface of how it looks: it uses all the Premier attributes, spits out the correlation matrix below in a heat map format, and you get the most important Premier attributes out there.
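The correlation-matrix step that notebook performs can be sketched in a few lines of plain Python. The column names here are hypothetical stand-ins, not actual Experian Premier Attributes:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length columns.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def corr_matrix(columns):
    # Pairwise correlations between attribute columns; this is the matrix
    # the notebook renders as a heat map.
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical attribute columns, one value per consumer.
attrs = {
    "num_trades":      [2, 4, 6, 8, 10],
    "util_rate":       [0.1, 0.2, 0.3, 0.4, 0.5],
    "times_30dpd_12m": [4, 3, 2, 1, 0],
}
m = corr_matrix(attrs)
```

In the real notebook the heavy lifting would be done by PySpark and the matrix plotted as a heat map; attributes can then be ranked by the strength of their correlation with the target.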
Now, going back to the slide I was talking about, I just wanted to highlight how easy it is to analyze. In this case, the orange line on the chart on the right is the modeled bad rate, and the blue bars are the actual bad rate. We can see that the modeled bad rate is underfitting the actual bad rate, which is kind of apparent from the confusion matrix: the true negative count is actually lower than the false positive count.
So we understand that, based on the predictive attributes we selected, something is off. Can we go back to the model, play around with the attributes we have, maybe weight a few attributes higher than they’re already weighted, and change that? All of this can be done inside Sandbox. This is a preview in Tableau. All the data is right there, and so are the tools you use to analyze it. You can spit out the output and analyze it in Tableau. If anything needs to be changed, you can change the weight of an attribute and play around with it as much as you want.
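The check described above, comparing the modeled bad rate against the actual one via the confusion matrix, can be sketched with toy labels (hypothetical data, not a real portfolio; 1 means a “bad” account):

```python
def confusion(actual, predicted):
    # Confusion-matrix counts for binary labels: (tn, fp, fn, tp).
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    return tn, fp, fn, tp

def bad_rates(actual, predicted):
    # Actual vs. modeled bad rate; a modeled rate well below the actual
    # one is the underfit the chart makes visible.
    return sum(actual) / len(actual), sum(predicted) / len(predicted)

# Toy example: the model misses two of the three true bads.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 1]
tn, fp, fn, tp = confusion(actual, predicted)
actual_rate, modeled_rate = bad_rates(actual, predicted)
```

When the modeled rate trails the actual rate like this, re-weighting or re-selecting attributes in the Sandbox and re-scoring is the natural next step.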
So this was just an overview of one of the use cases. Again, as I said, there can be several more, but yeah, we used a random forest classifier for this one. That’s all I had. If there are any questions, I’d be happy to take them.
Yeah. My question is more towards business data governance.
On this model, the sample you have here, the false positives and negatives: how do you prove those are right in a business context? And once you have trust in the model, regulatory compliance may ask you for the lineage, how you produced it, if there is an issue. So how are you going to govern that model?
So if I understand the question correctly, everything that we have in Sandbox is FCRA compliant, no matter which attribute you’re using. This model, I think, was using the attribute “number of times 30 days past due in the last 12 months.” That attribute itself is FCRA compliant.
Yeah. The data behind is compliant.
When you produce the data for final consumption, it also needs to be compliant in terms of quality. So how do we know that what you produce is what the business is going to consume, “Hey, this is how you’re going to go on this”?
So I suppose that will be based on the output that you have. Say we are looking at less-than-90-days-past-due delinquency on the card portfolio. If it’s matching closely to what the actual bad rate was, then you could say, “Okay, it’s working as it’s supposed to,” setting aside the compliance aspect of the attribute itself. If it’s not behaving like it should, then again, you can go back and see if there was a sampling error, or a skewness in the data that we can look at.
So data quality after the model is built is obviously going to be something we’ll have to analyze, or the analysts at the institutions will have to analyze, and based on that they can take action.
But what we can provide from our end is everything they need to make sure that the data is reliable, the model is resilient, and it is producing the results it’s supposed to produce. We are providing all the ingredients for that. Your financial institution can then slice and dice it however they want and use it however they want. We have our support team at Experian who can guide you through that process, but ultimately you are the consumer of the data.
We have time for one question.
When you are talking about recession, the technology is important, but probably more important is the prediction.
Can you repeat, what is your prediction [inaudible] recession, please?
The prediction of this model?
Oh, the prediction of the session.
Oh, [inaudible] recession will happen or not.
Oh, the recession. Oh, I’m sorry, I do not know. Well, I’m just a reader of all the journals out there; I’m not here to predict the recession. I hope it doesn’t happen, but yeah, that’s where I lean.
But yeah, all we can do is prepare ourselves for the shocks if they ever come. I’m in no way able to predict what or when that will happen, if it does happen. All we can do is make the models as resilient as possible, and that’s getting more and more possible with tools like H2O. So let’s combine data and the power of analytics, and hopefully we’ll be prepared for whatever shocks are to come.
Thanks very much.