Building ML Models to Detect Malicious Behavior
This meetup was recorded in San Francisco on January 23rd, 2020.
In this talk, we explain the fundamental design and approach to looking at malicious behavior. Some of these approaches are used in identifying electronic fraud, attacks, and money laundering, and we walk through example features that follow from this fundamental idea. Additionally, we touch on designs for embedding the solution in your application so that you can be proactive (catch the act while it is happening) or reactive (catch it after). Finally, we also talk about design approaches that make models more effective.
Speaker's bio:
Ashrith Barthur is the security scientist designing anomaly detection algorithms at H2O.ai. He recently graduated from the Center for Education and Research in Information Assurance and Security (CERIAS) at Purdue University with a Ph.D. in information security. He specialized in anomaly detection on networks under the guidance of Dr. William S. Cleveland. He tries to break into anything that has an operating system, and sometimes into things that don't. He has been christened "The Only Human Network Packet Sniffer" by his advisors. When he is not working, he swims and bikes long distances.
Ashrith Barthur:
All right. Yeah, as Bruno briefly introduced me, I work as a security scientist, actually as a principal security scientist, at H2O. My primary job is basically to architect and build solutions, including models, that try to detect malicious behaviors across the internet. It could be anything to do with financial fraud, electronic fraud, networks, any of these things. The general idea I want to take you through today across my slides is: how do we go about building these out? What are the thought processes of actually doing it? What are the key components that we look for, and what is important for you to look at as well in case you are building models for malicious behavior?
Quick question. How many of you out there are data scientists or ML engineers who do this as a day-to-day job? Great, okay. Quite a few. How many of you work in the field of security? Ah wow, okay. Fair. That number usually is never bigger than two. I'm actually happy it's about five, so yeah.
Quick set of slides about H2O. How many of you actually know about H2O? Curious. Okay, so this seems pertinent. We are based down in the South Bay out of [inaudible]. We've got a good amount of funding and we're about 200-plus people in the company, with a good number of [inaudible] out there who help us build models. The company is made up of a lot of engineers and scientists who work in many different areas. We are primarily a machine learning platform company, and we build quite a set of tools that help people build machine learning models for many different use cases. I primarily work in the security solutions group.
Now, we've got offices across the world, a good part of the world, let's put it that way. In case anyone's interested in joining us, please contact Luna or me. We'd be happy to help you out.
A set of our clients, how good we are in the community, that should be good. We primarily have four different products that the company puts out. Driverless AI is our current flagship product, which automates a lot of the machine learning processes that you usually used to do. H2O and Sparkling Water are the open source variants for machine learning model building. H2O Q is slated for the end of Q1. It's a new platform that we're building which helps us build models much better, with insights and modeling coming together under one platform. It's an integrating platform that brings in other things, so you can build models and insights at the same time, which seems to be a pain point currently in the machine learning field.
All right, so that's the set of slides about the company, and this is the general structure of my talk. I'll talk about the problems that we usually look at when it comes to identifying malicious behavior. I'll look at what kinds of data you necessarily need to have, how you look at the modeling process itself, the feature design, and the modeling design itself. Now, having said this, I would urge you to call out and ask me any questions that you have at any point in time. I'm completely okay with that. If you want to hold it until the end and see if your questions get answered by the slides, that's fair too. Either way works for me. Don't worry about it. I'm happy to answer anything.
Generally speaking, what is it that we're looking at as malicious behavior? What are the kinds of malicious behavior that we're looking at? There are many different kinds of malicious behaviors. You could have some that are problematic for you, like the ones we have listed up there, or there could be other things as well, like someone breaking a traffic rule. That's not necessarily the kind of thing we're concerned with in the modeling that we're building right now. We are essentially ... Actually, give me one second. Can you guys see the screen from there? Is it possible for you guys? Maybe you might want to scoot in or come ... Okay, great. That's fine.
There are various sets of malicious behaviors that we actually look at when we build models. A good number of them come in the field of electronic fraud, where there are many different kinds of fraud that actually happen. There are people who have stolen access to accounts and are doing transactions in someone else's name, and then there's usually impersonation and phishing. This is one of the biggest use cases that we deal with, where I send you an email saying I'm the king of Nigeria and I need your account, and that's usually how it goes. I'm sure everybody knows about it. Then the usual credit card fraud: someone uses your card in the middle of the night at 3:00 AM. You would never have done it because you sleep at night. Well, that's a problem. Then you have the usual transaction fraud. I would have gotten access to your account. By some sort of [inaudible] means I would have gotten onto the darknet, probably gotten access to some of your account information, and then I come back and use it before the bank and you figure out I've drained all the money out of your account.
The last one, which is personal but unknown transactions and phishing, is actually one of the most interesting use cases that we deal with. Essentially what happens in this type of malicious behavior is ... Actually, I'll put it a different way. What a lot of people have figured out is that stealing your passwords or stealing access to your system does not make sense, because at the end of the day you get discovered. If I were to steal your account usernames or passwords, at some point in time, 10 days later, 20 days later, a month later, someone will figure it out and then the account gets blocked. So what a lot of people do is they run access to your account through your machine while you're on it, so all your credentials are captured and kept on your system, and then they run it from the same system so as to not raise any suspicion with the bank or with you as a user.
That's probably one of the most interesting use cases that we deal with. The way we work on all these things is that we work with many different financial institutions and interested agencies to try and deal with this problem, and we usually see quite a few variants of these things. Much of this work is [inaudible] trying to predict any of these frauds happening around the time, or just after they've happened, and that's essentially what we build models for and what we're trying to solve.
Now, your malicious behavior can be classified in two forms. One is that it's actually criminal, in the legal parlance, if you're looking at it that way. The other is that if you look at it statistically, if you look at it merely as an aspect of the data, it is not normal. Essentially there is something out of the ordinary that stands out, or that should stand out. Let me correct that, because a lot of this kind of data tends to blend in and doesn't necessarily stand out, and that's essentially what we're trying to find. It's not the first problem that we deal with. That's for law enforcement agencies to work on. We deal with the second problem, and by dealing with the second problem we tend to give enough information in case someone wants to go towards legal procedures.
Now, one of the most important things that you need when you're looking at malicious behavior is data that actually supports the kind of modeling that you're doing. A lot of you work in security right now: do you have the best kind of data to deal with, in terms of any kind of identification that you have to do in an attack, or are there shortcomings that you actually face?
Audience Member:
There are shortcomings.

Ashrith Barthur:
And what kind would that be?

Audience Member:
[inaudible].
Ashrith Barthur:
Fair enough. This is not necessarily a problem that exists only in the field of security. It actually exists in many different fields. Your data must complement the kind of model that you're trying to build, so in a very fundamental sense you have to figure out what you're trying to detect. Are you trying to detect individuals who are showing fraudulent behavior, or are you trying to figure out clusters, or to put it in a very simple sense, are you trying to figure out groups of accounts that are operating in a certain way?
Let me give you a very simple example. Let's say a bank has some kind of vulnerability, and using that vulnerability a lot of accounts might actually be compromised. When you compromise a bunch of accounts, what it eventually leads to is all those compromised accounts behaving in the same kind of way. In those situations you are expected to try and identify the clusters of accounts that actually stand out from your normal banking accounts. In other cases, where only one individual's credentials have actually been stolen, you're looking at those individual credentials to try and identify that this is the one that actually stands out. That's essentially one of the things: do you have enough information to identify an individual, or do you have enough information to identify the cluster or the groups that you're looking at?
The next thing is how quickly you want to determine whether there is a breach or malicious behavior actually happening, and this could very well depend on your risk team. This might not be up to the data science team, and it might not be up to the business team either. This usually comes from your risk organization, where they say, "We are okay with taking damage for about a day," or "We are okay with taking damage for about an hour or a minute or so." Based on that, your models will be designed so that you're predicting activity while it's happening, or just after it's happened, or at the end of the day.
A lot of the use cases that we deal with ... I'll give you a simple example. One of the use cases that we deal with is fast credit card transactions. Fast credit card transactions are where a person has actually gotten access to your credit card information, not necessarily the physical card itself but the card details, and they can make a copy of the card, go to a different location, extract as much money as the ATM allows, and quickly get away with it. Essentially, in those situations you can't actually identify it immediately, but you get to identify it at the end of the day. There are different kinds of behaviors that you can identify immediately, and different kinds of behaviors that you can identify in batch.
Now, it's also a negotiation between how much risk you hold and how quickly your models can detect. Your models might be super good, extremely good, with a very high ability to detect these behaviors, but the other problem that stands out is: is your model too heavy to actually detect these things in time? If you have 10 thousand features, or if you have a hundred thousand ... It can't be a hundred thousand features, I would hope not. Maybe a thousand features. Then you would have a problem, because the model would take much more time in identifying these things, so at that point there'll be a bit of a compromise where your risk team will have to say, maybe five minutes, and the model will have to scale down in terms of the features that you actually put into it. That's essentially the question: is the data [inaudible] actually at line speed?
Then comes the much deeper question: have you joined all the relevant tables to make the decision? If not, can you do it at line speed? One of the bigger ones is how robust is your model? Now, we understand that the model building process itself has been commoditized, and what do I mean by that? Your model is not the loved one that you keep to yourself anymore. You build a model every day. As [inaudible] behaviors change, you tend to build newer models, you tend to build them faster, and you tend to implement them really quickly, and that's essentially what is happening right now. And you actually have to do it, because the terrain in which these kinds of behaviors exist changes very frequently. It could change within a day, it could change across a week, which means that you have to be quick enough to build newer models to be able to detect this behavior. But for you to actually have the kind of robustness that you expect in the model, you need to have enough data. Is the data big enough? Is it long enough? Do you have a good enough history, about eight months to a year? Usually eight months to a year is a good enough number, depending on the kind of behavior that you're looking for. Or is all you have just three months of data?
We routinely see a problem among the different agencies we work with, and this is not something I just write out as a hypothetical; it is routinely a problem that we face when we work with different agencies. A lot of agencies do not actually have access to data going back more than three months, which means that the data we're working on covers a much shorter time, which means that we have to build robustness into the model within that period of time.
There we go.
Now, you've got your fundamental sources of data. Are there additional sources of data that you can look at which would probably provide you supplementary information that you can use for your decision making? For example, can you look at web server logs, network logs, application logs, account access logs? Now again, mind you, adding a lot of information can give you a fantastic model, but that does not mean it gives you enough speed, the ability to actually be proactive in trying to identify this behavior. You might be seeing the subtle idea that I'm trying to throw out, that you have to be really fast, and that is the kernel, or the crux, of how you deal with malicious behavior. It's not a model where you're trying to figure out whether a person gets a loan or not. That problem can wait for a day if someone wants a loan.
In trying to figure out this behavior there is a lot of loss associated with it, which means that you are on the clock every time a certain behavior actually happens, which means that you have to make very conscious decisions about how big your model is, how many features are going in, what my risk factor is, how much data I can actually bring in, and that essentially puts you in a tight spot where you have to make these decisions before things go into implementation.
I've come to the next part of the talk, which is what we actually do here: features. Now, one of the biggest decisions that you have to make when you're building models for malicious behavior is building the right set of features that help you identify whatever you're trying to identify, and the features can be largely divided into two groups. These are not purely exclusive; they kind of overlap with each other. There is the individual, and then there is the population. Essentially, the reason why you're building this is that you're trying to build a population characteristic. You're trying to see, for all the members of your financial organization or the agency that you're working with, what's the normal behavior that they usually carry, and then you're trying to identify the individual behavior that everybody carries.
Now, when you're using these two as a comparison metric, you're trying to see how much variation there is from the individual to the population, from the population to, let's say, three months before, or from [inaudible] to three months before, or the week before, or the day before. That's essentially how you're trying to identify whether there is a variation across here. There are three primary, very fundamental feature families that you look at for these individuals and activities.
These three feature families are attributes and activities, of course; interactions, interactions between entities, which could be a population's interactions, a system's interactions, or an individual's interactions, and that's essentially what you're building; and finally networks, where you're trying to see how a larger system is actually working, what kinds of interactions it's actually leading to, the egress and the ingress. I'll give you a few examples and that will probably illustrate these things much better.
Let's say you're looking at a single individual and you're trying to establish a baseline. When you're trying to establish a baseline, you're trying to find out what kinds of behaviors this person naturally espouses. You're trying to see whether this person naturally transacts using cash or usually credit. That's essentially to establish what kind of instrument this person usually uses when they're doing transactions. Then you're trying to see what their natural set of interactions is. What is the natural course and set of interactions that they carry out on a daily, weekly, monthly basis? That essentially tells you how periodic or non-periodic this person is in terms of their activities.
This is actually one of the bigger clues that you get when you're trying to move to actual interactions between different people. Then you're trying to see the geographical identity, and by geographical identity I don't mean race. I actually mean what IP address you're coming from, what your usual activity location is, what kind of activities you conduct from a certain kind of location, for example. Let's say I'm applying for a new credit card. I would probably not be applying from an airport terminal. I would probably be doing it from my house, or from a residential location, because there is a certain association with where you do these kinds of activities. You also try to see the periodicity of where you come from. Is this person naturally residential? Does this person conduct their activities from the office as well, or usually when they're mobile? Essentially that is what establishes the kind of probabilistic score that you can associate with them in terms of how much risk you can assign.
How does the social identity fit? Now, you can take information that you can get from social graphs and bring that in, and you can ask how many people this person naturally interacts with. Is there a lot of interaction, one-off interaction, periodic interaction, or complete randomness? That is something you can establish when you're trying to identify the social identity of the person as well.
Then you're looking at different kinds of ingress and egress. This ingress and egress can be of many different kinds. In this case we'll probably just focus on the financial instrument. Now, if you were to look at my bank account, the only ingress into my financial account is my paycheck, but then there is egress for natural utilities, egress for credit cards, egress for different kinds of payments that I necessarily have to make. With that too you can establish a certain kind of periodicity, and then you can say there are some that are not necessarily periodic but are also not necessarily risk-worthy, because they're not that active.
You can also try and use past triggers. Past triggers are actually one of the biggest identifiers that help you detect whether this person has a natural disposition to actually [inaudible] interesting behaviors. You would look at, let's say, how many times this person has triggered off a loss of credit card. If there is a loss of credit card, then you know the risk quotient for this person can actually be higher than the rest of the group you associate them with. You can also look at volume rates, in terms of the rate at which this person operates in terms of their amounts. You can also look at the amount of volume itself. You can look at that as well as a measure of how risky or not so risky it is.
Operational window: the idea is very intuitive. It tries to find out when you are actually active. A lot of people tend to pay bills, pay credit cards and all these things over the weekend. We don't necessarily do all these things on a weekday. That essentially puts you in a separate group. You can establish this kind of information as well.
Now, with all these identities that you establish, you can also establish a second form where you replace the individual with the population. You segment the data and you say, okay, this person belongs to this group, how does this group naturally operate, and how much does this person vary from the group we expect them to belong to? That will essentially give you two kinds of measures. It will give you, of course, the population metrics, the population statistics for all these things that you measure, and then it will give you the individual statistics, which you can use and compare to try to find out how much of a risk quotient you can actually associate with this person.
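To make that individual-versus-population comparison concrete, here is a minimal sketch in Python/pandas. The column names (account_id, segment, timestamp, amount, channel) are illustrative assumptions, not fields from the talk; the point is simply that each account gets its own baseline and is then scored against its segment's population statistics.

```python
import pandas as pd

def baseline_features(txns: pd.DataFrame) -> pd.DataFrame:
    """txns: one row per transaction with hypothetical columns
    account_id, segment, timestamp (datetime), amount, channel ('cash'/'credit')."""
    txns = txns.copy()
    txns["is_weekend"] = txns["timestamp"].dt.dayofweek >= 5

    # Individual baseline: how does each account normally behave?
    acct = txns.groupby("account_id").agg(
        segment=("segment", "first"),
        avg_amount=("amount", "mean"),
        cash_ratio=("channel", lambda c: (c == "cash").mean()),
        weekend_ratio=("is_weekend", "mean"),
    )

    # Population baseline: how does the account's segment behave overall?
    seg = txns.groupby("segment")["amount"].agg(["mean", "std"])
    seg.columns = ["seg_avg_amount", "seg_amount_std"]
    acct = acct.join(seg, on="segment")

    # The feature the model scores on is the deviation of the individual
    # from the population it is expected to belong to.
    acct["amount_deviation"] = (
        acct["avg_amount"] - acct["seg_avg_amount"]
    ) / acct["seg_amount_std"]
    return acct
```

The same pattern extends to the other baselines he mentions (instrument mix, operational window, geography): compute the statistic per account, compute it per segment, and feed the deviation to the model.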
The next set that we look for, again, is interactions. Interactions are another beautiful set of features that we use, which are actually super useful when it comes to predictions. Sometimes we don't necessarily have a good amount of individual or population statistics. At that time we use interactions as ... Interactions tend to stand out as good features when you're building these models. In interactions you're looking at age groups. What is the ...
Do you have a question? Sorry. Yeah, okay.
In interactions we're looking at different age groups. Different age groups tend to operate in different ways. Usually the age group of 50 and higher, maybe not 50 anymore, I would probably say about 60 and higher, tends to operate much more in cash. You have to be watchful also based on the kind of instruments that you're looking at. For example, a person might actually withdraw cash and hand it to someone, and on the other end that person might deposit cash into their account, so although that's not a detectable link, with a probabilistic measure you can say that this amount was essentially transferred to this person using cash. For that you can use different kinds of metrics, where you use the ingress and the egress to try to identify how much money is actually coming into the account and how much money is actually going out of the account, based on the instruments that the person is using.
One of the indicators that we use to try and segment across the group is age, which tends to help us identify different age groups and clearly figure out what's happening out there. Loyalty: not a subjective thing, it's not brand loyalty. It's more like how long you have had this account and, let's say, how many people you have actually tried to bring in to the financial institution or the agency, which would be an interesting measure. If you've had the account many years, you tend to have a low risk score. Not necessarily just a matter of years; we don't necessarily give a low risk score, but we definitely associate a risk factor with that person. Then we also look at the periodicity of interactions. How frequently do you pay your kids? How frequently do you pay the bar tab, or any of those things?
We also look at seasonality. Every time the Super Bowl happens, a lot of people put in bets, which essentially means that a lot of money starts to move into certain kinds of accounts. That is not necessarily dangerous; it's just a seasonal thing. One of the other things that we also see, especially outside America, is that when it's summer there tends to be a lot more activity in terms of transferring money, because people are traveling and they're trying to share money or pay someone back or something. Seasonality tends to increase the amount of interaction as well.
There are also system interactions. How much does this entire system tend to move periodically? Do all these people interact, were they all interacting for the last three months, were they all interacting for the last six months, were there any new entities, were there any old entities, did things change? Very similarly, you're also looking at the volumes, trying to see how much of this interaction carries over a year, a six-month period, or a month. You also look at the interaction rate: how many times does this person actually interact with another person?
Now again, just like we did for an individual, you can also look at interactions across a population. The interactions in a population are very interesting, because with individuals it's very easy: there are individualistic attributes, you can cluster them, and it ends there. But across populations you're not only looking at interactions within clusters, you're also looking at interactions between clusters. That essentially puts you in a very good spot to see how mobile this person is, how mobile this interaction is from one cluster to another, and that gives you a good measure as well.
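A sketch of the kind of pairwise interaction features described here, again with hypothetical column names (sender, receiver, timestamp, amount): it measures how often, how regularly, and with how much volume two entities interact.

```python
import pandas as pd

def interaction_features(txns: pd.DataFrame) -> pd.DataFrame:
    """txns: hypothetical columns sender, receiver, timestamp (datetime), amount."""
    txns = txns.sort_values("timestamp").copy()
    # Gap between consecutive transfers for each sender -> receiver pair.
    txns["gap_days"] = (
        txns.groupby(["sender", "receiver"])["timestamp"].diff().dt.days
    )

    pair = txns.groupby(["sender", "receiver"]).agg(
        n_transfers=("amount", "size"),
        total_volume=("amount", "sum"),
        first_seen=("timestamp", "min"),
        last_seen=("timestamp", "max"),
        gap_std_days=("gap_days", "std"),  # low spread = periodic interaction
    )
    # Interaction rate: transfers per day over the pair's active window.
    active_days = (pair["last_seen"] - pair["first_seen"]).dt.days + 1
    pair["transfers_per_day"] = pair["n_transfers"] / active_days
    return pair
```

The population version of this is the same computation aggregated at the cluster level (cluster-to-cluster pairs instead of account-to-account pairs).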
Now mind you, these two sets of things I'm actually talking about are the kinds of features that you actually build for your model. These features go into your model and help you figure out whether something is a real transaction, a fraudulent transaction, or some kind of malicious behavior in the system. That's essentially what we're getting to. What you've seen in the first set is, I mean, sorry, attributes of individuals or of a [inaudible] population, and here you're looking at the same set, but you're looking at its interactions.
The next one is actually a bit more interesting: you're looking at a network. Now, this tends to blow up the problem just a bit, but it also gives you a lot more information in terms of what you're trying to find out. One of the biggest things you're trying to find out is whether this network is convergent or divergent, and the reason is that a lot of money laundering techniques and fraudulent transfers of money tend to have convergent networks. So there'll be 10 people ... How many people here have actually heard of a concept called smurfing?
Great. It's not smurfing in the networking sense. It's smurfing in the financial transaction sense, but it's the same principle. The idea is very simple. I don't want any financial institution or any organization to detect that I'm transferring 10 million dollars, so what I do is break up that 10 million dollars among about a thousand people and ask them to deposit it into their accounts. After a set number of days, I ask them to transfer it to another person, person X. Each of them will have ... Let's say a thousand people converge to a 50-person network, and then those 50 people eventually converge to, let's say, 10 people, and these 10 people actually withdraw the money, and then it's taken out of the system without any detection. That's actually fraudulent. That's a form of money laundering, which entitles you to a very fantastic [inaudible] case, but that's of course the other side of the story.
Those networks tend to be fraudulent. Sorry, those types of networks tend to be convergent, because the people who are conducting these kinds of exercises don't want to just leave the money in your bank account. They actually want to take it out, which is essentially their Achilles' heel. We can catch them at the point where the networks are convergent, because we know exactly how things are converging, and in normal operational behavior the networks don't necessarily converge. Yes, of course, everything converges to PG&E, but PG&E is not doing money laundering. We know that. That's essentially how things work.
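As a rough illustration of spotting that kind of convergence, here is a sketch using networkx on a transfer edge list. The column names and the thresholds (fan-in of 20, a 10:1 fan-in to fan-out ratio) are assumptions for the example, not figures from the talk.

```python
import networkx as nx
import pandas as pd

def convergence_candidates(txns: pd.DataFrame, min_fan_in: int = 20) -> pd.DataFrame:
    """txns: hypothetical columns sender, receiver, amount.
    Flags accounts where many distinct senders funnel money in while
    few distinct receivers take money out (fan-in >> fan-out)."""
    g = nx.DiGraph()
    for row in txns.itertuples():
        # Accumulate total volume on each directed edge.
        prev = g.get_edge_data(row.sender, row.receiver, {"weight": 0.0})["weight"]
        g.add_edge(row.sender, row.receiver, weight=prev + row.amount)

    records = []
    for node in g.nodes:
        fan_in = g.in_degree(node)             # distinct senders into this account
        fan_out = max(g.out_degree(node), 1)   # distinct receivers out of it
        if fan_in >= min_fan_in and fan_in / fan_out >= 10:
            records.append({
                "account": node,
                "fan_in": fan_in,
                "fan_out": g.out_degree(node),
                "inflow": g.in_degree(node, weight="weight"),
            })
    return pd.DataFrame(records)
```

Legitimate sinks like a utility company will also show high fan-in, which is why, as he notes, this is one feature among many rather than a verdict on its own.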
This is one of the reasons why we look at larger and larger systems, and you can look at this at different sizes. You can look at the convergence and the divergence of a network across segments, within segments, within subsegments, or within clusters. You can look at it in many different ways to try and identify how this entire system or network is actually moving. You're also trying to see what the operational periodicity of these networks is. Is this transaction a one-off thing that's coming into the network, or is it periodic? Does this kind of money actually move across the network periodically?
You also try to see the active time. How much of a given period of time is the network actually active? How much activity does this person tend to have? And you look at something that's very interesting, called lock level. Lock level is nothing but a window which you use to look at different parts of the network at a given period of time. You're looking at many different areas across the network to see if something is coming out, or whether the money entering one part of the network, in whole or in large part, is actually getting into a different part of the network as well, and that essentially tells you that these sets of individuals or clusters are actually operating as one network.
If you're doing lock-level mapping, then essentially, if, let's say, three people out here and three people out here are not transferring anywhere close to, let's say, 80 to 120 percent of the money that they've got between them, then you can say that they're not necessarily on the same lock level, and that helps you identify whether they're part of the same network or not. You're also looking at any recent changes in behavior. Did the network grow? Did it grow in a much larger fashion?
For example, this is probably going to be very interesting. Across Europe, one of the things we see is dead people coming back, and that's not necessarily a cool thing. It's just that accounts of dead people actually come back. That's all, there's nothing else to it. What happens is there are times when these accounts come back and the number of people in the network increases. There's a really large number of people that swarm the network, and then you see that there is a lot more activity than you would expect. Those are the recent behavioral changes that you're looking at.
Then one of the things you also want to do is try to find out who the probable set of culprits sitting on the network are. You're essentially looking at the last egress point. Where is this money actually coming out, or where are these transactions actually converging? Is it the same person all the time? As I said, it could be PG&E or it could be an actual person. If someone is malicious, the egress points tend to change quite frequently. That gives us an indicator to highlight and say, this person was the egress point for this network three months ago, but the same person is not the egress point for the network anymore, and that change helps us identify that there is interesting behavior across this network.
One of the last things, and probably one of the most valuable things that we look at, is the pass-through-to-origin ratio. Essentially, how much of the transactions, and it could be volumes or merely counts of transactions, so volumes or values: how much money actually passed through the system, how much got retained in the system, how much converged well before the egress point in the system? That gives you a good indication of a segmented set where either there is a lot of fraudulent transaction happening in it, or it's not necessarily fraudulent, it's just a diversion network and everything seems to be hunky-dory.
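A minimal sketch of a pass-through style measure, computed per account from the same kind of edge list (column names are assumptions, and a real implementation would window this by time):

```python
import pandas as pd

def pass_through_ratio(txns: pd.DataFrame) -> pd.Series:
    """txns: hypothetical columns sender, receiver, amount.
    A ratio near 1.0 means almost everything that comes into an account
    goes straight out again (a pass-through / mule pattern); a ratio near
    0.0 means the account retains what it receives."""
    inflow = txns.groupby("receiver")["amount"].sum()
    outflow = txns.groupby("sender")["amount"].sum()
    accounts = inflow.index.union(outflow.index)
    inflow = inflow.reindex(accounts, fill_value=0.0)
    outflow = outflow.reindex(accounts, fill_value=0.0)
    # Accounts with no inflow get NaN rather than a misleading ratio;
    # cap at 1.0 so accounts also spending their own money don't dominate.
    return (outflow / inflow.replace(0.0, float("nan"))).clip(upper=1.0)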
As I said, let me quickly jump back to this. As I said, there are three primary things that we try and build when we're looking at malicious behaviors across financial transactions: attributes and activities, which could be of individuals or populations; interactions; and networks. Yes?
Audience Member:
How do you go about trying to demarcate the network or find boundaries in a network?

Ashrith Barthur:
That's actually a fantastic question. For people who couldn't [inaudible], the question is how you demarcate the network. The idea is we use natural segments to first try and identify whether the behavior seems to be confined to a cluster. Within the cluster, does the behavior tend to be the same? Does the behavior tend to pass over to another cluster? So we try and figure out whether the clusters are exclusive or there is some crossover in terms of behavior. If there is a crossover, then we recluster. You could also have a large cluster which could be subsegmented, but then two segments of two different clusters could be the same, so then we recluster them as well. This part, I would say, is much more trial and error, much more of a [inaudible] problem rather than a [inaudible] here-you-go, this-is-the-solution. You have to look at the data and you have to do it multiple times to get the right set.

Audience Member:
Is the starting point [inaudible] often or demographic or geographic region, or is this hitting too close to home?

Ashrith Barthur:
So that's actually a fantastic question as well. The starting point ... We use both, because we work with organizations around the world. We tend to use geographical boundaries as a natural starting point, but then we automatically introduce the institutions as the next subsegment within that.
I'll switch to the next part. If there are no other questions, I'll just switch to the next part, which essentially talks about how we look at the models as well. One of the reasons why we're looking at models is that we tend to have a lot of data scientists and machine learning engineers who work with me who tend to say, "Oh, I've got this to 0.95 or 0.97. How do I make it 0.98 or 0.99?" or whatever. Something that I very strongly believe is that those numbers don't actually matter. What matters is how much of the problem you are solving based on the risk factor that your institution actually gives you. If you're working within the risk limit that your institution provides, then you're fine, but you can of course improve the modeling by making better features and by tuning your model, because, to be fair, almost all modeling techniques right now tend to be really good, and especially for identifying malicious behavior I would say that I have found good and bad results with all kinds of modeling techniques, so there is no one go-to model. I would say everything works about the same way, but of course I do a lot of random [inaudible] myself just because I like it. That's about it. There's nothing else to it.
In terms of the model, there are a few fundamental questions that actually come [inaudible]. One of the fundamental questions, of course, is what you are actually trying to solve in this case. Are you trying to solve an ML problem or are you trying to solve an AI problem? There is a chasm between these two, and we don't necessarily try and figure that out. We tend to speak about them in the same breath, and that's essentially not fair. I'll probably speak about it just a bit, but if I were to put this question out to you in the audience, how would you look at it? Are you solving an ML problem or are you solving an AI problem? Anyone could ... It's perfectly fine. The mic is free.
Audience Member:
Are you trying to differentiate between a research problem versus-

Ashrith Barthur:
Not exactly.

Audience Member:
[inaudible].
Ashrith Barthur:
No, no, no. All right, if nobody wants to [inaudible], I'll try and not speak about it. The idea of it ... Actually, maybe the next slide might help. The idea of ML is much more just classification: what you're trying to do is get a model to learn a certain set of behavior, and then you're using that to predict a certain set of behavior. In AI there is a bit more to it. You're not just ... The whole process of ML seems much more linear. As I said, if I were to give you an example, very simply put, in ML you're just trying to find out whether it's fraud or not. You are essentially concerned with the very simple decision that you've given it a set of data and you want to find out whether fraud is happening, whether this transaction is fraudulent or not. It's a very linear problem, and the question you tend to ask is what classification or what group this transaction falls into, and that problem is very linear. It doesn't necessarily give you any insight into what you're looking at. Your model doesn't necessarily carry the insight. It's just: I've learned a set of things, and based on these things I know how to predict, and that's essentially what an ML problem is.
But when you switch the same thing to AI, the solution is actually emergent. What you're trying to do is not just classify whether it's fraud or not, but also classify what kind of fraud it is. It's answering a much deeper question. You're not only using the data set, the features and the techniques, the domain knowledge and all those things that you've got, to answer the basic question, which is: is this fraud or not? You're also trying to identify whether this is a certain kind of fraud, or, if you want to put it another way, what type of fraud it is, or why it is fraud. That's essentially what you're trying to identify.
There are two fundamental differences there. I think David [inaudible] probably does a much better job of explaining what emergent means, so I would defer to him, but yeah, that's essentially what you're looking at and what you're supposed to do. If you're looking at merely a machine learning problem, then a very simple model that helps you classify yes or no makes sense. That's about it. That's essentially what you should be looking at. But if you're looking at a problem which is much more AI, where you're trying to find out whether this is impersonation, or transaction fraud, or a personal-but-unknown transaction sitting on your own machine, then you're looking at many more sources of data and much richer feature sets that actually make sense. From a domain point of view you're able to understand what these features are, you're able to classify, you're able to actually identify what this behavior is, and that essentially makes it an emergent problem.
Audience Member:
Does ...

Ashrith Barthur:
Yes, please, yeah?

Audience Member:
[inaudible] AI probably can automate the decision associated with understanding-

Ashrith Barthur:
Absolutely, yes. Yeah. It puts you much farther away from this. It puts a bit of a distance between you and the system. I don't want to say it puts you far away from the system, that sounds a bit scary, and we don't want that, but it puts a bit of distance between you and the decision making, because the system itself is intelligent enough to make that decision.

Audience Member:
[inaudible].
Ashrith Barthur:
Yeah. I think there was another question. Okay. Yes?
Audience Member:
Is it like there is another ML problem after a bigger ML problem?

Ashrith Barthur:
It actually is. Yes, you're absolutely right.

Audience Member:
So why would it be AI? [inaudible].

Ashrith Barthur:
So that's a fantastic point, and this of course is subjective, and that's essentially the idea of what emergence is: it's much more than the composition it actually has. Your problem could entail many different models by the time you actually get to that intelligent decision making, or it could be just two models, and the system that you build with all these models could be smart enough to actually make that decision. But you're absolutely right. It could take many models to get you to that point, or it could take just a few models to get you to that point. Yes?
Audience Member:
Do you also use a [inaudible] technique for AI instead?

Ashrith Barthur:
I would beg you not to do so, because the only thing that happens when you're using a [inaudible] approach for problems of this kind is that it tends to flatten the problem, and here there is a sense of a hierarchical approach. If you know the construct of how intelligence is created, you have data, then you have knowledge, then you have intelligence, and then you have wisdom, so the number of models that you actually put out eventually moves you up the pyramid to where you have intelligence. But if you flatten the model, then you're necessarily just stuck at a place where you have data or knowledge. Yes?
Audience Member:
Yeah, I guess what you're saying is ML is one specific model predicting one specific component of a larger system.

Ashrith Barthur:
Yes, yeah.

Audience Member:
AI is the larger system, because here, what type of fraud is it? I could have maybe multiple labels of fraud.

Ashrith Barthur:
True, yeah.

Audience Member:
It could be expanded upon beyond binary.

Ashrith Barthur:
For that I'll-

Audience Member:
That's still a machine learning problem, correct?

Ashrith Barthur:
I'm sorry, say that again.

Audience Member:
If I had three categories or 300 categories of what type of ... That's still an ML problem.
Ashrith Barthur:
It is. Of course it is, and that essentially goes back to what the gentleman was saying: it's actually a sequence of multiple models that eventually gets you to a point where you're making intelligent decisions. I'll give you a very simple example of how this approach works. You have a model which tells you whether a transaction is fraudulent or not. That model is not smart enough to identify what kind of transaction this is, or what kind of fraudulent transaction this is. You might need another model, with an extra set of information, which is capable of classifying what kind of fraudulent transaction this is. It's like a chain of models that will eventually get you to your point.
Audience Member:
Yeah, yeah, that's cool.
Ashrith Barthur:
That's essentially how we build it. That's essentially how we build the system, because a lot of the financial institutions that work with us want to keep the decision making as objective as possible. In the sense that they want systems to do these things as far as possible, and that's essentially what we're trying to build when it comes to identifying malicious behavior. There is a genuine reason why they want to do that: it's because everything is becoming electronic. We don't necessarily go to the banks anymore. Much of our transactions, most of our transactions, happen online.
Way back, I think in 2012, credit card agencies were running probably close to 30 thousand transactions a second. Now they tend to have about 300 to 400 thousand transactions per second, which essentially does not give you a good enough cushion to put enough people on actually [inaudible] whether something is fraudulent or not, which is one of the reasons they want models that can make these decisions themselves and come to a point where they say, hey, you know what, this is what I actually think it is, which is why we take this approach to building the models.
Having said that, you could have many different ways of doing this. An ML problem is usually, almost always, a supervised approach. In whatever [inaudible], I haven't necessarily seen an unsupervised approach in the models that we build, at least for fraudulent techniques. In AI it could be a combination of both, and for this very example, for trying to identify fraudulent transactions, we've built supervised models whose outcome gets fed into an unsupervised model, which we then try to use to classify what kind of fraudulent technique this is. Of course the whole pipeline has to be as lean as possible, and this is what we usually implement in most of the financial organizations that we work with.
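As a hedged sketch of that two-stage idea (a supervised fraud/not-fraud model whose flagged cases feed an unsupervised model that groups them into candidate fraud types), assuming scikit-learn, a pre-built feature matrix X, labels y, and enough flagged rows to cluster:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans

def fit_two_stage(X_train: np.ndarray, y_train: np.ndarray, n_fraud_types: int = 10):
    """Stage 1 answers 'is this fraud?'; stage 2 groups flagged cases
    into candidate fraud families for analysts to label."""
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    fraud_mask = clf.predict(X_train) == 1
    clusterer = KMeans(n_clusters=n_fraud_types, n_init=10).fit(X_train[fraud_mask])
    return clf, clusterer

def score(clf, clusterer, X_new: np.ndarray):
    """Returns a fraud flag per row and, for flagged rows, a cluster id."""
    is_fraud = clf.predict(X_new) == 1
    fraud_type = np.full(len(X_new), -1)   # -1 = not flagged
    if is_fraud.any():
        fraud_type[is_fraud] = clusterer.predict(X_new[is_fraud])
    return is_fraud, fraud_type
```

The model choices here are placeholders; the point is only the shape of the pipeline, with the second, unsupervised stage producing the "what kind of fraud" signal he describes.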
Sorry. So, having said that, this is basically a slide where I'm trying to show how our models have actually [inaudible]. Of course, we started off with the idea that, oh, you know, we can solve the whole intelligence problem with one model. Then you build a model, but that doesn't necessarily get you to the point, so you necessarily scale down.
Most of the systems in the current day, most of the systems across the world ... Actually, I wouldn't say current day; a lot of them are switching. I would probably say about two years ago they were heavily rule-based. There were many different systems that were extremely rule-based, which were trying to identify fraudulent transactions, money laundering, or any of these malicious activities using rule-based systems. The problem with these rule-based systems was that you could very easily skirt around them and you could miss the detection, which means that it was very, very easy for you to get through the system.
The way we evolved it is we took the rule-based systems, we built feature sets, we built the model [inaudible] parameters, and then we used classification checks. For the classification checks we actually put a human being in the system to verify whether a certain set of classifications was doing a good job or not. We used those techniques to rebuild much better models, and essentially we are at a point where we have done the classification with ML models using these techniques, so from pure rule-based systems we have moved to a classification model.
Now, the next approach that we have built is self-discerning models: models that are smart enough, or intelligent enough, to actually tell us what kind of fraudulent activity is going on, and for this we don't necessarily need classification checks. We have built enough learning that comes out of existing models, but we still use the features, the set of features that we spoke about. We still use parameter tuning to tune the model to the [inaudible] risk factor and best specification, and we've come to a point where we have models that can actually make a decision on their own. I wouldn't say they are super intelligent. I would say they are intelligent in a very narrow way, very humbly put. They probably can't do anything else other than identifying, say, 10 classes of fraudulent activity, or saying it doesn't belong to any of these 10 classes, it could be one more new class of fraudulent activity that I am not able to classify but that you need to take a look at. Essentially that's where we are right now. Maybe we'll see some more development in the future.
Yeah, having said that, I think I'll probably stop here and open it up if you have any questions, comments, anything at all. Thank you. Yes?
Audience Member:
Two questions. One, are there certain combinations of integer values, not necessarily, obviously, a huge amount or a very small amount that could be anomalous, but actual numbers that you just don't see in combination or permutation very often as a transaction, that seem anomalous?

Ashrith Barthur:
I'm sorry, I think I missed it after the ...

Audience Member:
So an example would be, you'll see a lot of different transaction values like 99 cents or a dollar or something like that, but what if it's a combination of numbers that you just don't see very often as a value?
Ashrith Barthur:
Okay, so that I think is a fantastic question that leads us to some of the features that we built. We also look at sequences of transactions. Let's say you usually do 99-cent transactions quite a lot and then you suddenly do a thousand-dollar transaction, or let's say even a hundred-dollar transaction. That tends to create a feature which says, hey, you're seeing a sequence of transactions which is very interesting, this needs to be a feature, so let's add that in as well. We do tend to add that, yes.
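A sketch of that kind of sequence feature, assuming a per-account transaction stream with hypothetical columns account_id, timestamp, amount: the feature is simply how far the current amount sits from the account's own recent history.

```python
import numpy as np
import pandas as pd

def amount_surprise(txns: pd.DataFrame, window: int = 50) -> pd.Series:
    """Rolling z-score of log-amount against the account's own history,
    so an account that usually makes 99-cent purchases lights up when a
    $1,000 transaction suddenly appears in the sequence."""
    txns = txns.sort_values(["account_id", "timestamp"])
    log_amt = np.log1p(txns["amount"])
    grouped = log_amt.groupby(txns["account_id"])
    # shift(1) keeps the current transaction out of its own baseline.
    mean = grouped.transform(lambda s: s.shift(1).rolling(window, min_periods=5).mean())
    std = grouped.transform(lambda s: s.shift(1).rolling(window, min_periods=5).std())
    return (log_amt - mean) / std
```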
Audience Member:
Then the other question is, are there certain things that are fraudulent, or fraudulent trends, that you train on and may catch early on, but then you're not training on those anymore because they're less prevalent?

Ashrith Barthur:
Okay, yes. We tend to face that problem, but most of the institutions that we work with are not necessarily completely proactive. Partially proactive is how I would put it, which means that we still retain trends that were there yesterday, but we still make some models oblivious to the trends that were there, so you could have two different models, where one is essentially just trained to find something interesting while the other one is trained to figure out what trends this fits into. Yes. Yeah?
Audience Member:
Very early in the talk you mentioned concerns around latency and keeping that down. I'm curious, what are some of the things that end up being most challenging from a latency perspective, and how do you think about addressing those?
Ashrith Barthur:
One of the biggest problems that we face when we're addressing latency is having a comprehensive set of data but not being able to bring it together to actually make a decision, and if I were to boil it down to a very simple problem, it's the problem of joins between many different tables. How do I bring it together? Initially what we used to do is join everything together at line speed, or just before something is going to happen, and then use all the features to make a decision. But now what we do is we have an initial set of models that tell us what class of problem this might be, what class of fraudulent behavior this might be. That means there is a selective set of features that actually get joined, not necessarily all the tables, and that helps us make a decision, so that cuts down the amount of time it takes for us to process things at line speed. But that does not mean we are there yet.
After that, the next step would actually be a curated set of features. For example, some of the models that we have built have about 1,700 features. They don't all make sense when we are making the decision. We use just about 100 features. We bring it down to a much smaller level to be able to make the decision much faster. That reduction is also a place where your risk team comes in and tells you, hey, you've brought the features down to 110 but the risk has increased by a certain factor, and whether we can take that or not is a discussion that actually has to happen.
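One way to picture the gating idea he describes is a first-pass model that decides which feature tables are worth joining for a given transaction before the heavier model scores it. This is a hypothetical sketch, not H2O's implementation; the class names, feature groups, and callables are all invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical mapping from the first-pass model's coarse class to the
# feature groups worth joining at line speed (illustration only).
FEATURE_GROUPS: Dict[str, List[str]] = {
    "card_fraud":       ["account_baseline", "geo"],
    "account_takeover": ["account_baseline", "device"],
    "laundering":       ["network", "interactions"],
}

def score_at_line_speed(
    txn: dict,
    gate: Callable[[dict], str],                        # cheap model: coarse class
    scorers: Dict[str, Callable[[dict, dict], float]],  # heavier model per class
    feature_stores: Dict[str, Dict[str, dict]],         # precomputed features by group
) -> float:
    problem_class = gate(txn)
    features: dict = {}
    for group in FEATURE_GROUPS[problem_class]:
        # Only the relevant tables are looked up, keeping the join cost low.
        features.update(feature_stores[group].get(txn["account_id"], {}))
    return scorers[problem_class](txn, features)
```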
Audience Member:
I guess a follow-up question to that would be, is there some sort of offline model where [inaudible], maybe instead of having a second of latency you have as long as you want for some [inaudible] job where you go back and revisit these?
Ashrith Barthur:
Yes. One of the other things that we do is, this is an online model, I wouldn't say online, it's near line speed. We also have another model which comes through after the data comes into the system and we know it's been classified in a certain way. That model builds all the features that we actually put into the model and then reclassifies the thing as right or wrong based on what was classified earlier, so if there is a difference, it gets fed through a manual handler who puts it back in the system and says, "This is where your classification was wrong." At the end of the day, or the end of the week, a new model build will be triggered. Yeah?
Audience Member:
I have two questions actually.

Ashrith Barthur:
Please, yeah.

Audience Member:
For this, just automating the [inaudible], so before [inaudible], how do you translate learning into your models [crosstalk]? Second part is, what [inaudible]?
Ashrith Barthur:
Okay, so one of the things that we have done is try to understand how people make decisions, like investigators who have made decisions; we try to see what the important things are on which they have actually made decisions. We don't necessarily take every aspect, or every type of, I would say, assistance that they've used, make it into an object, and bring it into the model. We use things that can actually be represented numerically or in a classificational way, and that's essentially what we bring in.
If someone new is actually getting into this, into the investigative field, I would say it would behoove that person to try and understand how a model actually classifies, because there is a new problem coming out of this, and I don't know if a lot of people know about it: wrong classification. Of course the model is not perfect, so a lot of the things that have been wrongly classified we randomly pick out of the set of classifications and send to manual handlers. If these manual handlers know how things work, and [inaudible] of course knows how these things work, they can come back and say this is wrong or this is right because of a certain feature that we put into the model. That would be an extra set of skills that would be very valuable. Yes?
Audience Member:
Yeah, my question is on the classifying part, [inaudible] of features, especially the network features. I just wanted to ask in general how you guys generate the features: do you have different classifiers that classify the same, for example [inaudible]? Do you just push a set of transactions to a classifier and it tells you is it [inaudible] and you use that as a feature [inaudible], or do you have one model and you push all the transactions [inaudible]?
Ashrith Barthur:
We have one model that actually does this for us, but the network features that you talk about, the network doesn't necessarily change at line speed. It changes at a much slower pace, so we tend to precalculate this well in advance, like at the start of the day, for example. We tend to recalculate this every 24 hours, or every week, and we keep it ready. Sometimes there are places where we actually recalculate this every hour, which means that there is a set of functionality happening at another place where a new set of features is coming in, but some places are very comfortable having these network features for a day or for a week. Yes?
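A sketch of that pre-computation pattern: slow-moving network features are rebuilt on a schedule and the line-speed scorer only does a lookup. The cache layout and refresh interval here are assumptions for illustration.

```python
import time
from typing import Callable, Dict

class NetworkFeatureCache:
    """Recomputes slow-moving graph features on a fixed schedule and
    serves them from memory so scoring stays at line speed."""

    def __init__(self, build_features: Callable[[], Dict[str, dict]],
                 refresh_seconds: int = 24 * 3600):
        self.build_features = build_features   # e.g. a batch graph-feature job
        self.refresh_seconds = refresh_seconds
        self._features: Dict[str, dict] = build_features()
        self._built_at = time.time()

    def lookup(self, account_id: str) -> dict:
        # Refresh lazily once the cached features are older than the schedule.
        if time.time() - self._built_at > self.refresh_seconds:
            self._features = self.build_features()
            self._built_at = time.time()
        return self._features.get(account_id, {})
```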
Audience Member:
[inaudible].

Ashrith Barthur:
Please, yeah.
Audience Member:
Let's say you classify something as [inaudible] or whatever the classification you're doing, and if I ask you why did you say it's fraud, how do you go back and infer what feature was responsible for that? How do you answer that question?
Ashrith Barthur:
Fair enough. If you actually look at it ... Let's take any of the features that we're using. Let's say there was some individual feature; I'd probably take network features, because those are some of my favorites. You're looking at some of the network features out here, and some of these features actually come by and tell you that there is a certain reason why your transaction is classified as fraud. Now, look at convergent or divergent, for example, or look at, let's say, lock level, or recent change in behavior. These are intuitive features. These are not just combinatorial features that you build but don't necessarily understand.
These are features that you can actually take to your business and say, "Hey, look, there are a set of entities in this network that have a convergent behavior, which is one of the reasons why, based on the network part of the features that we built for this model, it tends to say this is fraudulent." Or, for example, a recent change in behavior: the number of entities could have increased, which essentially says that because of this there is a complete change in how the volumes are moving in this network. That is a true feature that you can take to the business and say, "Look, you understand what this is, and our model is saying that this is an important feature," and that's essentially one of the explanations you can give.
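A small sketch of how that kind of explanation can be assembled: put the flagged entity's intuitive network features next to the population baseline so a business user can read it directly. The feature names here are hypothetical.

```python
import pandas as pd

def explain_flag(features: pd.DataFrame, flagged_id: str,
                 intuitive_cols=("fan_in", "pass_through_ratio", "new_entities_30d")):
    """features: one row per entity, indexed by entity id, with the
    hypothetical intuitive columns above. Returns a small table showing
    the flagged entity's value next to the population median."""
    rows = []
    for col in intuitive_cols:
        rows.append({
            "feature": col,
            "flagged_value": features.loc[flagged_id, col],
            "population_median": features[col].median(),
        })
    return pd.DataFrame(rows)
```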
Audience Member:
Basically there is a human being that is sitting behind the model who is going through your 110 features?
Ashrith Barthur:
No, there's no human being per se. There is a human understanding to it. This is not just an x, and when you figure out why your model is classifying something as fraudulent, it's not saying "because of x." You actually understand what x is, and that gives your business a legitimate reason to figure out, oh, okay, maybe we have a problem here, maybe we have a lot of money laundering working through our financial organization, and that gives you an intuitive idea of what's happening. But if [inaudible] is convergent and I just calculated a feature ... Let's say I blindly built features, all possible combinatorial values that I've got, I just built features, threw them into a model, and made predictions.
How do you explain to anyone what that is? You would only be able to explain, from a very experimental standpoint, that this feature results in certain behavior, but can you say that that feature is consistently going to predict all the other behaviors? Maybe not. Which is one of the reasons we focus on building features that are more intuitive and understandable for a human being as well as for the models, because you also have to understand that fraud and money laundering and all these things have an element of legality associated with them, which means that in a court of law you will actually have to be able to prove that this thing is what it is. I can't just take feature x and say, "This feature x said this is fraudulent, hence we are filing charges against you." That doesn't necessarily happen.
Audience Member:
So you basically want to build into the features [inaudible] with [inaudible]-
Ashrith Barthur:
We want to build it into the features. We don't necessarily want to analyze them, but if there is a point where you need proof of why we classified a certain behavior, there is an intuitive feature for you to use. That's essentially how it is. Yes?
Audience Member:
So does that mean dimensionality reduction techniques are a no-go?
Ashrith Barthur:
Yes. We do not go towards that at all, unless and until we are in such bad shape that none of the features we are building make any sense. We have never ended up in that situation, and the reason is that what we're looking at here is malicious behavior, which means that there is a human element in it. Human beings are the only ones who do this. Systems don't do it, and every other living organism probably doesn't do it. I don't know. I don't want to say that.
Audience Member:
[crosstalk].
Ashrith Barthur:
Fair enough. Yeah, see, there you go, which is why I said I don't want to say that. We know humans act in a certain way, which is one of the reasons that if we build features that can identify and classify these behaviors, it's comfortable for us to explain it. Yes?
Audience Member:
Do you incorporate different costs for false positives versus false negatives?
Ashrith Barthur:
Yes, we do. Yes.
Audience Member:
How do you think about the cost of a false positive these days?
Ashrith Barthur:
There are actually many different cost factors that we associate with it. One is for the false negative, which is actually much higher than for the false positive. Our models tend to be ... We usually try and go for the [inaudible], try to reduce false negatives as much as possible, but the cost factor for a false positive is high enough that we don't necessarily run into any kind of trouble, so that's essentially how we design it.
Audience Member:
So you're not just outputting a confidence that the client will then consume and interpret as they want, you're giving an actionable-
Ashrith Barthur:
Yes, yeah, and there is a variable component to the cost factor as well. That variable component comes from certain risk coefficients that are associated with certain primary features. For example, if the transaction amount is super high, we tend to assign a reasonably high cost factor to it, so there is a variable component to it as well.
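A sketch of one way to encode that asymmetric, partly variable cost in training, assuming scikit-learn: false negatives are weighted more heavily than false positives, with an extra weight proportional to the transaction amount. The 10:1 base ratio and the amount scaling are assumptions, not figures from the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_cost_sensitive(X: np.ndarray, y: np.ndarray, amounts: np.ndarray):
    """y = 1 for fraud. Missing a fraud (false negative) is costed higher
    than a false alarm, and large transactions carry extra weight."""
    base_weight = np.where(y == 1, 10.0, 1.0)      # assumed 10:1 cost ratio
    variable = 1.0 + amounts / amounts.mean()      # bigger amount, bigger cost
    sample_weight = base_weight * np.where(y == 1, variable, 1.0)
    model = RandomForestClassifier(n_estimators=200)
    model.fit(X, y, sample_weight=sample_weight)
    return model
```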
Audience Member:
So, when you say cost, you mean the cost matrix?
Ashrith Barthur:
Yes, so basically if the model is predicting a lot of false positives, or if it's ... Essentially what we do is we penalize the model, and that's the cost factor the gentleman was talking about.
Audience Member:
Just curious, [inaudible] but you also customize it as per your needs?
Ashrith Barthur:
No, no. [inaudible] is always what is used, because we do not want to be an organization where we miss things, so we try to reduce false negatives as much as possible. We use [inaudible], but even then we have a larger cost factor for everything that the model gets wrong, and that's essentially what we're tuning on.
All right, any other questions? If not, I think ... okay. We are at 7:40, so maybe it's time to ... Yeah. Thanks, guys.