Cybersecurity and AI with Ashrith Barthur
In this talk, H2O.ai Security Data Scientist Ashrith Barthur presents solutions to make cyberspace secure through feature-rich, robust, and yet lean machine learning-based algorithms. These algorithms help organizations identify malicious actors, intruders, and illegal system access by studying features that arise purely from system login behavior.
Ashrith Barthur, Security Scientist, H2O.ai
Read the Full Transcript
Hi everyone. I'm Ashrith, and I'm a security scientist at H2O. Well, the first question might be: what's a security scientist doing at H2O? The answer is that we're building things on top of the basic H2O stack, and that's where the business applications come into the picture; this fits into our micro-app architecture. Some of the problems we look at in network security are malicious threats from insiders and from external actors, distributed denial of service attacks, data loss, and user behavioral analytics. And you have to understand that most of the network security field is very paranoid, so they run rule-based algorithms. They don't want to miss a single incident, which is one of the reasons they refrain from using machine learning in their algorithms.
Cybersecurity Applications of Machine Learning and AI
And that is something I'm actually here to change. So what we see today is rule-based algorithms. We have a lot of experts who analyze data, who tell us what is right and what is wrong; they investigate different kinds of situations and then come back and say whether something was positive or not. Now, how can we change that? That's our big question. One of the reasons we need to change this is that the process takes time and is slow, but it is still justified because, as I said earlier, we don't want to miss any incident that might have happened.
So let's see what we are trying to do here. I'll be speaking about two use cases, and that will give you a perspective on how we are trying to fit machine learning into the process. Again, one of the issues is the consumption of time: a large amount of manpower is required and the process is very slow. This is a problem when you have lots of data coming from different security incidents, because you have to go through all of it in a short period of time and come back and say whether an incident actually happened or was a false positive. And there is a very limited number of people in this field. Frankly, there are a lot of professionals who operate on data, but only a limited number who can actually understand every incident in this field. And that's where the whole machine learning process comes into the system.
Identifying Anomalous Behavior and Outliers
So here's a simple distinction: most people assume that identifying malicious behavior in the field of network security is just identifying outliers. It's not exactly the same thing. Most of what we do is identify anomalous behavior, and anomalous behavior usually has no precedent. It could be a behavior that sits within your normal behavior, it could sit among the outliers, it could be anywhere across the chart; it's not readily identifiable. An outlier, on the other hand, has a very low probability but does exist in the distribution. That's the basic difference between outliers and anomalies when we look at network security incidents. So identifying anomalous behavior is actually quite difficult: you have to model your data in a way that lets you identify these behaviors. How do we go about doing this? The first thing you do is create something called a context.
Analyzing Data Under Specific Context
And the context is the primary idea under which you can identify different kinds of behaviors. A context is a scope, or a framework, under which you start analyzing your data. Let's take a simple example of people coming through turnstiles at a gate. If someone who is not expected to come into a building happens to walk in, how would you identify that the person who walked in was not allowed in the building? That's where you create a context: what time did he come in? Did he come in on a holiday? It looks very much like a rule-based system, but it is itself very flexible, and I'll tell you why very soon. So it looks very rule-based, but you use that same framing to make this work.
So what do you do? Let's take a simple example here. I know there's a lot of text, but I'll take you through it very quickly. Assume the use case you're working on is a Windows login use case where you have users logging in, and all you have is their logins and their successful login times. Could we identify which user is malicious and which is not? To do that, we have to create a context, and the context goes like this. With Windows, there are different kinds of users: you have system users, you have administrators, and you have actual system accounts which perform different kinds of tasks. That alone gives you an idea of how to separate your data: the moment you start separating it by kind of user, the data you're analyzing for these different groups is going to be different. The next thing is that Windows logins happen in different kinds: you have physical logins, network-based logins, remote logins, and terminal-based logins. That too adds to the context you create. Now, the interactions between these two, between the users and the different kinds of logins, create the context under which you can recognize which part of the behavior is anomalous. These interactions are what we study.
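The partitioning described above can be sketched very simply. This is a minimal illustration, not the actual H2O pipeline; the column names and values are hypothetical stand-ins for fields a Windows login log might carry:

```python
import pandas as pd

# Hypothetical login log. Each (user_type, login_type) pair is one "context":
# events are analyzed against that group's own baseline, not a global one.
logins = pd.DataFrame({
    "user_type":  ["admin", "system_account", "user", "admin", "user"],
    "login_type": ["physical", "network", "remote", "terminal", "network"],
    "hour":       [9, 3, 22, 10, 14],
})

# Partition the data into contexts and summarize each one separately.
for (user_type, login_type), group in logins.groupby(["user_type", "login_type"]):
    print(f"{user_type}/{login_type}: {len(group)} events, "
          f"mean login hour {group['hour'].mean():.1f}")
```

Once the data is split this way, a login at 3 a.m. can be scored against other logins of the same user type and login type, rather than against all traffic at once.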
Analyzing Use Case Data Clusters
So here, if you see, this is one of the use cases we worked on very early: the Windows use case, trying to identify anomalous behavior. And here you very simply see three clusters. This was very early in the work that we did. When we break the data down, we see that this one is an administrative user, that's a system user, and the final one is system accounts. Each of these data blobs, or clusters if you like, gives us the concept of the context I'm speaking about: you divide the data in a certain way and then analyze the data within that family, within that context. And if you were to add the type of login on top of this, you would see the data divide itself further, forming different clusters within these three user groups.
The problem with this is that, with all the algorithms we develop, with everything we do, we can only predict to a certain extent that a certain login could be malicious. We can't say with a hundred percent accuracy that it is malicious. And that is a problem, because what we are telling the business users is that we do not know exactly whether this is bad or not; we can only predict with a certain probability that it is bad. That's not a good thing, specifically when you're dealing with security people, because they want to know whether an incident actually happened or not. And for this, we actually bring in people. So if I go back to the slides...
So as I said here, we have experts and professionals, and we use them to help us understand the data. What we do is shortlist the data they need to analyze, and we use their help to understand which processes are valid to identify as malicious and which are not. And with their investigation, we get to know whether something we had identified was actually malicious or just a false positive.
Modifying Use Case Context
And so most of the work that these people do ties back to what I said earlier: your context is actually very flexible, in the sense that you can change it, you can modify it. This primarily happens when the different families in which you're building contexts turn out to be homogeneous. For example, a user who logs on remotely or through a network system could behave in a similar way in both cases, so you could merge those two contexts and treat them as one. You can also have different thresholds for different contexts, which tells us that certain behavior is normal in one context but not in another, and that helps us identify anomalies. So this process of supervision from the experts helps us understand how to vary these parameters and how we come to a conclusion.
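The per-context threshold idea can be sketched as follows. The scores, contexts, and the mean-plus-k-sigma rule are all illustrative assumptions; the point is only that the same raw score can be anomalous in one context and normal in another:

```python
import numpy as np

# Hypothetical anomaly scores, kept separate per context.
scores = {
    "network_login":  np.array([0.1, 0.2, 0.15, 0.9]),   # 0.9 stands out here
    "physical_login": np.array([0.7, 0.8, 0.9, 0.85]),   # 0.9 is ordinary here
}

def flag_anomalies(ctx_scores, k=1.5):
    """Flag scores more than k standard deviations above this context's mean."""
    mu, sigma = ctx_scores.mean(), ctx_scores.std()
    return ctx_scores > mu + k * sigma

for ctx, s in scores.items():
    print(ctx, flag_anomalies(s))
```

The identical score of 0.9 is flagged under `network_login` but not under `physical_login`, which is exactly the "normal in one context, not in another" behavior described above; expert feedback would then tune `k` (or merge contexts) per family.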
So this is one of the consoles that we are actually using as an information system for the experts and the professionals. What we provide here is information about the different things we feel they need to be alerted to. We identify different situations and we say: hey, here's an alert, here's a situation that's happening, can you investigate and tell us more about this? This is a system that Tony's team is designing for us, so that's something that might be interesting to look at. So where does this lead us?
Analyzing Data Logs and Identifying Correlation
So after we've done the whole supervision process, one of the spots we end up in, as I said earlier, is lots of data. We have multiple logs and lots of data that we need to analyze and correlate in some way. And that correlation is very important for us, primarily because most of the log data that comes out is not usually strong enough on its own for you to identify behaviors independently. When you correlate logs, usually across time, you can figure out what kind of event is happening, and that helps you identify incidents even better. I'll give you an example. Let's say you had a user login. I'm going with a login again, but this is a slightly different use case.
Let's say you had a user login which started with multiple failures and then one successful login. Then the machine this user connects to attempts to connect to a database server. Then you see a request made for data to be dumped out of the database, and the data gets moved onto the user's machine. Now, if you were to look at these events individually, you would see multiple login attempts and one final successful login. We all do that: we forget our password every 180 days, we get a new one, we try our old password, it never works, but finally we get in. That's the usual incident. The next thing you see is the connection to a database. We all work with data, so a connection to a database is normal, not a big deal. And then the data dump. Let's say you're creating a new table and just moving the data into it: perfectly fine.
And you're also pulling data down from your database to your local machine; I'm sure quite a few of us do that while analyzing data, provided it's not sensitive data. Now, if you look at these incidents separately, you see that there is nothing wrong: everything is fine. But when you put them all together, that is when it starts to make sense. That is when you realize: oh, this is probably an attack. And that is what we are trying to get to. We are trying to create the kind of intelligence that can capture this, that looks at all these logs, all this information that comes by, in such a way that you can figure out what is happening on your network. Then you have a good chance of identifying a malicious incident on your network. That's exactly what we are trying to design.
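The sequence above, each step benign on its own but alarming in combination, can be sketched as a simple ordered-pattern check over a time window. The event names and the matching logic here are invented for illustration, not the production correlation engine:

```python
from datetime import datetime, timedelta

# Illustrative event stream for one host, mirroring the example in the talk.
events = [
    ("2024-01-01 09:00", "login_failed"),
    ("2024-01-01 09:01", "login_failed"),
    ("2024-01-01 09:02", "login_failed"),
    ("2024-01-01 09:03", "login_success"),
    ("2024-01-01 09:10", "db_connect"),
    ("2024-01-01 09:15", "data_dump"),
]

# The suspicious combination: failed logins, then success, then a DB
# connection, then a data dump, all within a short window.
PATTERN = ["login_failed", "login_success", "db_connect", "data_dump"]

def matches_pattern(events, pattern, window=timedelta(hours=1)):
    """True if the events contain the pattern, in order, within the window."""
    parsed = [(datetime.strptime(t, "%Y-%m-%d %H:%M"), e) for t, e in events]
    idx, start = 0, None
    for ts, name in parsed:
        if name == pattern[idx]:
            start = start or ts          # clock starts at the first match
            if ts - start > window:
                return False             # sequence too spread out in time
            idx += 1
            if idx == len(pattern):
                return True              # full sequence observed
    return False

print(matches_pattern(events, PATTERN))
```

Any prefix of the sequence (say, just the failed logins and the success) does not match, which is the point: only the correlated whole raises the alert.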
So what do we learn here? As I said earlier, your anomalous behavior can be very well embedded in your normal behavior. Correlation helps us identify these anomalous behaviors: using this kind of log-event correlation, we find that we can identify them by observing a certain behavior that is the combination of these events in a larger context. That's when we figure out that, oh, this is actually anomalous, and that is the primary way anomalous behavior is identified. It's not just outlier detection, is what I want to say again.
Finding the Right Context to Identify Anomalous Behavior
So, just trying to summarize what I've said: what are we trying to find here? We're trying to identify the right context in which to identify anomalous behavior. One of the reasons anomalous behavior is interesting is that most of the hacks we see these days are not the amateur kind. It's not people who take a script, run it on a computer, see if they can connect to your machine, and download whatever they want. Now it's organized: people who are well funded, who know how to break into your machines, and who do it very quietly and, I must say, really well.
And identifying how we can correlate logs is another important thing we have learned. If you can transform anomalous behavior into some kind of statistical model, you can identify it, and that's a good thing. You also have the blessing of the experts, so that adds value as well. So that's probably a good description of what I wanted to say. I do want to thank everyone who has come by: all the support people and the open source members of H2O, our H2O people themselves, and finally our clients. I really appreciate everyone; thanks for making this happen. Before I close, these are the three people I work with. This is Mark Chan, literally a ninja. This is Ivy Wang; she's the one who designs our interface. And that's Fonda Ingram; she's the one who helps us understand our entire requirements. So I want to thank the entire team out here, and that's it. Any questions? Anything at all? That'd be great. Thanks.