Using data science to improve threat analysis | AT&T ThreatTraq


Every week, the AT&T Chief Security Office produces a set of videos with helpful information and news commentary for InfoSec practitioners and researchers. I really enjoy them, and you can subscribe to the YouTube channel to stay updated. This is a transcript of a recent feature on ThreatTraq. The video features Jaime Blasco, VP and Chief Scientist, AT&T Cybersecurity, Alien Labs; Brian Rexroad, VP, Security Platforms, AT&T; and Matt Keyser, Principal Technology Security, AT&T.

Jaime: Today we are going to talk about how machine learning is being applied in cybersecurity. We will also be discussing how data science can be used to improve threat analysis and threat detection.

Brian: All right, Jaime. Based on the discussion that we already had, maybe you can take us a little deeper into how you are working with, you know, data science and machine learning in the area of threat detection and threat analysis.

Jaime: Absolutely. So one of the things that I want to start with is clarifying some misconceptions. In the cybersecurity industry, you're seeing many players talking about using AI and machine learning. You're going to see people using those two terms in the same context, but I want to clarify a little bit what they mean. For me, artificial intelligence is the broad field, and within artificial intelligence we can talk about general artificial intelligence and narrow artificial intelligence. General artificial intelligence is something that doesn't exist yet. Right? We haven't been able to create an artificial intelligence that is able to generalize and reason as well as or better than humans. So, when we talk about narrow AI, that's what machine learning is. It uses models that are able to solve a particular, really well-defined problem.

Matt: Right now, we have a very narrow definition of functional artificial intelligence. And machine learning is one version of that, one technique that might be used to teach a machine how to solve a problem.

Brian: You know what, I think the next stage that we need to get to is using artificial intelligence to figure out how to apply artificial intelligence. I mean, quite frankly, that's where it has to be, and it's going to continue to be iterative to get deeper and deeper.

Jaime: I totally agree. If you see some of the latest research from Google and others, the field of AutoML is really popular, with a lot of investment happening. For those of you who don't know what AutoML is, as Brian said, it's basically training a neural network to come up with new neural networks or novel architectures.

Brian: That will be the path to singularity in my opinion.

Jaime: So we can divide machine-learning techniques into two main categories: supervised machine learning and unsupervised machine learning. There's a third one, reinforcement learning, that we are not going to talk about today because I still haven't seen many use cases within cybersecurity. We talk about unsupervised machine learning in the area of anomaly detection or data exploration. And a point that I want to make there is that we have many cybersecurity products out there that are applying unsupervised learning, including clustering, anomaly detection, etc. I'm not a huge fan of those algorithms in the cybersecurity context because they are prone to many false positives.

Matt: Things that are just clustering and finding things that are similar won't necessarily find you something malicious. That's when you need to apply a model trained on data where someone else has already gone through and done the work and shown the model, "This is what you're looking for."

Jaime: The most successful problems for machine learning models are going to be those where you have access to a very high-quality data set that has been curated and labeled, and where you can apply a supervised learning model that can make predictions based on that data set that you have trained on.

Matt: To me that raises an interesting question, because I'm thinking about how you guys have done the work on your side to train it for your particular data sets. The existence of a really good training data set kind of implies the existence of an expert in that data set already. So someone just had to take a look at it and categorize it and tag it up.

Do you find that categorizing and tagging up a data set from, say, one organization gives you a good view into...if you were to use that exact same model in a different organization? Like, take two Fortune 500 companies with very different networks and very different structures and activities. Do you find that the model from one applies well or is there extra training that has to occur before it becomes useful?

Jaime: I will say it highly depends on the use case. And, when thinking about that problem, one of the approaches that some companies are taking is using their customers to train those data sets. So we were talking about how not many people are really qualified to train these models. Right? So, many customers out there are going to be making bad decisions and, in the end, you're going to end up with a model that is trained with data that is noisy or has incorrect labels. In the end, that's going to be even worse. So you really need to make sure that the quality of your training data is extremely, extremely good. Otherwise, it's going to be hard.

Brian: Yeah. I think this kind of feeds into that concept of talent amplification that we talked about a little bit earlier.  It is, "How does the data get labeled?" The talent that knows how to find the relevant things labels it. Now I think there is the notion of active labeling, that is, you don't have to have a historical data set that is labeled. You can have your current data set being labeled by folks that are doing analysis and providing a feedback mechanism into the machine learning so that future events that look like that will be flagged as potential security events.

And then it's an iterative process. That is, you need to have the talent continuing to feed back into the machine learning so it starts to learn what the talent recognizes as being relevant. There are tools that can be used to solve these problems, but it requires technical skills to apply those tools and solve those problems today. It's not as if you can just stick somebody in front of it and they figure out how to solve the problem.
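The feedback loop Brian describes can be illustrated with a minimal Python sketch. It is not taken from the video: the `FeedbackDetector` class, the feature names, and the 0.5 similarity threshold are all hypothetical, and a simple nearest-neighbor lookup stands in for whatever model a real deployment would use. The point is only to show analyst verdicts feeding back into what gets flagged next.

```python
from dataclasses import dataclass, field


def jaccard(a: frozenset, b: frozenset) -> float:
    """Similarity between two feature sets: size of overlap / size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class FeedbackDetector:
    """Flags events that resemble ones an analyst already labeled malicious."""

    threshold: float = 0.5  # hypothetical cut-off; would need tuning on real data
    labeled: list = field(default_factory=list)  # (features, is_malicious) pairs

    def add_label(self, features: set, is_malicious: bool) -> None:
        # Analyst feedback: every verdict becomes a new labeled example.
        self.labeled.append((frozenset(features), is_malicious))

    def flag(self, features: set) -> bool:
        # 1-nearest-neighbor: reuse the verdict of the most similar labeled
        # event, but only when it is similar enough to be trusted.
        if not self.labeled:
            return False
        feats = frozenset(features)
        best_features, best_label = max(
            self.labeled, key=lambda item: jaccard(feats, item[0])
        )
        return best_label and jaccard(feats, best_features) >= self.threshold


detector = FeedbackDetector()
event = {"powershell", "encoded_command", "outbound_443"}
before = detector.flag(event)  # nothing has been labeled yet

# An analyst reviews two earlier events and records a verdict for each.
detector.add_label(
    {"powershell", "encoded_command", "outbound_443", "new_service"}, True
)
detector.add_label({"chrome", "outbound_443"}, False)
after = detector.flag(event)  # now resembles a known-malicious event

print(before, after)
```

Before any labels exist the detector flags nothing; once the analyst marks a similar event as malicious, the same event is flagged. That is the iterative loop in miniature, and it also shows why label quality matters: a wrong verdict would be replayed on every similar event.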

Jaime: I wanted to share with you one particular example of how we are using machine learning today. At AT&T Cybersecurity we collect a huge amount of malware samples per day. We have a system that automatically collects those samples and performs static analysis, that is, "What do we know about the file? What can we learn from the file without actively executing that file?" And then we have a second system that is doing what we call dynamic analysis. That is, executing the malware sample in a sandbox and then analyzing what happens when we execute that sample.

As you can imagine, that's a huge amount of data. Right now we are doing about 200,000 samples per day. And we are storing all that information and, thanks to that, we are using machine learning in a couple of places there. One of them is actually using unsupervised learning to cluster malware families. So, to give you an example, if I had to give my team the task of analyzing 200,000 samples every day and they had to go and manually verify whether each one is a new malware family, I would really need like a thousand threat analysts.

We are actively using some unsupervised techniques to basically generate clusters of activity. At the end of the day, what I give my team is a set of, let's say, 10 clusters of malware samples that exhibit a similar behavior. Right? As Brian was describing, a human will have a really hard time doing this because it's high dimensional data. You're talking about thousands of different features that you have to look at and a machine is really good at establishing those relationships. And this particular technique has been saving us a lot of time.
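To make the clustering idea concrete, here is a minimal Python sketch. It is an illustration only, not the system Jaime describes: the sample names, the behavior features, and the 0.6 threshold are invented, and a simple threshold-based single-linkage grouping on Jaccard similarity stands in for whatever algorithm is actually used. Samples whose observed behaviors overlap enough end up in the same cluster.

```python
def jaccard(a: set, b: set) -> float:
    """Similarity between two behavior sets: size of overlap / size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_samples(samples: dict, threshold: float = 0.6) -> list:
    """Group samples whose behavior sets are at least `threshold` similar.

    Single-linkage: if A resembles B and B resembles C, all three share a
    cluster. A union-find structure tracks which samples have been merged.
    """
    names = sorted(samples)  # sorted for a deterministic result
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative, compressing paths.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(samples[a], samples[b]) >= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


# Hypothetical behaviors observed for five samples.
samples = {
    "sample_a": {"writes_run_key", "dns_lookup_c2", "drops_exe", "http_post"},
    "sample_b": {"writes_run_key", "dns_lookup_c2", "drops_exe"},
    "sample_c": {"encrypts_files", "deletes_shadow_copies", "drops_note"},
    "sample_d": {"encrypts_files", "deletes_shadow_copies", "drops_note", "http_post"},
    "sample_e": {"reads_clipboard"},
}

clusters = cluster_samples(samples)
for members in clusters:
    print(members)
```

The pairwise comparison here is quadratic in the number of samples, so it would not scale as written to the volume mentioned above; it is only meant to show how an analyst ends up reviewing a handful of clusters instead of every sample.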

Brian: And from that, you would get more consistency from the machines than if you have an analyst doing it. They're probably going to weigh things differently … "Oh, this kind of looks like this," but maybe there are other attributes that are hidden that are really like that. At least you'll get consistency in how they get clustered over time when a machine is doing it.

Jaime: Absolutely. And that's actually good, as you said. You need some consistency there, especially when you are looking at writing IDS signatures out of those behaviors. The second example I wanted to share with you is actually using the same data set. A problem that we are looking at right now is deciding when to execute malware samples in a sandbox. It's actually costly in terms of resources and in terms of time because you may want to leave those samples executing in there for a few minutes or sometimes longer. So you have to have enough computer resources to store them. Being able to predict the behavior of a malware sample is valuable.

As you can imagine, we can use that data set of all the static features that we extract and previous dynamic analysis to see what happens when we execute the sample. So in the end, we can create a model that can say, "Based on previous observations, the probability of this sample connecting to the internet is 90%," or, "Based on the static features, the probability of this sample actually exploiting a particular vulnerability in the endpoint is going to be X." So, that way, it's actually letting us use those resources in a smarter way because we can avoid many of those analyses that are not going to surface the malicious behaviors that we are looking for.
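The prediction step can be sketched in Python as follows. This is a toy under stated assumptions, not the model described in the video: the three static features, the six training rows, and the 0.5 decision threshold are invented for illustration, and a small hand-written logistic regression stands in for whatever supervised model is actually used. It learns from past sandbox runs whether a sample contacted the network, then estimates that probability for a new sample from static features alone.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # gradient of the log-loss with respect to z
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict_proba(model, x) -> float:
    """Estimated probability that the sample shows the behavior when run."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def should_sandbox(model, x, threshold=0.5) -> bool:
    # Hypothetical policy: spend sandbox time only on likely-interesting samples.
    return predict_proba(model, x) >= threshold


# Hypothetical static features: [imports_network_api, has_embedded_url, is_packed]
rows = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
]
# Label from earlier dynamic analysis: did the sample connect to the internet?
labels = [1, 1, 1, 0, 0, 0]

model = train_logistic(rows, labels)
for candidate in ([1, 1, 0], [0, 0, 0]):
    print(candidate, round(predict_proba(model, candidate), 2),
          should_sandbox(model, candidate))
```

In this invented data the first feature fully determines the label, so the model learns to send samples that import a network API to the sandbox and skip the rest. Real static features are far noisier, which is why the quality of the labeled history matters so much.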

Brian: So in that case, you're predicting the behavior of something you have in hand. It's not as if you're trying to predict the future. Right?

Jaime: Right.

Brian: Very good. And those are very good applications of machine learning technology that really demonstrate a lot of the topics we discussed earlier in terms of talent amplification, scalability, consistency, or accuracy in the analysis.

Matt: Cool. I feel like we're on the right track and that's a comforting thought, that we understand that there are limitations to what can be solved with ML but we're also applying it in interesting and unique ways that hopefully will pay off.
