Machine learning fundamentals: What cybersecurity professionals need to know
In this Help Net Security podcast, Chris Morales, Head of Security Analytics at Vectra, talks about machine learning fundamentals, and illustrates what cybersecurity professionals should know.
Here’s a transcript of the podcast for your convenience.
Hi, this is Chris Morales and I’m Head of Security Analytics at Vectra, and in this Help Net Security podcast I want to talk about machine learning fundamentals that I think we all need to know as cybersecurity professionals.
AI is being used more and more within our industry, and here at Vectra we are an AI company as well. As you start to hear more about AI, you have to start asking yourself what it really is and what makes a machine intelligent. In the next ten minutes I just want to give a quick overview so that you can understand some of the principal operations and applications of how machine learning is applied to build AI, along with a quick understanding of the different algorithms and when you need to use certain algorithms for specific jobs.
There has always been a very muddled use of the terms artificial intelligence, data science and machine learning. For simple definitions, the term artificial intelligence is applied when a machine mimics cognitive functions that humans associate with other human minds. These are functions such as learning and problem solving.
Where it gets confusing is that AI can exist without machine learning. Just a couple of decades ago, we used to talk quite often about something we call an expert system. Data science doesn’t need to use machine learning either; machine learning is really just the use of algorithms to extract meaning out of data. Machine learning and expert systems differ in the quantity of human knowledge needed and in how they are used. What I mean by that is that, in an expert system, the full knowledge of the expert is digitized and used in the decision making, and the machine ends up looking like a series of if-then statements.
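To make the if-then picture concrete, here is a minimal sketch of what an expert system might look like in Python. The event fields (`domain`, `bytes_out`, `hour`, `failed_logins`), the thresholds and the domain name are all hypothetical, chosen only for illustration. The point is that every rule is written by a person and nothing is learned from data.

```python
# Hypothetical expert-system sketch. Every rule is hand-written by a human
# expert; field names, thresholds and the domain are illustrative assumptions.

KNOWN_BAD_DOMAINS = {"known-bad.example"}


def expert_system_verdict(event: dict) -> str:
    """Classify a network event with a fixed chain of if-then rules."""
    # Rule 1: the destination is on a hand-maintained list.
    if event.get("domain") in KNOWN_BAD_DOMAINS:
        return "block"
    # Rule 2: a large upload in the middle of the night.
    if event.get("bytes_out", 0) > 10_000_000 and event.get("hour", 12) < 5:
        return "alert"
    # Rule 3: repeated failed logins.
    if event.get("failed_logins", 0) >= 5:
        return "alert"
    # No rule fired.
    return "allow"


if __name__ == "__main__":
    print(expert_system_verdict({"domain": "known-bad.example"}))
    print(expert_system_verdict({"bytes_out": 50_000_000, "hour": 3}))
    print(expert_system_verdict({"domain": "intranet.example", "hour": 14}))
```

Anything the rules do not anticipate falls through to "allow", which is the limitation that motivates learning from data instead.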
Machine learning is all about having an algorithm learn from the data rather than having a human encode the logic, as in that example. With machine learning, people constantly talk about AI and how their jobs are going to be lost. The reality is that there is plenty of work for humans: they have to set up data, they have to label data, they need to select features for these algorithms, and then they have to tune the algorithms to achieve the desired results. In other words, it’s not magic.
In machine learning, a model is a function which learns to predict or classify by learning through input examples. We call these examples a data set, and each point in the data set is of the form (X, Y), where X is the input and Y is the output. What happens with machine learning is that the model goes through the entire set of data and learns from that data. If you then provide a new input, say X, the model can tell you the Y. The model doesn’t do this simply by memorizing the example data; it actually learns the relationship between each X and Y within the data. What we mean is that it starts to understand behaviors, what things look like and what they do, instead of what they are.
When we talk about machine learning, the two major types you will tend to see are supervised and unsupervised, and then there is also the concept of deep learning, and especially in security, these are the three that matter the most.
In supervised learning, if we talk about the Xs and the Ys, the data set has both Xs and Ys, which means that a data scientist already understands the inputs, the outputs and what the expected results are. They just want to train the model on that data so that it can find more of the same. In unsupervised learning there are no Ys provided for learning; the data that the data scientists work with has only Xs, which are the inputs, so they don’t know the outcome. In those scenarios, unsupervised learning has to learn new things in real time in certain environments.
Deep learning is a model of learning that can be applied to either supervised or unsupervised learning. It’s not specific to either, but it’s used to make the machine smarter, and we will go into that in a minute. I think the best thing here is to better understand the types of machine learning, so I will try to draw an analogy with real-life human learning examples. I’m not going to cover every type of algorithm, just some key high-level terms related to machine learning that we think are of interest and specific to cybersecurity.
The first kind is supervised learning. Going back to that, here is a real-world example. You’ve seen people smoke, and they get sick or they get cancer. From that observation you made as a human, you decide that neither you nor your kids will ever smoke, because you learnt in life that smoking kills. Mathematically, as a human you have observed a ton of data, which is lots and lots of smokers who get cancer and die, and you came up with a rule for classification. You decided that a certain characteristic means class A, else class B. That is a description of what we call supervised learning, and specifically classification, where what you are trying to do is predict a label.
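As a rough illustration of learning a classification rule from labeled examples, the sketch below fits about the simplest classifier there is: a single threshold on one feature (sometimes called a decision stump). The feature (failed logins per hour), the labels and the numbers are all made up for this example. What matters is that the "characteristic means class A, else class B" rule comes out of the data rather than being typed in by a person.

```python
# Minimal supervised classification sketch: learn one threshold from
# labeled (x, y) pairs. The data below is invented for illustration.

def fit_threshold(xs, ys):
    """Return the threshold t that makes the rule 'x > t means class 1'
    disagree with the fewest labels."""
    best_t, best_errors = None, len(xs) + 1
    for t in sorted(set(xs)):
        # Count examples where the rule's prediction differs from the label.
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def predict(threshold, x):
    """Apply the learned rule to a new, unseen input."""
    return 1 if x > threshold else 0


if __name__ == "__main__":
    # x: failed logins per hour (hypothetical), y: 0 = benign, 1 = malicious
    xs = [0, 1, 2, 3, 10, 12, 15, 20]
    ys = [0, 0, 0, 0, 1, 1, 1, 1]
    t = fit_threshold(xs, ys)
    print("learned threshold:", t)
    print("prediction for 18 failed logins:", predict(t, 18))
```

Real classifiers combine many features and far more data, but the shape is the same: labeled Xs and Ys go in, and a rule for predicting a label comes out.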
The other type of supervised machine learning is regression, which is predicting a quantity. Fundamentally, classification is about predicting a label and regression is about predicting a quantity, but they are both supervised. As an example of regression, say you get the data for house prices versus the area where all these houses are, and then you sit down and plot it on a graph. This is a pretty common thing people do when they want to buy a house. When you look at the data you see that it’s almost a straight line, so you draw a straight line. Now, when you look at this graph with all these houses placed on it, you can predict the price of a house as soon as someone tells you the area. You start to understand the right areas you want to live in or where you want to buy a house, because neighborhood A is million-dollar homes and neighborhood B is five-hundred-thousand-dollar homes, without knowing every single home. With the data you had, you can make predictions on future homes in those neighborhoods. This is exactly how real estate, insurance and things like that actually work: they predict these quantities. The difference between a human and a machine is that a machine does it in a more mathematical and formal way, where humans use natural intuition.
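The "draw a straight line through the points" step can be sketched with ordinary least squares. This is a minimal illustration that assumes "area" is read as floor area in square feet; the areas and prices are invented, and they are deliberately placed exactly on a line so the result is easy to check by hand, whereas real data would only be roughly linear.

```python
# Minimal regression sketch: fit price = slope * area + intercept by
# ordinary least squares. All numbers are invented for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(slope, intercept, area):
    """Predict a quantity (price) for an area that was not in the data."""
    return slope * area + intercept


if __name__ == "__main__":
    areas = [1000, 1500, 2000, 2500]            # square feet (hypothetical)
    prices = [150_000, 200_000, 250_000, 300_000]
    slope, intercept = fit_line(areas, prices)
    print(predict_price(slope, intercept, 1800))  # 230000.0
```

The output is a number rather than a label, which is the difference between regression and classification described above.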
Those examples were for supervised learning. The next area of machine learning that we care about in cybersecurity is unsupervised learning. When we talk about unsupervised machine learning we primarily focus on the concept of clustering. Clustering is creating groups, or clusters, of examples based on their similarities. To give you another real-world scenario for that and how we do it as normal people, let’s say that an advertising platform (we all understand advertising: Google does it, Facebook does it, all the major companies do) wants to segment the population into smaller groups with similar demographics and purchasing habits. They do this so that advertisers can reach a specific target market with relevant ads.
Vectra is a cybersecurity company, and we advertise with Google. We go to Google and say “hey, we are looking for people who are interested in cybersecurity or looking for incident response” or things like that. We give certain criteria, and then Google will take those criteria, start to cluster, take these groups of people and say “I don’t really know who these people are, but I know some habits about them, so I can make a guess that these people are interested in security”, and that’s what they give us. What happens here, from a machine perspective, is that mathematically you are faced with the task of grouping unlabeled data. Rather than defining the groups before looking at the data, clustering allows you to find and analyze the groups that form organically.
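One common way to group unlabeled data is k-means clustering, sketched below in plain Python. This is only an illustration of the idea, not a description of what any advertising platform or security product actually runs. The two features per point, the data and the simple way of picking starting centroids (evenly spaced samples from the list) are assumptions made for the example; production implementations usually use randomized or k-means++ initialization.

```python
# Minimal k-means sketch: group unlabeled points by similarity.
# There are no Ys here, only Xs. Data and initialization are illustrative.

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Return (labels, centroids) after a fixed number of k-means rounds."""
    n = len(points)
    # Assumption: start from evenly spaced points in the input list.
    centroids = [points[i * n // k] for i in range(k)]
    labels = [0] * n
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical (feature_1, feature_2) measurements for six users or hosts.
    points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
    labels, centroids = kmeans(points, k=2)
    print(labels)
    print(centroids)
```

Nobody tells the algorithm what the two groups mean. It only reports that the points fall into two clusters, and a person still has to interpret them.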
Then the last area, before I give a quick example of how this applies to security, is deep learning. The major thing we care about here is a concept called transfer learning, which means that I teach the machine a task and it reuses what it learned as a starting point for another model on a second task. In the real world an easy example is this: say your mom taught you how to choose oranges, and she sends you to the grocery store to buy apples, which you don’t know how to choose. You can use some common sense to make this decision. Or, if you know how to play tennis, it shouldn’t take you long to pick up squash. This is essentially what the HR department of a company calls a transferable skill, and it’s this skill set that allows us to use deep learning to start to learn new tasks that we have never solved, based on prior knowledge.
Just as a comment to tie this back to security, as we are coming to the top of our ten minutes here: all these examples of learning represent a fundamental shift in how we are able to perform security. Unlike a signature-based approach that delivers a one-for-one mapping of threats to countermeasures, data science and machine learning use the collective learning from all the threats observed in the past and the present, as things happen, to proactively identify new threats that haven’t been seen before.
Just for one more example, think of it as a student learning a new subject at school. Memorizing the answers to a test might result in a passing grade, but that approach misses the mark when it comes to learning how to solve a problem, and that’s where we have been in security with signatures. It’s a critically important distinction when using data science to detect threats.
Long-term, it’s essential to understand the what, when, why and how. Actual knowledge and intelligence are far more advantageous when evaluating and solving new problems that have not been encountered before. For the traditional signature model to continue working and securing us the way it has been, all the answers have to be known ahead of time, which is absolutely unrealistic. For example: the domain atme.com has been seen behaving badly in the past, therefore it’s bad.
Data science expects to be asked real questions, and it applies collective learning to evaluate an unknown. From the collective knowledge of threats gathered from the real world, it’s possible to identify that a domain is acting badly based on its behavior, rather than saying “I saw it in the past”. A scenario for that: atmeonetwothree.com has never been seen behaving badly in the past, but traffic to and from it is showing different behaviors. From the combination of those behaviors, the machine learns that this is actually command and control, with no prior knowledge.
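The contrast between the two approaches can be sketched in a few lines. The signature check below is just a lookup in a list of domains seen before. The behavior check looks at one trait often associated with command-and-control traffic, connections arriving at suspiciously regular intervals (beaconing). That trait, the hand-set jitter threshold and the timestamps are assumptions for illustration; in a real system a trained model would learn which combinations of behaviors matter instead of relying on one fixed threshold.

```python
# Signature lookup versus a behavior-based check. The beaconing heuristic
# and its threshold are illustrative stand-ins for what a model would learn.
from statistics import mean, pstdev

KNOWN_BAD_DOMAINS = {"atme.com"}  # the signature list: only what was seen before


def signature_match(domain):
    """Signature approach: bad only if this exact domain was seen before."""
    return domain in KNOWN_BAD_DOMAINS


def looks_like_beaconing(timestamps, max_jitter=0.1, min_events=5):
    """Behavior approach: flag connections spaced at near-constant intervals.

    timestamps are connection times in seconds, in ascending order.
    """
    if len(timestamps) < min_events:
        return False
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    average_gap = mean(gaps)
    if average_gap <= 0:
        return False
    # Low spread relative to the average gap means machine-like regularity.
    return pstdev(gaps) / average_gap <= max_jitter


if __name__ == "__main__":
    domain = "atmeonetwothree.com"
    connection_times = [0, 60, 121, 180, 240, 301]  # roughly every 60 seconds
    print("signature match:", signature_match(domain))
    print("behavior flag:", looks_like_beaconing(connection_times))
```

The never-before-seen domain passes the signature check but is still flagged by how its traffic behaves, which is the distinction the transcript is drawing.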
To learn more, go to vectra.ai, where we have a number of resources. There is a twenty-minute video from a presentation I gave at Infosecurity Europe about this topic, and on our blog one of our principal data scientists has been writing a series on understanding AI that explains the mechanics of how we do it and how we use it here. If you want to reach out directly, you can reach us at info@vectra.ai. With that, thank you and have a good day.