Machine Learning - for fun, profit and (in)security

Machine learning is not a silver bullet, but it is an incredibly powerful tool in both the defender's and the attacker's arsenal.

There is an immense shortage of cyber security talent, although some have said that shortage is a myth - a repeat of the classic pattern of employers expecting too much experience, refusing to pay market rates, and being unwilling to take on and train new talent.

This skills gap, or mismatch, is compounded by a rapidly changing attack surface that needs to be defended - IoT devices, cloud services and mobile add to the complexity of a difficult problem inhabited by determined adversaries ranging from hacktivists to criminal syndicates to nation states.

Machine learning has been heralded as a solution to this problem - cyber security is rife with big data examples: millions of samples of malware, billions of log file entries, trillions of packets of network data, hundreds of thousands of threat indicators. The prevailing theory is that (a) it's too much data for humans to churn through (true) and (b) machine learning should be able to move beyond the simple correlation rules offered by base SIEM technology to find the proverbial attacker needle in a haystack.

Taking time to learn about supervised vs. unsupervised machine learning, Markov models, baselining and anomaly detection will help you wrap your head around this space - figures as different as Elon Musk and Bill Gates are worried that artificial intelligence is an existential threat to humanity, and others like Mark Cuban believe that if you aren't learning about it you will end up a dinosaur on the job market.

Machine learning needs to be applied thoughtfully to cybersecurity - you need a good dataset of both positive and negative examples, otherwise it is difficult to train current systems. The challenges around sharing good data abound - legislation has been introduced to help make information sharing easier, but there are legitimate privacy concerns given recent rulings that even a solitary IP address can be considered personal information.

So, how can machine learning be applied to cybersecurity?

1. Advanced detection and correlation techniques

Using supervised and unsupervised machine learning to sort through reams of data to find both outliers (anomalies) and indicators of attack, modelled on how trained analysts looking at the same data find indications of compromise.
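As a rough illustration of the unsupervised side, a model can learn what "normal" looks like for a host and flag anything far outside that baseline. A minimal sketch in Python, assuming scikit-learn and entirely synthetic per-host features (logins per hour, megabytes uploaded, distinct destination ports) rather than data from any real product:

```python
# A minimal sketch of unsupervised anomaly detection over log-derived features.
# The feature names and synthetic data are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical per-host features: logins per hour, megabytes uploaded,
# distinct destination ports contacted.
normal = rng.normal(loc=[5, 20, 10], scale=[2, 5, 3], size=(500, 3))
suspicious = np.array([[40, 900, 300]])  # an obvious outlier for demo purposes
data = np.vstack([normal, suspicious])

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(data)

# -1 marks an anomaly, 1 marks an inlier.
labels = model.predict(data)
print("flagged rows:", np.where(labels == -1)[0])
```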

Using code and behavioural analysis at scale to counter the polymorphic malware techniques leveraged by attackers. It doesn't matter if an individual piece of malware is new: you can piece enough together from how it behaves on the system, or how similar or dissimilar its code is to other samples of known-good and known-bad files. Fundamentally this is at the core of most 'next-gen' anti-malware solutions, and is a natural extension of earlier mechanisms built on heuristic analysis and simple code and behavioural signatures.
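On the supervised side, a classifier trained on features extracted from known-good and known-bad samples can score files it has never seen before. A minimal sketch, assuming scikit-learn and invented static/behavioural features (entropy, suspicious sandbox API calls, packed sections) rather than what any specific product extracts:

```python
# A minimal sketch of supervised benign-vs-malicious classification from
# static/behavioural features. The synthetic training data stands in for a
# real feature-extraction pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical features per sample: file entropy, suspicious API calls seen
# in a sandbox, number of packed sections.
X_benign = rng.normal([5.0, 2, 0], [1.0, 1, 0.5], size=(300, 3))
X_malicious = rng.normal([7.5, 15, 2], [0.5, 4, 1.0], size=(300, 3))
X = np.vstack([X_benign, X_malicious])
y = np.array([0] * 300 + [1] * 300)  # 0 = benign, 1 = malicious

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("holdout accuracy:", clf.score(X_test, y_test))
```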

2. Analyst enrichment

The traditional security analyst workflow is to be handed an alert - a piece of malware popping up on an AV console, a correlation rule firing in a SIEM, a firewall rule trigger - and then to pivot from that data point to conduct additional investigation and eventually take appropriate remediation actions. With machine learning we can monitor analyst behaviour and start automatically providing relevant data without being asked, eventually moving on to suggesting remediation actions and then to automated remediation. We've also seen examples of interactive analyst assistance, like the IBM researcher attempting to create Havyn, a Jarvis-like assistant focused on cyber security.
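One simple way to provide relevant data without being asked is to match a new alert against past incidents and surface what analysts did last time. A minimal sketch, assuming scikit-learn and a hand-made set of past alerts and remediations:

```python
# A minimal sketch of alert enrichment by similarity to past incidents:
# find the historical alert whose text is closest to the new one and surface
# what analysts did about it. The incident records are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

past_alerts = [
    "trojan detected on finance workstation by AV console",
    "multiple failed RDP logins from external IP",
    "outbound traffic to known C2 domain from web server",
]
past_actions = [
    "isolate host, pull memory image, reimage workstation",
    "block source IP at firewall, force password reset",
    "sinkhole domain, inspect proxy logs for other hosts",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(past_alerts)
index = NearestNeighbors(n_neighbors=1, metric="cosine").fit(matrix)

new_alert = "AV console flagged trojan on accounting laptop"
_, idx = index.kneighbors(vectorizer.transform([new_alert]))
print("similar past incident:", past_alerts[idx[0][0]])
print("previous remediation:", past_actions[idx[0][0]])
```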

3. 'AI Ops'

One of the hottest catchphrases being thrown around security development circles these days is 'AI Ops' - the idea that you can use machine learning to automate operations activities, including remediation of security issues. Examples already in place are SIEM technologies that take the output of a threat indication (say, a malware infection) and have 'learned' a remediation process - reaching out to the workstation, killing the malicious process, opening and closing tickets - all without human intervention.
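A toy version of this idea is to learn from resolved tickets which remediation goes with which kind of alert, then select the action automatically for new alerts. A minimal sketch, assuming scikit-learn and invented ticket text and action names; a real deployment would hand the chosen action to a SOAR or EDR API rather than just printing it:

```python
# A minimal sketch of 'learning' a remediation playbook from historical
# tickets: predict the remediation applied to similar past alerts.
# Ticket text and action names are fabricated for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tickets = [
    "malware infection detected on workstation",
    "malware beaconing from laptop",
    "brute force login attempts against VPN",
    "password spraying against O365 accounts",
]
resolutions = [
    "kill_process_and_quarantine", "kill_process_and_quarantine",
    "lock_account_and_block_ip", "lock_account_and_block_ip",
]

playbook_model = make_pipeline(CountVectorizer(), MultinomialNB())
playbook_model.fit(tickets, resolutions)

new_alert = "malware infection on finance workstation"
action = playbook_model.predict([new_alert])[0]
print("automated remediation selected:", action)
# A real system would call orchestration APIs here and open/close the ticket.
```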

4. Proactive network defences

Organizations like ZenEdge are starting to apply machine learning to complex problems like active web application defence. Intrusion prevention systems, like WAFs, are another technology that has traditionally been very labour intensive, requiring massive ongoing tuning and training effort. Machine learning against large data sets is a natural way to reduce that tuning burden.
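A rough sketch of what that can look like: train a classifier on labelled HTTP requests using character n-grams, so obviously malicious payloads (SQL injection, XSS, path traversal) score differently from normal traffic. This assumes scikit-learn and a tiny, fabricated request set; it is not how any particular WAF vendor does it:

```python
# A minimal sketch of classifying HTTP requests as benign or malicious with
# character n-grams, the kind of model that could reduce manual WAF/IPS tuning.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

requests = [
    "GET /index.html",
    "GET /products?id=42",
    "GET /search?q=blue+widgets",
    "GET /products?id=42 UNION SELECT username,password FROM users",
    "GET /page?name=<script>alert(1)</script>",
    "GET /../../etc/passwd",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = benign, 1 = malicious

waf_model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
waf_model.fit(requests, labels)

# Score a request the model has not seen before.
print(waf_model.predict(["GET /search?q=1 UNION SELECT * FROM accounts"]))
```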

5. Forensics and discovery

Searching through massive hard drives for signs of data leakage is a very labour-intensive activity. By training machine learning algorithms to look for specific patterns of compromise and typical sensitive data patterns, the manual effort required to perform discovery can be massively reduced.
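A minimal sketch of that kind of triage, assuming scikit-learn and fabricated documents: cheap regexes catch obvious sensitive patterns, and a small text classifier flags less obvious files for an analyst to review:

```python
# A minimal sketch of discovery triage: regexes for obvious sensitive strings
# (card-number-like sequences), plus a simple text classifier trained on
# labelled examples. All documents below are fabricated.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

training_docs = [
    "quarterly marketing plan and campaign budget",
    "team offsite agenda and travel details",
    "customer SSN and date of birth export for audit",
    "payroll report with employee salaries and bank accounts",
]
labels = [0, 0, 1, 1]  # 0 = routine, 1 = sensitive

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(training_docs, labels)

def triage(text):
    """Return True if a document looks worth an analyst's attention."""
    if CARD_RE.search(text):
        return True
    return classifier.predict([text])[0] == 1

print(triage("customer export: 4111 1111 1111 1111"))
print(triage("notes from the weekly standup"))
```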

6. Vulnerability assessments/penetration tests

Combining output from traditional vulnerability scans with datasets around successful exploits and compromises has been used to reduce the time needed to demonstrate target penetration. Unfortunately, this is also a technique commonly used by attackers. Automated scanning for exposed hosts, mapping software and versions against exploit databases, and combining that with scripted attacks, fuzzing and other techniques has cut attacker time to compromise significantly - the time to compromise an exposed vulnerability falls from months to hours or even minutes.
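A toy example of that combination: join scan findings against a feed of known public exploits and sort by exploit availability, so the most quickly compromised (or most urgent to fix) hosts surface first. Hosts, CVE identifiers and the exploit lookup below are placeholders, not real scan results:

```python
# A minimal sketch of pairing vulnerability scan output with known-exploit
# data to prioritise findings. The scan results and exploit feed are
# hand-made stand-ins for real tooling output.
scan_results = [
    {"host": "10.0.0.5", "service": "smb", "cve": "CVE-2017-0144"},
    {"host": "10.0.0.7", "service": "http", "cve": "CVE-2021-99999"},
    {"host": "10.0.0.9", "service": "ssh", "cve": "CVE-2020-12345"},
]

# In practice this would come from an exploit database feed; here it is a
# fabricated lookup of which CVEs have a public, weaponised exploit.
public_exploits = {"CVE-2017-0144": "high", "CVE-2020-12345": "medium"}

def prioritise(findings):
    """Sort findings so those with known public exploits come first."""
    rank = {"high": 0, "medium": 1}
    return sorted(findings,
                  key=lambda f: rank.get(public_exploits.get(f["cve"]), 2))

for finding in prioritise(scan_results):
    exploit = public_exploits.get(finding["cve"], "none known")
    print(finding["host"], finding["cve"], "exploit availability:", exploit)
```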

In the never-ending battle between attackers and defenders, we know that attackers will take advantage of every technological advance possible - polymorphic malware, advanced evasion techniques, encrypted command-and-control channels, and soon machine learning techniques for network penetration, targeted spear phishing and more. Defenders need to up their game to deal with the speed and evolution these attackers will be able to drive - leveraging machine learning is one step in a long game.