The role of streaming machine learning in encrypted traffic analysis
Organizations now create and move more data than ever before. Network traffic continues to increase: global internet bandwidth grew by 29% in 2021, reaching 786 Tbps. In addition to record traffic volumes, 95% of traffic is now encrypted, according to Google. As threat actors continue to evolve their tactics and techniques (for example, hiding attacks in encrypted traffic), securing organizations is becoming more challenging.
To help address these problems, many network security and operations teams are relying more heavily on machine learning (ML) technologies to identify faults, anomalies, and threats in network traffic. But as encrypted traffic becomes the norm, traditional ML technologies need to evolve as well. In this article, I’d like to look at the types of ML models being used today and explore how they can be paired with Deep Packet Dynamics (DPD) technology to gain visibility into threats that could be hidden in encrypted traffic.
To be successful with ML, NOC and SOC teams need three things: data collection, data engineering and model scoring.
Data collection involves extracting metadata directly from the network packet stream. Data engineering is the process of moving raw data to the right place and transforming it for input to a model. This includes tasks such as data standardization and feature creation. Model scoring is the final stage where ML algorithms are applied to the data. This includes the necessary steps of training and testing models.
Historically, ML has relied on batch models. With garden-variety big data, traditional data pipelines work quite well. Models are trained offline using historical, retrospective data. Later, they are deployed on data that has been saved for analysis.
It works something like this: First, the team creates a highly engineered data pipeline to port all data back into a massive data lake. Next, historical features are created by running queries and pre-processing scripts. Finally, the models are trained on the large collection of data. Once ready, the trained model is moved to production, which requires translating every data processing step to an outward-facing application.
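The batch workflow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields (bytes, packets), the bytes-per-packet feature, and the mean/standard-deviation "model" are hypothetical stand-ins, not any particular product's pipeline.

```python
import statistics

# Assumption: flow records have already been ported into the "data lake",
# represented here by a plain list of dictionaries with hypothetical fields.
data_lake = [
    {"bytes": 1000, "packets": 10},
    {"bytes": 1100, "packets": 10},
    {"bytes": 900, "packets": 10},
    {"bytes": 1000, "packets": 10},
]


def make_feature(record):
    """Feature creation: average bytes per packet for one stored record."""
    return record["bytes"] / record["packets"] if record["packets"] else 0.0


def train(records):
    """Offline training: summarize the whole historical data set at once."""
    features = [make_feature(r) for r in records]
    return {"mean": statistics.fmean(features), "std": statistics.pstdev(features)}


def score(model, record):
    """Scoring: distance from the historical mean, in standard deviations."""
    if model["std"] == 0:
        return 0.0
    return abs(make_feature(record) - model["mean"]) / model["std"]


model = train(data_lake)
print(score(model, {"bytes": 5000, "packets": 10}))  # unusually large packets
```

Note that make_feature has to run identically in training and in production, which is the translation burden the paragraph above mentions.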
The cost of storing and processing heavy data like network data can be prohibitive. (Heavy data here means “big” data that requires specialized tooling for storage and processing and is not stored in traditional database record formats.) This method of ML requires significant scaling and resources. It remains useful for model development and for predictive models with a long time horizon.
However, as network traffic has grown, a newer alternative called streaming ML has emerged. Rather than saving traffic for later analysis, it processes data as it arrives. It uses a much smaller resource footprint while exceeding the performance requirements of the highest-bandwidth networks. When combined with encrypted traffic analysis, it gives organizations a powerful tool that provides visibility into network threats. Historically, teams looked into network traffic using Deep Packet Inspection (DPI), but as more of that traffic is encrypted, DPI is becoming less and less useful. This has driven the market to a new technology called Deep Packet Dynamics (DPD), which offers a rich metadata set produced without payload inspection.
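To make the contrast with batch processing concrete, here is a minimal sketch of one streaming technique: an online anomaly score that keeps only a running mean and variance (Welford's algorithm), so nothing needs to be stored. The class name, the z-score approach, and the sample values are my own illustrative assumptions; real streaming ML engines use richer models.

```python
import math


class StreamingZScore:
    """Running mean/variance via Welford's algorithm, using constant memory."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def score(self, x):
        """How many standard deviations x lies from the running mean."""
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def score_then_update(self, x):
        """Score against what has been seen so far, then learn from x."""
        s = self.score(x)
        self.update(x)
        return s


detector = StreamingZScore()
for value in [10, 11, 9, 10, 12, 10, 11, 100]:  # hypothetical per-flow metric
    print(value, round(detector.score_then_update(value), 2))
```

Each observation is scored and then discarded, so memory use stays flat no matter how much traffic passes through.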
DPD features include traffic characteristics such as producer/consumer ratio, jitter, RSTs, retransmits, sequence of packet lengths and times (SPLT), byte distributions, connection setup time, round-trip time, and more. These features are well suited to ML and are effective in identifying patterns and anomalies that simple and enhanced approaches fail to catch. But they cannot be computed retrospectively; they must be captured in real time as the traffic streams through. This form of traffic analysis also preserves privacy, because it eliminates the need for the processing-intensive man-in-the-middle (MITM) technique of decrypting and inspecting traffic.
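As a rough illustration of how a few of these features could be accumulated packet by packet without touching payloads, consider the sketch below. Everything in it is an assumption made for illustration: the FlowFeatures class, the (timestamp, direction, length) input, the producer/consumer ratio taken as (bytes out - bytes in) / (bytes out + bytes in), and jitter taken as the mean absolute change between consecutive inter-arrival gaps. Actual DPD implementations may define these differently.

```python
class FlowFeatures:
    """Incrementally tracks a few metadata features for a single flow."""

    def __init__(self, splt_len=8):
        self.splt_len = splt_len  # how many leading packets SPLT keeps
        self.bytes_out = 0
        self.bytes_in = 0
        self.splt = []            # (signed length, gap since previous packet)
        self._last_ts = None
        self._last_gap = None
        self._jitter_sum = 0.0
        self._jitter_n = 0

    def on_packet(self, ts, outbound, length):
        """Update features from one packet's timestamp, direction, and size."""
        if outbound:
            self.bytes_out += length
        else:
            self.bytes_in += length
        gap = 0.0 if self._last_ts is None else ts - self._last_ts
        if len(self.splt) < self.splt_len:
            # The sign encodes direction: positive is outbound, negative inbound.
            self.splt.append((length if outbound else -length, gap))
        if self._last_ts is not None:
            if self._last_gap is not None:
                self._jitter_sum += abs(gap - self._last_gap)
                self._jitter_n += 1
            self._last_gap = gap
        self._last_ts = ts

    def producer_consumer_ratio(self):
        """Ranges from -1 (pure consumer) to +1 (pure producer)."""
        total = self.bytes_out + self.bytes_in
        return (self.bytes_out - self.bytes_in) / total if total else 0.0

    def jitter(self):
        """Mean absolute change between consecutive inter-arrival gaps."""
        return self._jitter_sum / self._jitter_n if self._jitter_n else 0.0


flow = FlowFeatures()
for ts, outbound, length in [(0.0, True, 100), (0.5, False, 300),
                             (1.5, True, 100), (2.0, False, 300)]:
    flow.on_packet(ts, outbound, length)
print(flow.producer_consumer_ratio(), flow.jitter(), flow.splt)
```

Because the timing state lives only in the packet stream, features like these have to be computed as traffic passes; they cannot be rebuilt later from summary records that lack per-packet timestamps.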
By combining streaming ML with DPD, SOC and NOC teams can more easily detect advanced threats in real time. This approach can, for example, uncover ransomware attacks underway on the network, including lateral movement, advanced phishing and watering hole attacks, insider threat activity, and much more. It also eliminates encryption blindness and restores visibility for network defenders.
By 2025, nearly all network traffic will be encrypted. As encryption grows (along with new threats), organizations must rely more heavily on streaming ML (including machine learning engines) and encrypted traffic analysis to gain the necessary visibility into anomalous traffic. Without it, attackers will continue to bypass traditional security mechanisms, hide within encryption, and successfully complete attacks.