The 3 Rs of visibility for any cloud journey
Dealing with an incident requires more than prompt notification. Responders must also be able to triage the cause, carry out forensics, identify which other systems, users, devices and applications have been compromised or affected, and determine the magnitude of the incident and the duration of the activity that led to it, among many other factors.
In other words, notification of an incident is simply the first step in a complex journey that could end in unearthing a major cyber breach or in writing off a completely benign non-incident.
While Security Orchestration, Automation and Response (SOAR) solutions help automate and structure these activities, the activities themselves require telemetry data that provides the breadcrumbs needed to scope, identify and potentially remedy the situation. This takes on increasing significance in the cloud for a few reasons:
- The public cloud shared responsibility model for security may lead to gaps in the telemetry (e.g., lack of telemetry from the underlying infrastructure that could help correlate breadcrumbs at the infrastructure level with those at the application level).
- Lack of consistency in telemetry information as applications increasingly segment into microservices, containers and Platform-as-a-Service, and as various modules come from different sources such as internal development, open source, commercial modules, and outsourced development.
- Misconfigurations and misunderstandings as control shifts between DevOps, CloudOps and SecOps.
- All of the above, coupled with a significant expansion of the attack surface as monolithic applications are decomposed into microservices.
When incidents occur, the ability to quickly size up the scope, impact and root cause of the incident depends directly on the availability of quality data and on how easily that data can be queried, analyzed, and dissected. As companies migrate to the cloud, logs have become the de facto standard for gathering telemetry.
The challenges when relying almost exclusively on logs for telemetry
The first issue is that many attackers turn off logging on the compromised system to cloak their activity and footprint. This creates gaps in telemetry that can significantly delay incident response and recovery initiatives. On occasion, DevOps teams may also reduce logging on end systems and applications to reduce CPU usage (and associated costs in the cloud), leading to additional gaps in telemetry data.
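One way to make such gaps visible is to look for unusually long silences between consecutive log entries from a host. The sketch below is only an illustration: the function name `find_log_gaps`, the five-minute threshold, and the sample timestamps are all assumptions, and a real deployment would tune the threshold to each source's normal logging rate.

```python
from datetime import datetime, timedelta

def find_log_gaps(timestamps, max_silence=timedelta(minutes=5)):
    """Return (start, end) pairs where consecutive log entries
    are further apart than max_silence."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_silence:
            gaps.append((earlier, later))
    return gaps

# Illustrative data: a host logs every minute, then goes quiet for 42 minutes.
base = datetime(2024, 1, 1, 12, 0)
sample = [base + timedelta(minutes=m) for m in (0, 1, 2, 3, 45, 46)]

gaps = find_log_gaps(sample)
for start, end in gaps:
    print(f"no log entries between {start} and {end}")
```

A silence like this does not prove that logging was disabled, but it tells the responder where the log record cannot be relied on.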
A second issue is that logs tend to be voluminous and, in many cases, written by developers for developers, leading to too much and perhaps irrelevant telemetry data. This drives up the cost of storing and indexing that data, and leads to longer query times and more effort on the part of the incident responder sifting through it.
Finally, log levels can be increased or decreased, but ultimately the log messages themselves are predefined because they are embedded in code. Changing what information the logs emit is not something that can be done in real time or near real time in response to an incident. It may require code changes, leading to significant delays and impaired incident response capability.
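The distinction between log level and log content can be seen with Python's standard `logging` module. In this minimal sketch, the logger name `payments_demo`, the function `process_order`, and the sample IP address are invented for illustration. Raising verbosity at run time reveals more of the messages the developer already wrote, but the source IP, which was never placed in any message, stays out of the logs until the code itself is changed.

```python
import io
import logging

# Capture log output in memory so it can be inspected.
stream = io.StringIO()
logger = logging.getLogger("payments_demo")
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.INFO)

def process_order(order_id, source_ip):
    # The messages are fixed in code; source_ip is never logged.
    logger.debug("processing order %s", order_id)
    logger.info("order %s accepted", order_id)

process_order(42, "203.0.113.7")   # INFO level: one line emitted
logger.setLevel(logging.DEBUG)     # verbosity can change at run time
process_order(43, "203.0.113.7")   # DEBUG level: two lines emitted

captured = stream.getvalue().splitlines()
print("\n".join(captured))
```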
The 3 Rs of telemetry
This leads us to the 3 Rs of telemetry – Reliable, Relevant, and Real-time.
To serve the needs of rapid response, telemetry data needs to be reliable, i.e., available when needed and without gaps introduced by malicious actors or, inadvertently, by operators through misconfiguration or miscommunication. It needs to be relevant, i.e., it should provide meaningful, actionable insights without significantly driving up costs or query times through excessive, duplicate, and irrelevant information. And finally, it needs to be real-time, i.e., the stream of telemetry data can be changed, and new or additional telemetry data can be derived, at the click of a button.
A great way to complement logs in the cloud and address the 3 Rs is with telemetry data derived from observing network traffic. After all, command and control activity, lateral movement of malware and data exfiltration happen over the network. If end systems or applications are compromised and logging is turned off at the server or application, network activity continues, and observing it continues to yield breadcrumbs that identify the malicious activity.
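As a rough illustration of how the two sources complement each other, the sketch below compares hosts seen in network flow records with hosts that emitted logs in the same time window. The record layouts (`src_host`, `host`), the host names, and the function name `silent_but_active` are assumptions made for the example, not the schema of any particular product.

```python
def silent_but_active(flow_records, log_records):
    """Return hosts that generated network traffic but no log entries."""
    talking = {record["src_host"] for record in flow_records}
    logging_hosts = {record["host"] for record in log_records}
    return sorted(talking - logging_hosts)

# Illustrative records for one time window.
flows = [
    {"src_host": "web-01", "dst_port": 443},
    {"src_host": "db-02", "dst_port": 53},
    {"src_host": "db-02", "dst_port": 3389},
]
logs = [
    {"host": "web-01", "message": "GET /health 200"},
]

suspects = silent_but_active(flows, logs)
print(suspects)
```

Here `db-02` is still talking on the network while producing no logs, which makes it a candidate for closer inspection.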
Network-based telemetry can provide a reliable stream of information even when endpoints or end systems are compromised or impacted. Metadata generated from network traffic can be surgically tuned to provide a highly relevant and targeted telemetry feed.
Security operations teams can select from thousands of metadata elements specific to their use case, for example focusing on DNS metadata or metadata associated with remote desktop activity, and discard other network metadata that may not be relevant. This reduces cost and, equally important, makes it possible to write targeted queries. And, should the need arise to expand or change what telemetry data is being acquired, this can easily be done at the network level without requiring any change to the application. A simple API call can change what network metadata is being captured in near real time.
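The selection step can be pictured as a filter over metadata records whose field list can be swapped while the system is running. The sketch below is a stand-in held entirely in memory: the class `MetadataSelector`, its `update` method, and field names such as `dns.query` and `rdp.cookie` are hypothetical and do not represent any vendor's API. In a real deployment, the `update` call would correspond to the API request described above.

```python
class MetadataSelector:
    """Keep only metadata fields whose names start with a selected prefix."""

    def __init__(self, selected_prefixes):
        self.selected_prefixes = tuple(selected_prefixes)

    def update(self, selected_prefixes):
        # Stand-in for the API call that changes what is captured.
        self.selected_prefixes = tuple(selected_prefixes)

    def apply(self, record):
        return {
            name: value
            for name, value in record.items()
            if name.startswith(self.selected_prefixes)
        }

# Illustrative metadata record with hypothetical field names.
record = {
    "dns.query": "example.com",
    "dns.rcode": "NOERROR",
    "rdp.cookie": "user1",
    "http.host": "example.com",
    "tls.sni": "example.com",
}

selector = MetadataSelector(["dns."])
dns_only = selector.apply(record)

# During an investigation, widen the feed without touching the application.
selector.update(["dns.", "rdp."])
dns_and_rdp = selector.apply(record)
print(sorted(dns_only), sorted(dns_and_rdp))
```

The application that generated the traffic is untouched in both cases; only the selection applied to the observed network metadata changes.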
As organizations look to move to the cloud, complementing their log sources with network-based telemetry will prove invaluable in bolstering their security and compliance posture. In that sense, network-based telemetry is an essential component of securing the move to the cloud.