What is a security data lake?
The concepts of the data lake and the specialized security data lake are relatively new. While data lakes have a bit of a head start in adoption – largely among data science teams – some security teams are beginning to look into security data lakes to keep afloat in the wash of security log data they amass every day. Understanding the capabilities and differences between the two types of repositories will help determine if implementing one is right for your organization.
What is a data lake?
A data lake is a repository designed to store large amounts of data in its native form. This data can be structured, semi-structured or unstructured, and can include database tables, text files, system logs, and more.
The term was coined by James Dixon, CTO of Pentaho, a business intelligence software company, and was meant to evoke a large reservoir into which vast amounts of data can be poured. Business users of all kinds can dip into the data lake and get the type of information they need for their application. The concept has gained in popularity with the explosion of machine data and rapidly decreasing cost of storage.
There are key differences between data lakes and the data warehouses traditionally used for data analysis. First, data warehouses are designed for structured data, while data lakes accept data of any shape. Second, data lakes do not impose a schema on the data when it is written – or ingested. Rather, the schema is applied when the data is read – or pulled – from the data lake, which lets multiple use cases run against the same data. Lastly, data lakes have grown in popularity with the rise of data scientists, who tend to work in a more ad hoc, experimental fashion than the business analysts of yore.
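A minimal sketch of the schema-on-read idea, assuming JSON log lines with invented field names: the lake stores events exactly as they arrive, and each reader projects its own schema at query time.

```python
import json

# Raw events land in the lake untouched; no schema is enforced at write time.
raw_events = [
    '{"ts": "2024-05-01T09:12:03Z", "user": "jdoe", "src_ip": "10.0.4.17", "action": "login"}',
    '{"ts": "2024-05-01T09:12:09Z", "user": "jdoe", "bytes_out": 52311}',
]

def read_with_schema(lines, fields):
    """Schema-on-read: each consumer projects only the fields it needs."""
    for line in lines:
        event = json.loads(line)
        yield {f: event.get(f) for f in fields}

# Two use cases, two schemas, one copy of the data.
auth_view = list(read_with_schema(raw_events, ["ts", "user", "action"]))
net_view = list(read_with_schema(raw_events, ["ts", "src_ip", "bytes_out"]))
```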
How is a security data lake different?
A security data lake is a specialized data lake. A security analyst could certainly pull from a generic data lake built for multiple applications, but several things would prove more difficult.
Every security product, network device, endpoint computer and server creates its own logs. In some cases, security products such as data loss prevention (DLP) and intrusion prevention systems (IPS) also store a copy of network and endpoint logs. Consider an investigation into a user suspected of accessing a system without permission: the analyst would need logs produced by all the relevant systems, from the wireless router to the endpoint computer to the server that was accessed to the DLP application. Centralizing all relevant logs in a security data lake simplifies the investigation by reducing the work of collecting logs from multiple systems.
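As a toy illustration with invented records, a single query against a centralized store can stand in for four separate log pulls:

```python
# Hypothetical unified event records, each tagged with its originating system.
events = [
    {"source": "wifi_router", "user": "jdoe", "ts": "2024-05-01T09:11:58Z", "detail": "associated"},
    {"source": "endpoint",    "user": "jdoe", "ts": "2024-05-01T09:12:03Z", "detail": "login"},
    {"source": "server",      "user": "jdoe", "ts": "2024-05-01T09:14:41Z", "detail": "file read: /finance/q2.xlsx"},
    {"source": "dlp",         "user": "jdoe", "ts": "2024-05-01T09:15:02Z", "detail": "policy hit: outbound upload"},
]

def timeline(events, user):
    """One query against the central store replaces pulls from four systems."""
    return sorted((e for e in events if e["user"] == user), key=lambda e: e["ts"])

for e in timeline(events, "jdoe"):
    print(e["ts"], e["source"], e["detail"])
```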
To collect all of this information, a security data lake needs to connect to and parse many different types of logs. With hundreds of security solutions on the market, not to mention all the networking device types, this can be a daunting task. A security data lake automates these connections, via an API or another method, along with the processing of the data as it is loaded (known as parsing) and the schedule on which the data is collected.
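A minimal sketch of one collection-and-parse cycle; the vendor payload and field names here are invented stand-ins for a real product's log API.

```python
import json

# Hypothetical response from a product's log API; a scheduled collector
# (cron, an orchestrator, etc.) would fetch a batch like this each cycle.
vendor_batch = json.loads("""
{"next_cursor": "c124",
 "records": [{"timestamp": "2024-05-01T09:12:03Z",
              "source_address": "10.0.4.17",
              "event_type": "login"}]}
""")

def parse(record):
    """Normalize one vendor-specific record into the lake's common shape."""
    return {
        "ts": record["timestamp"],
        "src": record["source_address"],
        "event": record["event_type"],
    }

def ingest(batch):
    """One cycle: parse the batch and return a cursor so the next
    scheduled run picks up where this one left off."""
    return [parse(r) for r in batch["records"]], batch["next_cursor"]

events, cursor = ingest(vendor_batch)
print(events, cursor)
```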
Lastly, analysts often need additional context to perform an investigation. Details like user location, device type, and job role help an analyst understand what the user was attempting and whether there might be a legitimate reason for accessing certain systems or data. A person from the sales team accessing a server in finance might cause alarm, unless that person is in sales operations and calculates commissions. A security data lake will append, or enrich, log data with this kind of additional information.
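A toy sketch of that enrichment step, assuming a directory lookup keyed by username; the field names are invented, and in practice the context would come from an HR feed or a service like Active Directory.

```python
# Hypothetical user directory providing the context to append to raw logs.
directory = {
    "jdoe": {"department": "sales", "role": "sales operations", "home_office": "Chicago"},
}

def enrich(event, directory):
    """Append user context to a raw log event so analysts see it inline."""
    context = directory.get(event["user"], {})
    return {**event, **context}

event = {"ts": "2024-05-01T09:14:41Z", "user": "jdoe", "action": "server_access", "target": "fin-db-01"}
print(enrich(event, directory))
# A sales-ops user touching a finance server may be routine; the role field says so.
```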
Key capabilities of a security data lake
Here are five key capabilities security buyers should look for in a security data lake:
1. Automated collection: With hundreds of commonly used security, networking, computer and mobile device types in organizations, an automated collection process is the only practical way to keep the data up to date. It is not uncommon for large organizations to have billions of security-related logs per day. Unlike the periodic tasks of the data scientist, the security analyst needs all logs, every day.
Automation requires a method to schedule a data fetch (e.g., via an API call) or to accept a data push from a given system via protocols such as syslog, NetFlow, and Cisco eStreamer. Once the data is received, it must be parsed. A large parser library is essential, along with support for the wide range of log formats used across security applications, networking devices, computers and mobile devices. (A sketch of parsing a single syslog line appears after this list.)
2. Security context: The time-series data found in log files is verbose but lacks the organization and context an analyst needs. A security data lake helps organize log files and enriches them with important contextual information. For example, to a WiFi router connection event, a security data lake would add device type, geolocation and job title. Someone logging in from an unknown computer, or from a distant location, might raise a red flag – unless, perhaps, that person is a salesperson who travels frequently. Device information alone is not sufficient: insider threats are often detected based on the role of the user, so a developer accessing HR files could be deemed suspicious.
3. Hostname-to-IP mapping: IP addresses are typically assigned dynamically. A WiFi router in an office, for example, will assign and reassign the same IP address to multiple machines, sometimes in the same day. Though it may sound like a very tactical requirement, it is essential. Tracking down malicious insiders or criminals who have breached a network requires knowing which user was assigned which IP address, and at what time. Without mapping addresses to machines at a given point in time, even vast numbers of logs in the security data lake will be largely useless. (A sketch of such a time-based lookup appears after this list.)
4. Security analysis and reporting interface: The types of research done by security analysts are quite different from those done by data scientists. Security analysts are usually trying to demonstrate compliance, look for risky behavior, or investigate a breach that has already happened. For this they need search, alerting and reporting capabilities built into the data lake. SOC managers cannot expect analysts to master query languages or specialized analytical languages like R. Security data lakes need to provide a simpler way for analysts to search and understand the information contained within them.
5. Scale-out architecture: While all data lakes need to scale, it is especially important for security data lakes. Why? The sheer volume of data ingested and the required retention. Analysts need access to all security events in order to recreate timelines. And depending on local laws, industry regulations, and audit practices, organizations may be required to retain log data for months or even years. Scaling out to a multi-node cluster, rather than up to a larger machine, offers virtually unlimited storage capacity and a natural fit for flexible cloud deployment. (A back-of-the-envelope sizing sketch appears after this list.)
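To make the parsing requirement in item 1 concrete, here is a minimal sketch of one entry in a parser library: a simplified RFC 3164 syslog pattern. The pattern covers only the common case, and the sample line is illustrative, not production-grade.

```python
import re

# One entry from a parser library: a (simplified) RFC 3164 syslog pattern.
SYSLOG_3164 = re.compile(
    r"^<(?P<pri>\d{1,3})>"                          # priority = facility*8 + severity
    r"(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "   # e.g. "Oct 11 22:14:15"
    r"(?P<host>\S+) "                               # reporting hostname
    r"(?P<tag>[^:\s]+):?\s+"                        # program name (and pid)
    r"(?P<msg>.*)$"                                 # free-form message
)

def parse_syslog(line):
    m = SYSLOG_3164.match(line)
    if m is None:
        return None  # keep unparsed lines raw rather than dropping them
    fields = m.groupdict()
    pri = int(fields.pop("pri"))
    fields["facility"], fields["severity"] = divmod(pri, 8)
    return fields

print(parse_syslog("<34>Oct 11 22:14:15 fw01 sshd[8731]: Failed password for root"))
```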
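For the hostname-to-IP mapping in item 3, a toy time-based lookup over hypothetical DHCP lease records; the addresses, times, and hostnames are invented.

```python
from datetime import datetime

# Hypothetical DHCP lease history: (ip, assigned_at, released_at, machine).
leases = [
    ("10.0.4.17", "2024-05-01T08:02:00", "2024-05-01T12:30:00", "jdoe-laptop"),
    ("10.0.4.17", "2024-05-01T12:45:00", "2024-05-01T18:10:00", "asmith-laptop"),
]

def who_had_ip(ip, at):
    """Resolve an IP address to a machine for a specific moment in time."""
    t = datetime.fromisoformat(at)
    for lease_ip, start, end, machine in leases:
        if lease_ip == ip and datetime.fromisoformat(start) <= t <= datetime.fromisoformat(end):
            return machine
    return None

# The same address points at two different machines on the same day.
print(who_had_ip("10.0.4.17", "2024-05-01T09:14:41"))  # jdoe-laptop
print(who_had_ip("10.0.4.17", "2024-05-01T15:00:00"))  # asmith-laptop
```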
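And for the retention question in item 5, a back-of-the-envelope calculation of why scale-out matters; every number here is an assumption for illustration, not a measurement.

```python
# Rough sizing with assumed inputs, not benchmarks.
events_per_day = 2_000_000_000   # "billions of logs per day" at a large org
avg_event_bytes = 500            # raw line plus enrichment fields
retention_days = 365             # e.g., a one-year audit requirement

total_tb = events_per_day * avg_event_bytes * retention_days / 1e12
print(f"~{total_tb:,.0f} TB before compression")  # ~365 TB
```

At roughly a terabyte of new data per day, a year of retention lands in the hundreds of terabytes – a volume far more comfortably spread across a cluster than squeezed into a single machine.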
Security data lakes hold the promise of helping security analysts become more efficient in performing an incident investigation or hunting for threats. Knowing what to look for is an important first step in improving security. Without this specialized technology, even the most skilled analyst risks drowning in a sea of data.