Multi-modal data protection with AI’s help
Cybersecurity risk is distinct from other IT risk in that it has a thinking, adaptive, human opponent.
IT generally must deal with first-order chaos and risk, much like hurricanes in meteorology or viruses in biology: complex and dangerous – but fundamentally unthinking – threats such as failed processes, degraded parts, and other natural, manageable failures.
Cybersecurity, on the other hand, must deal with second-order chaos and risk: a chaotic system with threats that intelligently adapt to defenses and countermeasures, much as in warfare or espionage, but also in less martial conflicts such as sales, legal battles, and, soon, AI-assisted adversarial domains.
Normally, we frame this as cyber attacker versus incident-response defender, but it has big implications for data leakage and data protection, too: defending data demands a multi-modal methodology, and AI applications are uniquely suited to changing that game.
It takes a thief to catch a thief
A friend recently ran a hackathon where he challenged staff to get a unique data string out of the company. He awarded points based on the ability to get that string out and, critically, on the originality of the method: if someone was the only person to use a method, it was worth more than a method two or three people used, or one everyone used. What I loved about this “test of porousness” was that it put ordinary men and women in IT in the position of the rogue insider – and the originality exploded, with implications for collaboration and innovation.
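To make the scoring concrete, here is a minimal sketch in Python; the inverse-frequency weighting is my guess at the spirit of the rules, not the friend’s actual formula.

```python
from collections import Counter

def score_entries(entries: dict[str, str]) -> dict[str, float]:
    """entries maps participant -> exfil method; rarer methods score higher."""
    method_counts = Counter(entries.values())
    return {who: 1.0 / method_counts[method] for who, method in entries.items()}

print(score_entries({"ana": "dns-tunnel", "bo": "print", "cy": "print"}))
# {'ana': 1.0, 'bo': 0.5, 'cy': 0.5}
```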
This exercise also highlights two important points.
First, there is a malicious mind behind the scenes, thinking and scheming about how to change a given message for exfiltration. The string for exfil is not intrinsically tied to a medium: it could go out over Wi-Fi, mobile, browser, print, FTP, SSH, AirDrop, steganography, screenshot, Bluetooth, PowerShell, buried in a file, over a messaging app, in a conferencing app, through SaaS, in a storage service, and so on. A mind must consciously choose a method and morph the message to a new medium – with an adversary and their toolkit in mind – to succeed and, in this case, to score points in the hackathon.
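As an illustration of how cheaply one payload can shape-shift across channels, here is a minimal sketch; the helper names and the three channels chosen (HTTP header, DNS labels, zero-width-character steganography) are illustrative assumptions, not a catalog of real incidents.

```python
import base64

SECRET = "ACME-PROJ-7731"  # stand-in for the hackathon's unique data string

def as_http_header(s: str) -> str:
    # Hide the payload in an innocuous-looking custom header value.
    return "X-Trace-Id: " + base64.b64encode(s.encode()).decode()

def as_dns_labels(s: str) -> str:
    # Chunk the hex-encoded payload into DNS-style subdomain labels.
    hexed = s.encode().hex()
    labels = [hexed[i:i + 8] for i in range(0, len(hexed), 8)]
    return ".".join(labels) + ".telemetry.example.com"

def as_zero_width_text(s: str, cover: str) -> str:
    # Steganography: append each payload bit as an invisible character.
    bits = "".join(f"{byte:08b}" for byte in s.encode())
    hidden = "".join("\u200b" if bit == "0" else "\u200c" for bit in bits)
    return cover + hidden

print(as_http_header(SECRET))
print(as_dns_labels(SECRET))
print(repr(as_zero_width_text(SECRET, "Meeting notes attached.")))
```

Same signal, three unrecognizably different wire formats – and those are three channels out of the dozen-plus listed above.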
Second, a mind is required to recognize the string in its multiple forms or modes. Classic data loss prevention (DLP) and data protection work with blades that are disconnected from one another: each data type is searched for with its own criteria against an expected data format.
These can be simple, such as credit card or Social Security numbers in HTTP traffic, or complex, such as identifying email attachments that look like contracts. This approach, however, never fundamentally ties the blades together or steps back, as a human with infinite time and patience would, to look across all communication types for the common, underlying signal buried in the various kinds of noise.
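Here is a minimal sketch of that disconnected-blade model; the channel names and patterns are illustrative, not any product’s actual rules. Note that the morphed payloads from the earlier sketch would sail straight past both blades.

```python
import re

# Each "blade" is an isolated pattern bound to one channel and one data type.
BLADES = {
    "http": re.compile(r"\b(?:\d[ -]?){15}\d\b"),   # credit-card-like numbers
    "email": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-like strings
}

def classic_dlp(channel: str, payload: str) -> bool:
    """True if this channel's single blade matches; no cross-channel view."""
    blade = BLADES.get(channel)
    return bool(blade and blade.search(payload))

print(classic_dlp("http", "card=4111 1111 1111 1111"))          # True
print(classic_dlp("http", "X-Trace-Id: QUNNRS1QUk9KLTc3MzE="))  # False: morphed payload passes
```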
Enter the artificial intelligence toolkit: machine learning (ML), deep learning (DL), large language models (LLMs), and more. Really protecting data demands a multi-modal approach, and that requires deploying the AI toolkit, because human, carbon-based units don’t have infinite time and patience. It’s time to augment the human with silicon intelligence and to look for the errant signal from the intelligent minds seeking to exfiltrate through all those channels available to them.
Better monitoring through AI
The building blocks assembled over the last decade, which have led to an explosion of AI applications in writing, analysis, graphics, and other targeted domains, are nothing short of remarkable. Architectural, model, and data breakthroughs, along with insights into training, were critical; but the biggest advances changed how research in general was done over the last five years.
Academic and practical research found applications not in silos of AI advancement but across all of them. In effect, AI research itself became unified and more universally applicable across what were formerly separate areas of investigation: text, speech, image processing, and so on.
This means that multi-modal monitoring – applied AI that can look for base signals across media types – becomes possible. Rather than an off-the-shelf application of BERT or GPT, imagine LLMs and other parts of the AI toolkit built and trained for the express purpose of finding exfiltrating, morphing, evasive data – not just in high volumes of data, but in highly diverse high volumes with active evasion in place. In other words, this is not our grandparents’ DLP.
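Here is a minimal sketch of the core idea, using an off-the-shelf sentence-embedding model as a stand-in for a purpose-built multi-modal detector; the model name and the scoring are assumptions for illustration. The point is that whatever a channel yields – decoded headers, OCR’d screenshots, transcribed audio – gets compared in one shared vector space rather than by per-blade patterns.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Off-the-shelf text embedder standing in for a purpose-built model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def exfil_score(channel_extracts: list[str], sensitive_refs: list[str]) -> float:
    """Max cosine similarity between any channel extract and any known-sensitive text."""
    a = model.encode(channel_extracts, normalize_embeddings=True)
    b = model.encode(sensitive_refs, normalize_embeddings=True)
    return float(np.max(a @ b.T))  # normalized vectors: dot product == cosine

score = exfil_score(
    ["quarterly numbers attached, do not forward"],  # e.g. OCR'd from a screenshot
    ["Q3 revenue figures are confidential"],
)
print(f"{score:.2f}")
```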
What’s more, the technology can also be used to identify application traffic and to parse types of interactions. What does that mean?
It means that not only can data be spotted, but certain conversation types can be spotted, too. For example, if a policy says that LLMs may only be used in a certain manner and under specific circumstances, multi-modal data protection can be built to spot all LLM-like conversations – with a high chance of discovering even those on unusual communication channels being actively obfuscated – and simple, classic-style filters can then be brought to bear to allow-list the approved ones.
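A minimal sketch of that policy layer, assuming a hypothetical heuristic for “LLM-like” traffic and hypothetical endpoint names; a real system would put a trained classifier where the heuristic sits.

```python
# Hypothetical approved endpoints, assumed for illustration.
APPROVED_LLM_ENDPOINTS = {"internal-llm.example.com"}

def looks_like_llm_conversation(turns: list[str]) -> bool:
    # Placeholder heuristic: alternating short prompts and much longer
    # completions. A real system would run a trained classifier here.
    if len(turns) < 2:
        return False
    prompts, replies = turns[0::2], turns[1::2]
    avg = lambda xs: sum(len(x) for x in xs) / max(len(xs), 1)
    return avg(replies) > 3 * avg(prompts)

def policy_verdict(host: str, turns: list[str]) -> str:
    if not looks_like_llm_conversation(turns):
        return "ignore"  # not LLM-like traffic
    return "allow" if host in APPROVED_LLM_ENDPOINTS else "flag"

convo = ["summarize this", "Here is a long summary of the document... " * 5]
print(policy_verdict("internal-llm.example.com", convo))  # allow
print(policy_verdict("pastebin.example.net", convo))      # flag
```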