Behind the scenes of the cleanest ISP in the world
The malware and botnet problem has been around for quite a while, and definitive solutions for it are still eluding the Internet and mobile communication industry, as well as the security industry.
The Chairman of the US Federal Communications Commission has recently made it known that he considers ISPs a crucial factor in the fight against botnets, and has pushed for the adoption of a voluntary code of conduct aimed at keeping their customers and the Internet infrastructure safe from various threats.
The plan is workable in practice, as Swedish telecommunications company TeliaSonera has already proved by implementing all the actions included in the code years ahead of it and consequently earning its status as one of the cleanest ISPs in the world.
Arttu Lehmuskallio, Security Manager of TeliaSonera’s CSIRT in Finland, shares details about the evolution of his company’s automated monitoring and alerting system, the problems they had to face in its various stages and the solutions they came up with.
Although every ISP in the world has to battle malware, TeliaSonera is regarded as being the “cleanest of the clean.” You earned this reputation for safe computing by creating an automated monitoring and alerting system to identify infected devices, alert their owners, and remove the devices from the network until cleaned. How did the idea of this system come about and why do you think other ISPs are not doing the same?
Back in 1999 I started working on a team that, among other things, handled the abuse cases. Back then we didn’t have any alerting systems, no abuse handling systems, no ticket systems. It was only about reading the abuse mailbox and reacting to cases on a case-by-case basis by manually browsing the logs and notifying customers and/or shutting customers’ connections. Our mindset was and still is that we’ll handle every single case. When talking about abuse of our customers, handling consists of three things:
a) determining whether the source information is legit
b) identifying the customer behind the IP address + timestamp
c) mitigating the source of abuse.
In 2001 we had 1000 cases. In 2002 we had 2000 cases. In 2003 we had 130 000 cases. You can imagine when the idea of an automated system came about.
As for your second question, I really don’t know what all the ISPs of the world are doing and to what extent; we’re just doing our thing and it seems to work. We have no data of our own to be able to compare ourselves to other ISPs, so I enjoy reading third-party stats and figures, which always seem to indicate that Finland is doing a great job. That, in turn, suggests that plenty of other ISPs out there are not operating in a similar fashion.
My personal favorite of those studies is “The Role of Internet Service Providers in Botnet Mitigation: An Empirical Analysis Based on Spam Data”, OECD Publishing, May 2010.
It would actually be quite interesting to see the data drilled down even further, so instead of comparing countries, we would be comparing individual ISPs.
A lot of the public debate has been circling around whether handling abuse is the ISP’s responsibility and whether the whack-a-mole game is the right approach or whether it is actually counter-productive in the fight against Internet crime. While these are interesting debates, we’ve always felt that this is really about the quality of the services that we provide.
When a customer has his box hacked, the faster we’re able to mitigate the situation, the better service we provide. When we’re talking about consumers, cutting the connection or pushing the connection into the walled garden is really not about removing our customer’s ability to use the Internet, but rather about removing the criminal’s access to our customer’s system. We feel that we’re essentially providing a service to our customers, and if we make the world a better place while providing that service, it’s a nice additional benefit. I know some ISPs share this view while others don’t. That’s life.
There’s also the fact that a lot of the malware-infected customers have a worse “Internet experience” and they are likely to blame their ISPs for it. Take something like DNSChanger for example. For us, all customers infected with it have their DNS servers on the opposite side of the planet. That causes some latency issues. Also, when I read something like the “FBI’s Internet Blackout Postponed from 8 March to 9 July”, it makes me chuckle a bit. I checked our share on 7 March and we had two (2) customers reported to us that day, so “Internet Blackout” doesn’t really apply to our customers.
What were the most significant challenges in setting up a system that closely interacts with users and warns them about infections? What were the major obstacles you encountered during this process?
Every new thing we manage to get running is aimed at making our workload a bit lighter.
Creating a system with a group of people that had never created any systems whatsoever was a challenge in itself. Had there been a commercial solution available at the time, we would have probably taken that route, which would have been a terrible decision that we would most likely still regret. The thing we have come to realize in the last 10 years is that our system will never be finished – we constantly need new features more or less immediately in order to react to new threats or new products of our own. If we had a commercial system, we would have to wait for an update for who knows how long. We started building the system one step at a time. Sometimes we got features into production within hours, while others took years to accomplish.
To give an example, back in 2002 before we had anything, the most cumbersome task was when we had, say, a spammer, and we might have 100 complaints in the abuse box in the morning. I would take a single email, check the IP address, search the inbox for that single IP address and move all emails containing it to a temp folder. Because new complaints would be pouring in steadily I would shout over my cubicle that “I took 1.2.3.4, don’t touch that”. Going through the radius logs, DHCP logs, or whatever was applicable for that case to determine the right customer took around 20 minutes.
Even when we had already shut the customer’s connection, complaints about previously sent spam would keep coming in for weeks, so we had to remember that 1.2.3.4 had already been handled. Then we had the additional problem of dynamic IP addresses. We had to go through the 20-minute routine multiple times only to find out that the same customer was behind a lot of the different addresses.
So, the first thing we did was automate the log browsing by putting DHCP, radius and other logs into a database that our system could use to resolve the customers behind the IP addresses. We then opened up APIs to our customer management systems to get the customer information, chose credible sources of intel to automate and integrate into our system, and created a web GUI for the handling part. Once that was done, we had a single case instead of a hundred emails with multiple IP addresses in a messy mailbox: a single row in our handling system saying Customer ID12345, with all those emails behind that link.
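As a rough illustration of the lookup being automated here – not TeliaSonera’s actual system, just a sketch with invented record fields – the core operation is mapping an IP address plus a timestamp from an abuse report back to a customer:

    # Sketch only: map (IP address, timestamp) from an abuse report back to a
    # customer via stored DHCP/radius lease records. Field names are invented.
    from dataclasses import dataclass
    from datetime import datetime
    from typing import List, Optional

    @dataclass
    class Lease:
        ip: str
        customer_id: str
        start: datetime
        end: Optional[datetime]  # None means the lease is still active

    def resolve_customer(leases: List[Lease], ip: str, ts: datetime) -> Optional[str]:
        """Return the customer that held `ip` at time `ts`, or None if unknown."""
        for lease in leases:
            if lease.ip == ip and lease.start <= ts and (lease.end is None or ts < lease.end):
                return lease.customer_id
        return None

    # An abuse report naming 1.2.3.4 at 09:23 resolves to a single customer ID,
    # so a hundred complaints about that address collapse into one case.
    leases = [Lease("1.2.3.4", "ID12345", datetime(2003, 5, 14, 8, 0), None)]
    print(resolve_customer(leases, "1.2.3.4", datetime(2003, 5, 14, 9, 23)))  # ID12345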
The next step was to notify the customer and/or shut the connection. Notifying the customer was easy because we already got the customer information automatically. Shutting the customers with a button was a bit trickier, but we started with the connection types where we had the most incidents.
The most tricky part was getting the connections up and running again – how would the customer be able to inform us that the box had been fixed? So we opened our system to our customer helpdesk and gave them the “unshut” button. When we realized that, instead of copies of Viagra ads and firewall logs, they really needed descriptions of each malware type, links to more information and so on that they could pass on to the customer, we started providing them with exactly that.
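That notify / shut / unshut lifecycle can be pictured as a tiny state machine; the sketch below is illustrative only, with made-up state and event names rather than the real system’s:

    # Sketch only: the notify / shut / unshut lifecycle as a small state machine.
    # State and event names are invented for illustration.
    from enum import Enum, auto

    class ConnState(Enum):
        NORMAL = auto()
        NOTIFIED = auto()       # customer warned, connection still up
        WALLED_GARDEN = auto()  # connection shut until the customer reports it fixed

    def next_state(state: ConnState, event: str) -> ConnState:
        if state is ConnState.NORMAL and event == "incident":
            return ConnState.NOTIFIED
        if state in (ConnState.NORMAL, ConnState.NOTIFIED) and event == "shut_button":
            return ConnState.WALLED_GARDEN
        if state is ConnState.WALLED_GARDEN and event == "customer_reports_fixed":
            return ConnState.NORMAL  # trust the customer; they land back here if it recurs
        return state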
But this is all ancient history and was finished back in 2003. There’s always something new popping up that we need to adjust to. Right now we’re struggling with NAT a bit.
The trickiest part was and still is to detect the malware on our own instead of depending on third parties. The most challenging part of that has been making the solutions compatible with our legal framework. One trick we got into production about four years ago is something I like to call a reversed darknet. With it, we more or less detect 100% of worms and other malware that try to scan the network. In a nutshell, we log all outbound traffic where the destination is not found in the routing table at that time.
When our customer has enough traffic towards unannounced IP space, the evidence is pushed as an incident ticket against that customer. Even in the current IPv4 space, there’s still plenty of unannounced space practically behind every /8, so any malware trying to scan, say, random 10 000 addresses per hour will get caught thousands of times during that hour. We even detect malware trying to spread solely within internal networks, because customers tend to route the private IP address space not used by themselves up to their ISP.
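As a rough sketch of the idea – not the production implementation, which would presumably use a proper longest-prefix lookup against the live routing table – the detection boils down to counting each customer’s outbound flows towards IP space that has no route, and raising a ticket past some threshold. The prefixes, flow format and threshold below are invented:

    # Sketch only: "reversed darknet" detection. Count outbound flows whose
    # destination is not covered by any announced prefix; customers exceeding a
    # threshold get an incident ticket. All values here are illustrative.
    from collections import Counter
    from ipaddress import ip_address, ip_network

    ANNOUNCED = [ip_network(p) for p in ("198.51.100.0/24", "203.0.113.0/24")]  # e.g. from a RIB dump
    TICKET_THRESHOLD = 1000  # hits towards unannounced space before a ticket is raised

    def is_announced(dst: str) -> bool:
        addr = ip_address(dst)
        return any(addr in net for net in ANNOUNCED)

    def customers_to_ticket(flows):
        """flows: iterable of (customer_id, destination_ip) outbound flow records."""
        hits = Counter()
        for customer_id, dst in flows:
            if not is_announced(dst):
                hits[customer_id] += 1  # a scanner hitting random space racks these up fast
        return [cust for cust, count in hits.items() if count >= TICKET_THRESHOLD]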
How many people work on the team dedicated to fighting infections on the endpoint and what are their roles?
Our CSIRT team consists of five security specialists. None of us work solely on customer infections; we rather see handling them as part of our team’s basic activities. Customer incidents contribute to our “other job”, which is to handle all internal IT security incidents, because instead of having to prepare for new threats by reading about them in the media, we’ve usually already seen them targeting a customer of ours. We also detect our own infected workstations with our system, which is a nice additional benefit.
One person is mainly responsible for running the system and bringing new features into production, though he’s running plenty of other systems not related to customer abuse as well. Additionally, most of us can code, so we all contribute. Handling the customers is done by all of us, but it probably takes less than a few hours a day altogether. I mean, when we get information from a credible source that our customer is, say, infected with Zeus, it’s just a matter of clicking the “Zeus” button. That takes a fraction of a second – and even that could be automated if we wanted to, but we’ve decided against it for now.
We don’t have a dedicated helpdesk for these cases. When a customer needs support, the case goes to anyone that happens to answer our tech support number. There’s the additional benefit that our helpdesk is more “security aware” than most as they are always reading about the latest threats. When the helpdesk needs help, we have an internal IRC server and a channel dedicated to this.
We don’t help our customers remove the malware – that’s up to them. We just provide the evidence and tools, e.g. logs, links, a trial AV, etc. If they need help removing the malware or reinstalling their Windows or whatever, they can use any service they want, be that the geek cousin or a commercial helpdesk. We do have a subsidiary company providing commercial helpdesk services, so that’s one option. The connection is returned to normal the moment the customer informs us that he’ll take care of it. No additional checks on our part, as we, on principle, choose to trust our customers. If they fail, they’ll end up getting notified and/or finding themselves in our walled garden again.
How did you start working with Microsoft? How do you complement each other in this cooperation?
Microsoft started sending information about detected Rustock infections about a year ago. We appreciate anyone dedicating their time and effort to sending us information, and we had no reason to believe the data was inaccurate, so we automated their notifications more or less immediately. Additionally, I’ve been reading the SIR for years and have always taken pride in the fact that Finland was listed as the country with the fewest infections. I trust the SIR numbers, as it’s one of the few sources that can actually measure infections per computer instead of per IP address.
Microsoft included a more detailed view of Finland in their latest SIR report and visited Finland last October to talk about their findings. After the event I approached Tim Rains and told him I would be grateful if they could provide us with our own infection numbers instead of figures for “Finland” as a whole. That evolved into the case study.
How we complement each other – well, I think we share the same vision and interest. There’s a lot of malware that we’re not able to detect, so we need to get that information from external sources. While there’s a great security community out there that helps us fight the good fight, there are also a lot of companies that have plenty of data they just won’t share. Microsoft does share, and that is something we appreciate. In the end, Microsoft wins because there are fewer infected Windows users, we win because our networks are cleaner, and the Internet wins because there’s less malicious activity. But the real winner is the customer who doesn’t get his personal data, identity or money stolen.
Since you analyze an immense amount of data each month, you certainly have a deep insight into your clients’ computing environments. What are some of the most interesting things you found out about users in general since you implemented the system?
The latest cool find was last December. It turns out that, for the first time in history, the most common malware lived not in Windows on a workstation but in embedded Linux on a home appliance. More precisely, it was designed to hack MIPSEL, MIPS, ARM, PPC and SH4 devices – e.g. routers and modems running telnetd with weak, default or no passwords – but it seemed to spread especially rapidly among IPTV boxes.
We worked together with the Finnish CERT community and found that there were some 17,000 bots in that particular botnet. That was an interesting exercise, as all the normal “patch your system, use AV” advice didn’t really apply, a hardware firewall is probably a bit too much to ask, and since it wasn’t our service, we didn’t know anything about the devices – so we couldn’t, for example, instruct customers on how to set the passwords or disable the telnet service altogether.
Talking about Linux, we recently saw a customer doing 445/TCP scans. It turned out the customer had downloaded a game which included a Trojan and was running it within Wine, which meant Wine emulated Windows well enough that it could even get itself sneakily infected.
What advice would you give to other ISPs that have to battle the same challenging threat landscape as you?
Convince your management that this is beneficial to your company’s bottom line. Find evidence that reacting to customer infections increases customer loyalty. Think about practical examples. Make the process financially feasible and measurable. Ask for help. Share your experiences and knowledge.