Lessons learned from cracking 2 million LinkedIn passwords

Like everyone this week, I learned about a huge file of password hashes that had been leaked. The 120MB zip file contained 6,458,020 SHA-1 hashes of passwords for end-user accounts. At first, everyone was talking about a quick way to check if their password had been leaked. This simple Linux command line:

echo -n MyPassword | shasum | cut -c6-40

allows the user to create a SHA-1 sum of his password and take the 6th through 40th characters of the result. Then the user could easily search the 120MB file to see if his hash was present in the file. If it was, then of course his password had been leaked and his account associated with that password was at risk.

John the Ripper
When the OpenWall community released a patch to run John The Ripper on the leaked file, it caught my attention. It has been a long time since I have run John The Ripper, and I decided to install this new, community-enhanced “jumbo” version and apply the LinkedIn patch.

John the Ripper attempts to crack SHA-1 hashes of passwords by iterating on this process: 1. guess a password, 2. generate its SHA-1 hash, and 3. check if the generated hash matches a hash in the 120MB file. When it finds a match, then it knows it has a legitimate password. John the Ripper iterates in a very smart way, using word files (a.k.a. dictionary attack) and rules for word modifications, to make good guesses. It also has an incremental mode that can try any possible passwords (allowing you to define the set of passwords based on the length or the nature of the password, with numeric, uppercase, or special characters), but this becomes very compute-intensive for long passwords and large character sets.

The fact that the file of hashed passwords was not salted helps a lot. As an aside, even if they were salted, you could concentrate the cracking session to crack the easiest passwords first using the “single” mode of John the Ripper. But this works best with additional user information like a GECOS, which was not available in this case, at least to the public. So the difficulty would be much greater for salted hashes.

Approach
In my case, I have an old machine with no GPU and no rainbow table, so I decided to use good old dictionaries and rules.

I ran the default john command that just launches a small set of rules (like append/prepend 1 to every word, etc.) on a small default password dictionary of less than 4000 words. It then switches to incremental mode based on statistical analysis of known password structures, which helps it try the more likely passwords first. The result was quite impressive because after 4 hours I had approximately 900K passwords already cracked.

But then, as it got to the point were it was trying less and less likely passwords and therefore found matches more slowly, I decided to stop it and run a series of old dictionaries I had: from default common password lists (16KB of data) to words of every existing language (40MB of data). It was very efficient and found 500K more passwords in less than an hour, for a total of 1.4M passwords.

Even though my dictionaries were 10 years old and didn’t contain newer words like “linkedin”, it appeared that some cracking rules, by reversing strings or removing some vowels could guess new slang words from already cracked passwords.

And as I had just acquired 1.4M valid passwords, I believed that using these newly discovered passwords as a dictionary I could find more. It worked and the rules applied to the already cracked passwords produced 550K new ones. I ran a second iteration using the 550K passwords from the first iteration as a dictionary, and found 22K more. I iterated in this manner a total of ten times.

It is interesting to see that the most elaborate passwords found in the 3rd or 4th iteration of this kind of recursive dictionary cracking were related to the word linkedin most of the time. If I tried to match the word linkedin slightly modified (reversed or with ‘1’ or ‘!’ instead of ‘i’ like in l1nked1n):

  • In the first iteration, 558 passwords found in the 554,404 (0.1%) are related to the “Linkedin’ string
  • In the second iteration, 3248 out of 22,688 (14%) are related to the “Linkedin’ string
  • Third iteration: 1,733 out of 3,682 (47%)
  • Fourth iteration: 539 out of 917 (59%)
  • Fifth iteration: 217 out of 330 (66%)
  • Sixth iteration: 119 out of 152 (78%)
  • Seventh iteration: 40 out of 51 (78%)
  • And so on through the tenth iteration.

An example of what I found on the 7th pass is: m0c.nideknil. Another example is: lsw4linkedin, which was found on the tenth pass. To illustrate how the rules work for modifying words in the dictionary, below is the actual set of modifications used to get from the dictionary entry ‘pwlink’ to the successfully cracked password ‘lsw4linkedin’ over the ten iterations:

  • pwdlink from pwlink with the rule “insert d in 3rd position”
  • pwd4link from pwdlink with the rule “insert 4 in 4th position”
  • pwd4linked from pwd4link with the rule “append ed”
  • pw4linked from pwd4linked with the rule “remove 3rd char”
  • pw4linkedin from pw4linked with the rule “append in”
  • mpw4linkedin from pw4linkedin with the rule “prepend m”
  • mw4linkedin from mpw4linkedin with the rule “remove second character”
  • smw4linkedin from mw4linkedin with the rule “prepend s”
  • sw4linkedin from smw4linkedin with the rule “remove second character”
  • lsw4linkedin from sw4linkedin with the rule “prepend l”.

This is the deepest password found, i.e. the only one obtained in the last iteration. This clearly shows that no matter how elaborate a password you choose, as long as it is based on words and rules, even if there are many words and many rules, it will probably be cracked. The fact is that on a huge file like the LinkedIn leak, every password you find can help you to get another one. That is because human-created passwords are not random, and programs like John the Ripper and dictionary attacks can use patterns, either already known or discovered in the password hash file, to greatly reduce the time needed to crack them.

Password management
It is highly recommended to use a strong random password generator that is known to be actually random. It is funny to note that a very old version of a command line tool called “mkpasswd” produced passwords based on a bad random salt and was generating only 32768 different passwords, this was reported and fixed 10 years ago, but I was still able to recover 140 passwords in the leaked file that had been generated by this vulnerable version of mkpasswd.

Evidence indicates that the hacker who made this leak public was most likely trying to get cracked passwords from an online community, a kind of crowdsource cracking. Since he probably possesses the list of logins as well, you might want to change your passwords in other accounts if you think he can access them with the information he has. Note that if you have unique passwords created with simple rules, you might change them as well. For example, if your password for LinkedIn is MyPW4Linkedin, a malicious cracker might guess that MyPW4Facebook might be your Facebook password. It is also recommended to change your password if your username can be guessed from it, because every password cracker on the planet is currently playing with this password file.

The author of John the Ripper, Solar Designer, did a great presentation on the past, present and future of password security. Although the security industry has put a lot of work into making good hash functions (and there’s still more work to do), I believe that poorly chosen passwords are a concern. Maybe we should demand that our browsers (using secured storage as in Firefox Manager) or 3rd-party single-sign-on providers create easier solutions to help us resist the temptation of using simple passwords and re-using the same passwords with simple variations.

Note: The hashes in the 120MB file sometimes had their five first characters rewritten with 0. If we look at the 6th to 40th characters, we can even find duplicates of these substrings in the file meaning the first five characters have been used for some unknown purpose: is it LinkedIn that stores user information here? is it the initial attacker that tagged a set of account to compromise? This is unknown.

Don't miss