Why I still use MD5

If you started programming in the early 2000s or before, you’ve probably used MD5. A lot.

But then, following the collision attack published in 2004 by Xiaoyun Wang, the most widely used hashing algorithm started falling out of favor, to the point that mentioning it is now guaranteed to provoke knee-jerk reactions. But as often, there’s more to it, and we’ll get to it in a second.

MD5 is a hashing algorithm designed in 1991 by Ronald Rivest, a major figure in cryptography. It takes an input of any size and generates a deterministic 128-bit output, in a fast way and with a good avalanche effect. As the de facto reference of hash functions, it got implemented into almost every programming language and framework, which further increased its widespread use, to the point that despite being considered cryptographically broken and unsuitable for most applications, it’s nowhere near gone as of today.

But how is MD5 broken exactly? It boils down to one thing: collisions. A collision is when two different inputs yield the same output for a given hash function. You might be thinking that since the set of possible inputs is infinite (MD5 takes inputs of arbitrary size), and the set of possible outputs isn’t (outputs are 128 bits, often represented as a sequence of 32 hexadecimal digits), then there is an infinite amount of collisions (that’s the pigeonhole principle). Well, you’re right, but a main security requirement of hash functions is, more specifically, that collisions be sufficiently hard to find, and that’s not the case with MD5. More specifically, any attack that finds a collision in way less than \(2^{\frac{n}{2}}\) operations for a hash function with \(n\)-bit outputs is said to break that security requirement: this number comes from the birthday paradox, which states that given a population of \(k\) equally probable values, we need roughly \(\sqrt{k}\) random samples to expect to find a single collision.

Concerns about MD5 aren’t new: in 1996, Hans Dobbertin published a collision attack on one of the internal elements of MD5, the compression function. Cryptographers started discouraging its use, and they turned out to be right as Wang’s 2004 findings were in fact based on Dobbertin’s. But is MD5 broken for everything?

Hash functions can be used for many things. There’s the usual hash tables, which map keys to values using hashes, and derived applications, such as caches, fingerprinting, duplicate record management, fast data comparison, etc. And then there’s the cryptographic applications, such as password storage, proof-of-work, and file integrity verification. The latter are critical to application security, and almost always require a high resistance to collisions. Else, an attacker could just, for example, forge a fake signed message by replacing the original message with one that yields the same hash, since it is usually hashes that are given as an input to digital signature algorithms.

But what about non-cryptographic uses of hash functions? In most cases, all that is needed from them is that they’re fast. For example, in an application that checks if a file is already present in a database by checking the hash first, and then only comparing the files if the hashes match, you don’t care much about how easy it is to generate collisions.

And that’s where MD5 comes into play: it is much faster than SHA-256, the current reference for (cryptographic) hashing. It is hard to give an all-encompassing figure because some algorithms will perform better on certain architectures (notably multiple cores), but most benchmarks show that MD5 is faster by factor of ten—an unsurprising result given its simplicity.

So, if you’re building a performance-critical application that needs to calculate hashes all the time in situations where collisions do not represent a danger (for example, checking if a command is already being run with the same parameters and abort if that’s the case in order to prevent data from being processed twice), then MD5 could come in handy.

Else, you can ignore this clickbait article and go back to using proper cryptographic hash functions. 🙂