Theoretical Cryptography Part I - MD5
Since Martin constantly manages to cover all ongoing news regarding cryptography and security I myself planned on writing about "any time soon", I had to look for other topics on the subject I could cover. So I decided to make good use of all the time I spend studying this kind of stuff and share this knowledge with you. And I would be very grateful, if someone could tell me in return how to defeat shirred fabrics with my sewing machine... but that's a whole different matter.
So for the first Part of this - hopefully ongoing - series, I decided to look at the MD5 hash algorithm. It's one of the most commonly used cryptographic algorithms out there and I would claim that nearly everyone has a password somewhere that is stored with an MD5 or similar hash.
MD5 stands for Message-Digest Algorithm 5, and is - as already mentioned and you probably already knew - a hash algorithm.
The MD5 hash algorithm is in simple terms a deterministic function (or blackbox) that will calculate a 128-Bit hash value from a given string of well-nigh any length ... yeah, I had to read this sentence over a few times, and it's just rubbish. If I wanted to write something like that, I could've gone Wikipedia. So let's crack this one open.
You feed the MD5-Box a string of any length you want. This "string" doesn't have to be alphanumeric of course, any stream of bits and bytes is just fine, like the bitstream of a file, for instance. The output string has always a length of 128 bits and is usually noted as a string of 32 octets, like this one: "B5A8AD3A9CDD6A6953FCBE6975FDE734" (try guessing what I typed in though).
One of the most important things about hashes is, that they are so-called one-way-functions, meaning, they only encrypt stuff, and can't - and must not - be decrypted. So hashes are often used for storing passwords in a databases. The same plaintext will always be hashed to the same cipher text with MD5, so all you have to do to check if your password and the stored (hashed) password are identical is to compute the hash of the given password and compare it with the stored one.
There are several demands a good hash-function has to meet in order not to get cracked in the first two hours of its lifetime.
The first one is, that a minor change in the plaintext (like "ghacks" and "gHacks") should have a big impact on the computed hash ("D1B81FBDEB51C3A850E37177A5A22498" and "DB3E20DC88EF0B6CA6A8FD5DA448D323"). If the difference would be only minor, and I know the plaintext and hash of "ghacks" (which I do, of course), and have the hash of "gHacks" without the knowledge of its plaintext, I could easily guess it.
The second very important demand is that a hash-function produces a much smaller memory imprint than the original stream. If you hash an 11MB installer to verify its integrity and have to download another 10MB of hash file as well, it's pretty useless. There are lots of other points to keep an eye on, but these will (and have to) suffice.
As I mentioned already, hash-functions such as MD5 are most commonly used to store passwords without actually storing them in plaintext, and to verify the integrity of files. When you put a file online, just compute the hash and publish it together (but separate) with the file. Ever user would be able to determine if the downloaded file has been tampered with by simply comparing the hash of the downloaded file with the one published on the website.
Now I'd like to say something about security and known (and partly successful) attacks against hashes and MD5 in particular.
Due to the reduction (a 2MB file gets reduced to a 32-octet hash), information gets lost. This gets perfectly clear, if you take a look at the numbers. There are only 2^128 possible hash values, but infinite possible plaintexts. So in a best-case-scenario, after hashing plaintext numbers (2^128)+1 you have at least two plaintexts getting mapped on one and the same hash value.
So the first attack tries to make use of this very fact. When the same hash value is calculated from two different plaintexts, it is called a collision. Depending on the scenario of the attack using collisions, the birthday paradox comes in handy as well, increasing the attackers chance of success.
That would mean that you do not attempt to break the encryption or guess the user's password when trying to crack a password, but just try to create another password that leads to the very same hash value, granting you access to the account. Of course, knowledge of the hashed password is required, but without that information, most attacks on modern ciphers are more than just tricky.
Edit: please take a look at comments for more clarification on the types of attacks mentioned above.
The second attack is based on a brute-force attack, which is basically "try all possible keys/passwords". Depending on the numbers this could take some time. Let's say you've already acquired the target hash value and your machine is able to try 100 keys per ms. That would make 100.000 keys per second, and 6.000.000 keys per minute. 2^128 hash values. That's 3.4E38. We're talking "age of the universe in seconds"-numbers here.
But there's more to it than meets the eye. There are several options to reduce the available possibilities. Can you reduce the amount of possible plaintexts maybe? Maybe the password only allows to be 8 alphanumeric letters long? Can you have a look at the used algorithm and find something that may help you further? Do you know part of the plaintext? Maybe a name of son/wife/pet? Then you could combine it with a dictionary-attack. Every bit of information helps reducing the number of possibilities further, which in the end leads to a situation like this:
The following is a description of an attack to crack the user passwords of windows accounts (up to XP), and implemented in a near-perfect way by ophcrack. If interested, do make sure to check this tutorial, it's quite fascinating and yet unbelievably scary.
Windows saves hash values of the user passwords, but if a password is longer than 7 signs, it gets broken up into chunks of length <= 7. Then the chunks get converted to uppercase only. Microsoft used DES for creating the hashes, but there's no difference regarding this kind of attack.
So the attacker knows pretty much about the plaintext and can reduce its possibilities by a great deal. Now a computer starts calculating all possible hash values for this particular range of plaintexts (up to 7 digits, uppercase, numbers and some special characters only) and stores them in a database. Once finished, the database is from about 0.7 to 4 GB in size and can be easily transported using a thumb drive or a DVD.
Now all the attacker needs is a few minutes alone with the target computer and it's done. Again, check the tutorial mentioned above, it kinda blew my mind. 1.7 minutes was the average time in this experiment for cracking a password to your windows account. Ouch.
Since I read and heard all of the above some time ago, I started wondering about the benefits and risks of using MD5. Most security experts discourage the use of MD5 nowadays for its known vulnerability to collision attacks. It should be replaced by something like the SHA-1 or since it is kind of outdated as well the even newer SHA-512. But that doesn't help against the attack last mentioned, apart from increasing the possible hash values to even greater dimensions.
After some time, I found this very helpful article about spicing up your hashs to be more secure. I have to say though, these tips are NOT increasing the security of your hash function in a mathematical way. Luckily, the real world's not all about math, so I think they are an easy way to get some extra security.
Edit: Please keep in mind that the tutorial posted here is not a perfect implementation of salts. It's - as always - a source for ideas, not a perfect solution. But I always like it more if it's explained like that, easy and understandable and in a rather digestible way. Please correct me if I'm mistaken.
If you want to screw around with MD5 a bit, here's a link to an applet where you can do just that (SHA-1 as well). Switch to MD5, enter some text and press "Text digest". Try guessing my hash from above (reaaaal easy), if you like and post the answer in the comments. First to score gets a cookie ;)
Stay tuned for upcoming ramblings about encryption and stuff. Maybe AES will be next.
Advertisement
…’course not ;)
but its , without the http://
Edit: *lol* The damn wordpress just added it… my bad..
Here, catch a snack!! *waveswithcookie* ;)
http://www.ghacks.net => B5A8AD3A9CDD6A6953FCBE6975FDE734
You weren’t lying when you said it was easy
@Martin: Since I like being honest, and get the feeling I can be honest here as well, I gotta admit it came rather snooty (didn’t knew that word before ;). BUT (!) I’m kinda …let’s call it sensitive in means like this, so I gotta excuse me for snapping at you.
You’re pretty right about the second point though, one should know the limitations of the security implemented. As you probably guessed by now, I was trying to reach a target audience representing much smaller projects and such. The big ones should probably be able to handle themselves, or at least I’d like to believe that *looks at the Microsoft example* …nevermind.
Nah, I try to reach the more private persons here. Open-Source-Freaks, Forum guys, blogees. Watch out for your private data! And if you happen to get these from other people, be responsible as well with them. That’s the kind of message I’d like to get through.
No hacker that’s out for easy prey will continue cracking your database with forum users if it gets expensive.
That’s like the base rule of security for me I forgot to mention, but will do in the future ;)
“A system is secure enough, if the amount of effort (time, money and else) to break it is far greater than the use of the broken system”
or something like that.. should get more catchy. :)
Anyways, no hard feelings here. I’m a man of rhetorics and I’m somewhat a pro in my business as well (at least for my young age ;), I just ..decide willingly to play in the minor league.. like it better here ;) and let the players that are really up to it handle the big ones..
though I gotta admit it sometimes limits my horizon to the level of the above text..
anyway. cheers!
PS: No sewing tips whatsoever. I’m shocked. ;)
@Stefan: I hope I didn’t come across as too snooty. Your reply makes a lot of good points. It’s great that so many people are interested in cryptography, and there’s certainly nothing wrong with stirring interest in the topic. Every programmer could use more knowledge about security, for sure.
Also, some security tends to be better than no security, and the simple salted-hash-of-password scheme linked to may work for a small site with a limited threat/risk. Just know that if a project starts to get big (and you attract more sophisticated hackers), all crypto-related code needs to be replaced with industry-standard stuff at some point.
Well then..
Guess I should clarify one thing: This – as well as the follow-ups – is NOT going to be an expertise for IT-specialists in their 8th semester. This is for interested “normal” people that want to take a closer look, and therefore will stay this – hopefully balanced – way.
@Martin Cochran
– whoops! Lost in translation… corrected.
– very debatable. I see your point though. Please remember english is not my native tongue.
hash-functions are nonetheless cryptographic functions. Also, have you counted how often I used “hash” in this short text? Try imagine how much more there would be without “ciphertext” :p
– of course they are, though the ones I mentioned are the ones the average user has immediate contact with. Many have seen the MD5 hash below a download link, and everyone has a hash-stored password somewhere. I have to somewhat limit this to a readable length.
– ..well then. the birthday paradox DOES apply to finding collisions and therefore “collision attacks”, reducing the average number of operations from 2^n to 2^(n/2). the second preimage attack differs from that one, that I try not to find any pair, but a certain second “message” to a given first message m1.
– Since Cryptography is a living science and not a rather-dead-one-for-rather-dead-people, one’s opinions may differ. MD5 is broken, alright, but there are worse things. Like DES.
Plus, I never said anything about developing a _cryptographic_ piece of software, but nigh every software today has a user management. I don’t like plaintext-stored passwords, and I don’t like a hash or cipher used for them a five-year-old can crack. So it is rather important that “normal” developers without specialization on cryptos also get a hang on this. And as I already mentioned, SHA-512 would be the way to go.
– Like most of the things I write, this is NOT a step-for-step-tutorial and shouldn’t be regarded as a reason to stop thinking. It should give a very brief view of the thematics of hashes. I have yet to implement the salted hashes, and if one ever gets the feeling the linked tutorial is weak, he may post or even write a better one. Please do remember, these are all just ideas, to get people interested in stuff like that, for it is to become a point of major interest.
As for the problems you mentioned, yeah, when thinking about I find two without even considering wikipedia. So would others.
Thanks for the corrections and feedback though.
@Bilal: Thank you very much
@Daniel: I hope so! :) I think it’s a bit too long, gonna have to shorten things a bit in the future.
Let me please say it again: This is NOT the wisdom of galaxies. I do NOT claim to be an expert on this kind of things (as always). I know more than some people though, and I wanna share, since I have major interest that more people get used to this stuff, get interested, and eventually get more and more interested in using it to protect themselves and others as well.
I write ideas, not dogmata, in a way that’s hopefully pleasing readers. Everyone could, at any hour, hop over to wikipedia and look for “those hash functions I wanted to know about for half my adult life”, but few do. Plus, it’s much easier to understand when presented in such a matter.
Thanks.
And: Delete All Cookies! ;)
Hi Stefan!
This promises to be so cool! I love the fact that it’s in depth as well, I don’t even have time to read it all, fantastic! I’m bookmarking and coming back :)
Wow. This is a great explanation of hashing in general, although maybe not the differences between hashing algorithms.
I am currently writing my dissertation on cryptographic hash functions.
A few picky clarifications:
– MD5 is a deterministic function, not non-deterministic.
– Use of the words ‘encryption’ and ‘cipher’ with regard to hash functions doesn’t really make sense. Similarly, ‘plaintext’ and ‘ciphertext’ aren’t really applicable. Probably clearer to use ‘string’ or ‘message’ and ‘hash.’
– Hash functions are used for many, many more applications than storing passwords and file integrity checks. SSL, IPSec, TLS, and every other crypto library out there uses them for many, many different crypto applications (digital signatures, MACs, authentication schemes, commitment protocols, etc).
– The paragraph after you mention the birthday paradox is actually a second preimage attack (the second kind you mention) and the birthday paradox does not apply.
– MD5 is considered *very* broken. SHA-1 is less so, but if developing a new application (and you shouldn’t be developing a crypto application if you’re not a cryptographer, but that’s another story), use SHA-256, SHA-384, or SHA-512.
– Adding a salt to the password hash is a useful technique, as you mention. However, the article to which you link has a number of problems as well, which I do not have space to go into here. Poke around on wikipedia and see if you can’t find out for yourself what the problems are.