Learnings from the SHAttered Security Incident
Researchers at the Cryptology Group at Centrum Wiskunde & Informatica (CWI), along with those at Google released two PDF files on February 23, 2017. Both files contained details of the Secure Hash Algorithm (SHA). Figure 1 shows the content of those files and the result of the SHA-1 validation.
Figure 1 Reveal of SHAttered
From Figure 1, you can conclude that even though the two PDF files are different, they have the same SHA-1 checksum value.
This simple fact shocked the security industry as it implies the success of the world’s first authentic and open SHA-1 collision test.
What is SHA-1?
In short, SHA-1 is a popular hash algorithm that supports an infinite length of input data and provides a fixed length of output.
Figure 2 Hash schematic diagram
As shown in Figure 2, the output of the hash function has no special meaning.
A well-designed hash function needs to (whenever possible) meet all the following conditions:
●Easily produce the output after receiving the input
●It should be extremely difficult to reverse-calculate the input from the output, to the point that we can consider this calculation irreversible
●It cannot maintain the output by changing the input (even by making a slight change)
●You cannot calculate the same output by providing two different inputs
Normally, one can use hash output values (also known as hash values or data extracts) as data fingerprints, which is essential for the field of cryptography.
SHA is a family of hash functions developed by the National Institute of Standards and Technology (NIST) for the U.S. Federal Information Processing Standard (FIPS).
Figure 3 SHA family
For this collision test, the algorithm in question turns out to be the SHA-1 hash algorithm, which is still very popular. The output of the algorithm is 160 bits and is represented by 40 hexadecimal numbers, as shown in Figure 1. In fact, the scenario which demonstrated repetitive SHA-1 checksum values escalated into something big.
Effects of SHAttered
Though Document  mentioned the theoretical collision with a complexity less than 269 as early as 2005, it only remained a theoretical study without any practical demonstration, even though earlier tests successfully optimized the complexity to 257.5 in Document  written in 2013. To settle the argument about the study, there was a need for evidence. Thus, the document titled “The first collision for full SHA-1” created the first collision instance.
Based on the study result of , the instance used the method named “identical-prefix collision attack”:
Namely, the prefixes (P) of the two messages are the same, and they are used to find two data pairs.
Meanwhile, make the SHA-1 outputs of both complete messages the same while their suffixes (S) can be of any value. Finding a matching data pair violates the requirement “You cannot calculate the same output by providing two different inputs,” which means the SHA-1 algorithm is defective.
Even though it is difficult to realize the collision, researchers eventually implemented the actual collision attack with a complexity of 263.1 with the help of GPU technologies. You may think that the fact shown in Figure 1 is just a coincidence (in fact, the occurrence probability of this coincidence is almost 0), then another instance provided in the thesis resolves your doubt, as shown in Figure 4.
Figure 4 SHA-1 collision instance
This actual attack uses JPEG images as an example, so the two images in the PDF files are different, and this makes the instance visually persuasive. By convention, the details of this collision attack (including technical details and source code) will be open to the public upon meeting certain conditions.
Git is a content-addressable file system, i.e., the storage of data as key-value pairs inside Git, and retrieval is the use of a key to search for desired content. In this way, any content submitted to Git generates a unique key through the hash algorithm, and one can use this key to retrieve the stored content uniquely. Coincidentally, Git uses SHA-1 as the hash algorithm.
The proof process is as follows.
By using a file as the example, Git uses the following method to calculate the hash value:
sha1(‘blob ‘ + filesize + ‘0’ + filedata)
Figure 5 Hash operations in Git
As shown in Figure 5, the three red frames represent three hash operations.
The first frame calculates a hash value by using the SHA-1 algorithm provided by OpenSSL.
The second frame calculates another hash value using the hash-object method provided by Git.
The third frame creates a repository and then checks it against its hash value after being committed.
The calculation results of these three operations are the same, which means Git uses the SHA-1 algorithm as its hash algorithm.
In fact, Git does not differentiate based on the names of processed files or objects but distinguishes them by their hash values. For Git, the hash value of an object is the unique ID of the object. However, this is false if it is possible to forge the ID.
Now, let us understand how Git resolves collided hash values.
Are you expecting another screenshot of a Git exploit? Well, not at this moment. In fact, the calculation workload of a collision is far higher than a PC can afford, while relevant technical details remain concealed, so the real-world collision instance still needs convincing proof. Also, Git does not directly calculate the hash value of the file, so the collision example shown in Figure 1 does not impact the operation of Git.
Alternatively, you can try to create a collision instance manually by modifying the implementation process in Git. By modifying the source code, Document  builds a simplified 4-bit SHA-1 version to explore collision.
Based on the test result, Git does not report any errors in multiple common scenarios, but the repository suffers from different types of damages.
A simple solution to this problem is to report the error and prompt users about the error. Git cannot operate correctly in this condition, but further damage is avoidable.
Compared with Git, it seems that HTTPS can better solve the same problem by using certificates.
In fact, Dr. Wang Xiaoyun mentioned the insecurity of SHA-1 as early as 2005 . In recent years, companies also began discarding SHA-1, as the following examples attest to:
•For SSL certificates, Windows terminated the support for SHA-1 certificates since January 1, 2017.
•For code signing certificates, Windows refused to accept SHA-1 signed codes and SHA-1 certificates without timestamps since January 1, 2016.
•The Google Chrome browser has gradually terminated support for SHA-1 certificates, while the latest Chrome version does not support these certificates at all.
•Mozilla does not trust any SHA-1 certificates since January 1, 2017.
As you can see from these cases, it is just a matter of time until the termination of support for SHA-1. The blockbuster news of SHAttered may accelerate this process.
For large companies, replacing old certificates is easy since it does not involve client distribution. However, the real challenge lies on the client end: for application development engineers, the difficulty is that legacy clients do not support certain new features. On the other hand, in the security industry, clients support outdated features.
For this reason, replacing certificates does not resolve all related problems for large companies. Using HTTPS as an example, hackers can forge a certificate with the SHA-1 signature as long as the browser still supports SHA-1, even though certificates with higher security are already in use. This is because the browser directly trusts the received false SHA-1 certificate without knowing about the existence of updated certificates. To resolve this problem completely, the browser must never support SHA-1 anymore.
(The following uses Windows as the example. For other products, the situation can be worse.)
Fortunately, we are living in a time where IT technologies evolve rapidly, and even Windows 10 has iterated multiple builds. Windows 7 is also robust enough despite being an old product that came out eight years ago.
However, this is not true for Windows XP. Since Windows XP SP3, the subsequent versions support SHA-2 and discard SHA-1. But the question remains: how many Windows XP users are there in China? Several million? Well, those users need not worry about the SP number if they fail to choose to upgrade to a newer OS version. Only time can tell if users will completely discard Windows XP or not.
Historical Progress and Improvement Suggestions
Furthermore, this incident also reveals the price for collision attacks.
Compared with MD5 whose price is 30 seconds on the mobile phone, the price for SHA-1 is 110 GPUs for one year. Though this price is affordable for large companies, it is too high for middle- and small-sized enterprises and personal users. Also, do not forget that Moore’s law is still applicable. So we must consider possessing computing resources with better efficiency and cost-effectiveness.
Figure 6 Comparison of prices
In other words, as the security industry advances, it is impractical to stick to a single algorithm. In fact, each cryptographic algorithm has its lifecycle, and constant iteration optimizes and improves it, and has led to today’s robust cryptography.
For example, Document  provides the lifecycle of hash.
Figure 7 Lifecycle of hash
In the above figure, red represents an outdated hash, yellow indicates a slightly defective one, while green indicates a healthy hash.
This puts an end to SHA-1 while SHA-2 (such as SHA-256) and SHA-3 (Keccak is the final winner) are still alive.
So, the improvement logic is simple: use SHA-2 or SHA-3 for new projects. SHA-2 is quite mature and allows direct use. SHA-3 also has various efficient implementations, and they are all secure hash algorithms (Keccak is highly recommended.)
For projects under development, it is mandatory to replace the SHA algorithm to prevent major problems arising in the future.
For existing projects such as Git, many iterations are likely as it takes a lot of work to update. As mentioned in , replacing hard-coded unsigned char takes significant effort. (The outputs of SHA-256 and SHA-3 are greater than 160 bits, so 20 bytes of space is insufficient.)
Additionally, you do not have to replace the SHA algorithm for security-insensitive applications. Theoretically, MD5 should not be in use for over a decade. But practically, it is still used in many applications. That is because software engineering can invoke a chain reaction and it is not always worth the cost of introducing other unexpected bugs to replace an algorithm.
Developers recommend common users to replace the SHA algorithm whenever possible. Purchasing up-to-date consumer electronic products boost security. Even though you cannot upgrade hardware, updating software on time eliminates numerous security concerns (though it may introduce some other minor problems).
In this article, we discussed the Security Hash Algorithm, with respect to the revelation that researchers have successfully carried out an SHA-1 collision test, which means that they achieved the same output for two different inputs. The result violates the inherent conditions for an SHA-1, thereby its reference to a collision. Next, we discussed running the hash operations on a Git, and an HTTPS instance. Then we looked at the lifecycle of the hash function, along with some best practices for users to deal with the SHA algorithm.
 How would Git handle a SHA-1 collision on a blob? http://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob
 Wang X, Yin Y L, Yu H. Finding collisions in the full SHA-1[C]//Annual International Cryptology Conference. Springer Berlin Heidelberg, 2005: 17–36.
 Lifetimes of cryptographic hash functions http://valerieaurora.org/hash.html
 Why doesn’t Git use more modern SHA? http://stackoverflow.com/questions/28159071/why-doesnt-git-use-more-modern-sha
 Stevens M. New collision attacks on SHA-1 based on optimal joint local-collision analysis[C]//Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer Berlin Heidelberg, 2013: 245–261.