SHA collisions for Big files

Sensitivity of (SHA of long file) to early bits.

The SHA (Secure Hash) of a terabyte (TB) file is less sensitive to early bits than to late bits. Here is why and by how much.

After computing the SHA of a TB file you have pushed a 160 bit value thru the inner SHA code 17,000,000,000 times. The first 64 bytes of the file are supposed to influence the hash enough that it is infeasible to find two files that differ in the first 64 bytes and yet yield the same hash. This would be a collision. To the degree that the hash is not sensitive to the early bytes it is easier to find such a collision. Here is why not to worry.

SHA divides the file into 64 byte groups that feed into an iteration of 160 bit states thru the inner loop of SHA. Each iteration maps the set of 2¹⁶⁰ values into itself. The range of such a mapping is likely smaller that the domain. It shrinks each time. How much does it shrink after 17B iterations?

If f is a randomly chosen function (which we assume SHA to provide) of a large finite set S into itself and x is a random subset of S then #(f(x)) =1 − e^−#(x). Here #(x) is the fractional size of the subset x; #(x) = (number of elements in x)/(number of elements in S); 0 ≤ #(x) ≤ 1. What do you get when you iterate f 17B times? Asymptotically #(fⁿ(S)) = 2/n. For our TB file that means that the entropy provided by the first 160 bits is log(2¹⁶⁰*2/17,000,000,000) or about 161 − 34 = 127 bits of entropy.

Attacks involving finding collisions by modifying early bits in a TB file must include the cost of doing 17B SHA calculations in the inner loop. Such attacks pose no risks. The work factor increase from this inner loop hit, is proportional to the size of the file whereas the decrease from the range shrinkage gains by only the square root of the file size. To produce a collision you would need about 2⁶⁴ candidates for the first 64 bytes but for each of those you would need to compute 17B SHA iterations, a work factor of about 10¹⁹⁺¹⁰.

The Math

If x_n = #(fⁿ(S)) then x₀ = 1 and x_n+1 = 1 − e^−x_n. After n grows a bit this can be approximated by x_n+1 = x_n−x_n²/2. Or better x_n+1 − x_n = −x_n²/2. This recurrence relation is closely solved by x_n = 2/n. Computing x_n explicitly for a few thousand steps reveals that the formula is good even though the approximations are poor until n becomes large. For fixed k, x_n = k/n also fits the recurrence relation. The work estimate is not very sensitive to the value of k. The program below indicates that k=2.

If you trust computers more than my math (I do) you can read and run:

#include <stdio.h>
#include <math.h>
int main(){double x = 1;
for(int j=0; j<=10000; ++j) {
   if(j<10 || !(j%1000)) printf("j=%d x=%12.5e\n", j, x);
   x = (1.-exp(-x));}
return 0;}

which yields:

j=0 x= 1.00000e+00
j=1 x= 6.32121e-01
j=2 x= 4.68536e-01
j=3 x= 3.74082e-01
j=4 x= 3.12080e-01
j=5 x= 2.68077e-01
j=6 x= 2.35151e-01
j=7 x= 2.09548e-01
j=8 x= 1.89050e-01
j=9 x= 1.72255e-01
j=1000 x= 1.99183e-03
j=2000 x= 9.97838e-04
j=3000 x= 6.65675e-04
j=4000 x= 4.99430e-04
j=5000 x= 3.99629e-04
j=6000 x= 3.33073e-04
j=7000 x= 2.85521e-04
j=8000 x= 2.49850e-04
j=9000 x= 2.22103e-04
j=10000 x= 1.99903e-04