Building a Merkle tree for a text file in C: a new root hash is generated every time the process is repeated


I am trying to compute the root hash for a text file by first calculating the SHA1 hashes of 64-byte lines, then concatenating those hashes in pairs and hashing each concatenation again. My overall process is something like this:

Read the file in 64-byte lines > Hash each line and write to a file[hashes.txt] > concatenate hashes two at a time and write to another file[temp_hashes.txt] > Hash the temporary, concatenated hashes and write back to [hashes.txt].

I repeat this process until [hashes.txt] contains a single line. Finally, I write that line to my permanent record [secure.txt].

I am using a SHA-1 library. I've used two text files for testing, let's call them [one.txt] and [two.txt]. Both contain excerpts from lorem ipsum. Everything seems fine up to the first 64-byte line hashing step, but as soon as I combine the hashes, the root hash comes out different every time I run the code. I have tried emptying both [hashes.txt] and [temp_hashes.txt] and re-running.

This is my first hash step.

char buf[64];

unsigned char all_hashes[TABLE_SIZE][21];
unsigned char md[SHA_DIGEST_LENGTH];

while (fgets(buf, sizeof(buf), fptr) != NULL){
    get_sha1_hash(buf, sizeof(buf), md);
    for(int i = 0; i < SHA_DIGEST_LENGTH; i++)
        fprintf(outfile, "%02x", md[i]);
    fprintf(outfile, "\n");
}

The combining step looks something like this:

char * temp = malloc(sizeof(char)*100);
char * line = malloc(sizeof(char)*100);
int k = 0;

while (fgets(line, 100, file) != NULL) {
    line[strlen(line)-1] = '\0';
    if (k%2 == 0) {
        fprintf(outfile, "%s", line);
    }
    else {
        fprintf(outfile, "%s\n", line);
    }
    k++;
}

And this is the re-hash step:

char line[1024]; // I guess the same as char line[100]
int i = 0;
unsigned char md[SHA_DIGEST_LENGTH];
while(fgets(line, sizeof(line), infile) != NULL) {
    get_sha1_hash(line, sizeof(line), md);
    for(int i = 0; i<SHA_DIGEST_LENGTH; i++)
        fprintf(outfile, "%02x", md[i]);
    fprintf(outfile, "%s", "\n");
}

Finally, everything comes together like this:

while(calculate_length_of_file("hashes.txt") > 1) {
    combine_hashes_by_two();
    hash_file_line_by_line();
}

I am just starting out with C and have made trivial memory mistakes before. I think it must be something simple here too, but I just can't seem to crack it.

Any and all help will be greatly appreciated, thank you!


Solution

  • The problem is:

    Here, you read a line into the buffer buf[64]:

    while (fgets(buf, sizeof(buf), fptr) != NULL){
    

    Here, you hash the complete buffer:

        get_sha1_hash(buf, sizeof(buf), md);
    

    but fgets() reads at most sizeof(buf) - 1 characters and stops after the next newline, so it usually fills only part of the buffer.

    So, probably you meant to hash:

        get_sha1_hash(buf, strlen(buf), md);
    

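    Otherwise, you also hash whatever follows the string in buf: uninitialized bytes on the first read and leftovers from earlier lines after that, which leads to (pseudo-)random results. The re-hash step has the same problem with get_sha1_hash(line, sizeof(line), md).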
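
To see the effect in isolation, here is a minimal sketch. It is not the asker's code: toy_hash() is a stand-in (FNV-1a) for get_sha1_hash() so the example needs no crypto library, and fill_like_fgets() and hash_line() are hypothetical helpers that only imitate what a buffer looks like after fgets() returns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for get_sha1_hash() (assumption): FNV-1a over exactly `len` bytes.
   Any hash shows the effect; this one just avoids a library dependency. */
static uint32_t toy_hash(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t h = 2166136261u;      /* FNV-1a 32-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;            /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Hypothetical helper: imitates a buffer after fgets(). The text and its
   terminating '\0' sit at the start; every other byte is `filler`, standing
   in for uninitialized or stale data. */
static void fill_like_fgets(char *buf, size_t size, const char *text, char filler)
{
    assert(strlen(text) < size);
    memset(buf, filler, size);
    memcpy(buf, text, strlen(text) + 1);
}

/* Hash only the characters that were actually read. */
static uint32_t hash_line(const char *buf)
{
    return toy_hash(buf, strlen(buf));
}
```

If two 64-byte buffers hold the same line but differ in a byte after the terminator, hash_line() returns the same value for both, while toy_hash(buf, sizeof buf) does not. The same reasoning should carry over to get_sha1_hash(buf, strlen(buf), md).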