c++11, ram, file-format

Recognizing file formats from binary (C++)


I am a beginner C++ programmer.

I wrote a simple program that creates a char array (the size is the user's choice) and prints whatever data was previously in that memory. Often I can find something that makes sense (I always find the alphabet?), but most of it is just strange characters. I made it write the output to a binary file.

However, how do I:

  1. Recognize the different chunks of data

  2. Recognize what chunks are what file format (i.e. what chunk is an image, audio, text, etc.)

My Code:

// main.cpp
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <fstream>

using namespace std;

int main() {
    int memory_size  = 4000;
    string data = "";
    bool inFile = false;

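    cout << "How many bytes do you want to retrieve? (-1 to exit)\n";
    cin >> memory_size;
    // Honor the "-1 to exit" prompt and reject sizes that cannot be allocated.
    if (!cin || memory_size <= 0)
        return 0;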

    string y_n;
    cout << "Would you like to write output into a file? (Y/N)\n";
    cin >> y_n;
    if (y_n.compare("Y") == 0 || y_n.compare("y") == 0)
        inFile = true;
    else
        inFile = false;

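    // Note: a runtime-sized array is a compiler extension, not standard C++,
    // and reading its uninitialized contents is undefined behavior.
    char memory_chunk[memory_size];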


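    for (int i = 0; i < memory_size; i++) {
        cout << memory_chunk[i];
        // Append the character itself; `memory_chunk[i] + ""` did pointer
        // arithmetic on the string literal instead of concatenating.
        data += memory_chunk[i];
    }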



    if (inFile) {
        ofstream file("output.binary", ios::out | ios::binary);
        file.write(memory_chunk, sizeof memory_chunk);  
        file.close();
    }   

    cin >> data;

    return 0;
}

Example of the retrieved data (this is a lot smaller than what it can usually retrieve):

   dû(       L)         àýtú( ¯1Œw ÐýDú( @ú( Lú(     dû( ¼û(        L)         º
                        ‰v8û(    7Œw           û(  ú( 0ý(     k7Œwdû( @                                                   5 À        ü( ¨›w    ó˜wÞ¯  › Ø›     0ý(     Hû(     À ›     `›   À  Dû( LŒw  › @›     `›       › lû( ÷Œw  ›  › ˜›   › û( 3YŒw  ›     ~Œw ›            €›   › à›     Dü(      › €› Dü( ßWŒwXŒwDÞ¯ ›   ›        €› ˆ› À › ¦›   ›  !› :   À › `›      À  ü(      › ˆ› V   €› 
          Œw   ˆ›           ¬û(     Äÿ( ‘Q‡w€ôçþÿÿÿXŒwµTŒw      ‚› xü(       È6‹w  ›        À×F           fÍñt"ãŠvEA @ÒF    ¸ü( 
                      þÿÿÿ@ÒF Ã~“v           Øü( O¯‰vØÞ¯øü( œ›‰v  ›                   ˆý( ‡ÌE    @ÒF 
      8|“v ý(   ‰v@M“v,ý( wî‰v   hý( ¬_‘v8|“v˜_‘vݧY‘   ÀwF    
   <ý(     Äÿ( e‹vàçþÿÿÿ˜_‘v"A 
   8|“v@ÒF        ÀwF    ïÀE           ÕF        ”› ÓºA ”› ÕF    lF €F  F 2    àýàý( ð      @ 

Solution

  • Some file formats start with magic numbers (fixed byte sequences at the start of the file) that help identify them, though not every format has one. Wikipedia lists some here: http://en.wikipedia.org/wiki/List_of_file_signatures. The Unix command 'file' tries to guess a file's format based on magic numbers in its data. Its source code is most likely available somewhere (in the Apple Darwin sources if nowhere else).
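
As a rough illustration of the magic-number idea, here is a minimal sketch that compares a buffer against a handful of signatures taken from the Wikipedia list. The names `Signature`, `kSignatures`, `identify` and `scan` are hypothetical helpers made up for this example, and the table is nowhere near complete. A match only suggests that a file of that type may start at that offset; random bytes can match by accident, and leftover memory is not guaranteed to contain whole files at all.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// One known signature: the leading bytes and a human-readable label.
struct Signature {
    const char* bytes;    // raw magic bytes
    std::size_t length;   // number of bytes to compare
    const char* name;     // label reported on a match
};

// A few well-known signatures (assumed subset; extend as needed).
static const Signature kSignatures[] = {
    {"\x89PNG\r\n\x1a\n", 8, "PNG image"},
    {"\xFF\xD8\xFF",      3, "JPEG image"},
    {"GIF87a",            6, "GIF image"},
    {"GIF89a",            6, "GIF image"},
    {"%PDF-",             5, "PDF document"},
    {"PK\x03\x04",        4, "ZIP archive"},
};

// Returns the name of the first signature that matches at data[offset],
// or "unknown" if none does.
std::string identify(const std::vector<unsigned char>& data,
                     std::size_t offset = 0) {
    for (const Signature& sig : kSignatures) {
        if (offset <= data.size() && sig.length <= data.size() - offset &&
            std::memcmp(data.data() + offset, sig.bytes, sig.length) == 0) {
            return sig.name;
        }
    }
    return "unknown";
}

// Scans the whole buffer and reports every offset where a signature matches.
std::vector<std::pair<std::size_t, std::string>>
scan(const std::vector<unsigned char>& data) {
    std::vector<std::pair<std::size_t, std::string>> hits;
    for (std::size_t offset = 0; offset < data.size(); ++offset) {
        std::string name = identify(data, offset);
        if (name != "unknown") {
            hits.emplace_back(offset, name);
        }
    }
    return hits;
}
```

To use it on the dump from the question, one could read output.binary into a `std::vector<unsigned char>` and pass it to `scan`. Finding where a chunk ends is a separate problem that depends on each format's own structure.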