How to strip out the non-human-readable character at the start of each line using Xcode


I am trying to set up Xcode to get rid of non-human-readable characters in legacy text files recovered from 8” floppy disks created in 1986. The files were created under QDOS, a proprietary disk operating system, using a text-based Music Composition Language (MCL) application.

I aim to write a C program that reads the ASCII file character by character, filters out the non-printable characters and saves the result to a destination file, making it possible to view the file contents in exactly the format a composer would have seen in 1986.

When Xcode reads the legacy text file, the unwanted character appears as the first human-readable character of every line except the first.

    !B=24:Af
    *           BAR 1
    G2,6
     *           BAR 2 & 3
    !G2,1/4:Bf2,1/4:C2,1/4:Ef2,1/4:F3,1/4:G3,35/4:D3:A4
    "*           BAR 4 
    #Bf4:G4,2:D3:A4:Bf4
    $*           BAR 5
    %D4,2:C4,3:F5
    &*           BAR 6
    'D4:Bf4:A4,2:G4:D3:?
    (*           BAR 7 &

A hex dump of the above text file is shown below. The two ASCII bytes I identified are $0D (Carriage Return) followed by $1C (File Separator); note that in the dump as pasted here, the byte before each $1C is $0A (Line Feed). These two bytes, plus the byte that immediately follows them, are the characters I am trying to remove.

    0000: 1C 1D 21 42 3D 32 34 3A 41 66 0A 1C 1E 2A 20 20   ¿¿!B=24:Af¬¿¿*  
    0010: 20 20 20 20 20 20 20 20 20 42 41 52 20 31 0A 1C            BAR 1¬¿
    0020: 1F 47 32 2C 36 0A 1C 20 2A 20 20 20 20 20 20 20   ¿G2,6¬¿ *       
    0030: 20 20 20 20 42 41 52 20 32 20 26 20 33 0A 1C 21       BAR 2 & 3¬¿!
    0040: 47 32 2C 31 2F 34 3A 42 66 32 2C 31 2F 34 3A 43   G2,1/4:Bf2,1/4:C
    0050: 32 2C 31 2F 34 3A 45 66 32 2C 31 2F 34 3A 46 33   2,1/4:Ef2,1/4:F3
    0060: 2C 31 2F 34 3A 47 33 2C 33 35 2F 34 3A 44 33 3A   ,1/4:G3,35/4:D3:
    0070: 41 34 0A 1C 22 2A 20 20 20 20 20 20 20 20 20 20   A4¬¿"*          
    0080: 20 42 41 52 20 34 20 0A 1C 23 42 66 34 3A 47 34    BAR 4 ¬¿#Bf4:G4
    0090: 2C 32 3A 44 33 3A 41 34 3A 42 66 34 0A 1C 24 2A   ,2:D3:A4:Bf4¬¿$*
    00A0: 20 20 20 20 20 20 20 20 20 20 20 42 41 52 20 35              BAR 5
    00B0: 0A 1C 25 44 34 2C 32 3A 43 34 2C 33 3A 46 35 0A   ¬¿%D4,2:C4,3:F5¬
    00C0: 1C 26 2A 20 20 20 20 20 20 20 20 20 20 20 42 41   ¿&*           BA
    00D0: 52 20 36 0A 1C 27 44 34 3A 42 66 34 3A 41 34 2C   R 6¬¿'D4:Bf4:A4,
    00E0: 32 3A 47 34 3A 44 33 3A 3F 0A 1C 28 2A 20 20 20   2:G4:D3:?¬¿(*   
    00F0: 20 20 20 20 20 20 20 20 42 41 52 20 37 20 26 20           BAR 7 & 
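Reading the dump byte by byte, each line appears to start with 1C followed by a single byte that rises by one per line: 1D, 1E, 1F, 20, 21, 22 and so on up to 28. That looks like a line counter, and once it reaches 20 and above it falls in the printable ASCII range, which would explain the stray space, `!`, `"`, `#` characters in the listing. The sketch below is one way to check that pattern on a buffer already loaded in memory; `collect_prefix_bytes` is a hypothetical helper name, and the "counter" interpretation is an assumption drawn only from this one dump.

```c
#include <stddef.h>

#define FS 0x1C  /* ASCII File Separator */

/* Hypothetical helper: scan `n` bytes of `buf` and store the byte that
   follows each FS marker into `prefixes` (at most `max` entries are stored).
   Returns how many FS markers with a following byte were found. */
size_t collect_prefix_bytes(const unsigned char *buf, size_t n,
                            unsigned char *prefixes, size_t max)
{
    size_t found = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        if (buf[i] == FS) {
            if (found < max)
                prefixes[found] = buf[i + 1];
            found++;
            i++;  /* step over the prefix byte so it is not tested as a marker */
        }
    }
    return found;
}
```

Run over the 256 bytes shown above, this should report twelve markers with the prefix bytes 1D through 28 in order. If the values do not rise by one per line in another file, the assumption about the layout does not hold for that file.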

I created an Xcode Command Line Tool project. When I select Type: Plain Text and Text Encoding: Unicode (UTF-8) in the Xcode Inspectors window, the same single printable character is visible. I chose those settings because my macOS locale is en_AU.UTF-8.

The C code that follows makes an identical copy of the text file without identifying individual characters. Essentially it will read old file contents and write a new file successfully. The hex dump for the output file is identical to the hex dump above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, const char * argv[]) {

    char filename[] = {"~/Desktop/MCLRead/bell1.ss"} ;

    printf("MCLRead\n\t%s\n", filename);

    FILE* fin = fopen(filename, "r");
    if (!fin) { perror("input error"); return 0; }

    FILE* fout = fopen("output.txt", "w");
    if (!fout) { perror("fout error"); return 0; }

    fseek(fin, 0, SEEK_END); // go to the end of file
    size_t filesize = ftell(fin); // get file size
    fseek(fin, 0, SEEK_SET); // go back to the beginning

    //allocate enough memory
    char* buffer = malloc(filesize * sizeof(char));

    //read one character at a time (or `fread` the whole file)

    size_t i = 0;
    while (1)
    {
        int c = fgetc(fin);
        if (c == EOF) break;

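    //save to buffer
        buffer[i++] = (char)c;
    }

    //write the buffer out unchanged, then clean up
    fwrite(buffer, 1, i, fout);
    free(buffer);
    fclose(fin);
    fclose(fout);
    return 0;
    }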

However, when I compile, build and run this in Xcode, the characters are unrecognisable regardless of the Type or Text Encoding settings in the Xcode Inspectors window. The following error message appears in the Console window:

    error: No such file or directory
    Program ended with exit code: 0
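The "No such file or directory" message is consistent with the path used in the code: `fopen` hands the string to the operating system unchanged, and a leading `~` is expanded by the shell, not by the C library. A minimal sketch of one workaround follows, assuming the `HOME` environment variable is set; `expand_home` is a hypothetical helper, not part of any library.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: if `path` starts with "~/", write it into `out` with
   "~" replaced by the value of $HOME; otherwise copy it unchanged.
   Returns 0 on success, -1 if HOME is unset or `out` is too small. */
int expand_home(const char *path, char *out, size_t outsize)
{
    int len;
    if (strncmp(path, "~/", 2) == 0) {
        const char *home = getenv("HOME");
        if (home == NULL)
            return -1;
        len = snprintf(out, outsize, "%s%s", home, path + 1);
    } else {
        len = snprintf(out, outsize, "%s", path);
    }
    /* snprintf reports the length it wanted; treat truncation as failure */
    return (len < 0 || (size_t)len >= outsize) ? -1 : 0;
}
```

With this, `expand_home("~/Desktop/MCLRead/bell1.ss", path, sizeof path)` followed by `fopen(path, "rb")` would open the file under the home directory. Writing the absolute path directly in the source would work just as well.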

When I run the same code in the Terminal window, it generates an output text file, but the characters are unrecognisable:

    Desktop % gcc main.c
    Desktop % ./a.out output.txt
    Desktop % cat output.txt                                           

cat prints a string of 128 ? characters in the Terminal, even though the file contains more than a thousand characters in total.

Can someone give me any clues for making this text file readable in a format that allows the non-human-readable characters to be stripped from the start of each line?

Please note, I am not asking for help to write the C code but rather what Text Format will make the unwanted 8-bit characters readable so I can remove them (a slight refinement on the question I asked initially). Any further help would be most appreciated. Thanks in advance.
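One way to make the unwanted bytes "readable" without depending on any text encoding setting is to render every byte outside printable ASCII as a hex escape, so a marker such as 1C shows up as `<1C>` in the output. The following is a sketch of that idea only; `format_visible` is a hypothetical helper name and the `<XX>` notation is an arbitrary choice.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: render `n` bytes of `in` into `out` as text.
   Printable ASCII (0x20..0x7E) and line feeds are copied as they are;
   every other byte is written as <XX> in uppercase hex.
   Output is always NUL-terminated; conversion stops early if `out` fills up.
   Returns the number of characters written, not counting the NUL. */
size_t format_visible(const unsigned char *in, size_t n,
                      char *out, size_t outsize)
{
    size_t w = 0;
    if (outsize == 0)
        return 0;
    for (size_t r = 0; r < n; r++) {
        unsigned char c = in[r];
        if ((c >= 0x20 && c <= 0x7E) || c == '\n') {
            if (w + 1 >= outsize)   /* need room for the char and the NUL */
                break;
            out[w++] = (char)c;
        } else {
            if (w + 4 >= outsize)   /* need room for "<XX>" and the NUL */
                break;
            w += (size_t)snprintf(out + w, outsize - w, "<%02X>", c);
        }
    }
    out[w] = '\0';
    return w;
}
```

Applied to the first bytes of the dump (1C 1D 21 42 ...), this would produce text beginning `<1C><1D>!B=24:Af`, which makes the two-byte prefix visible on every line regardless of the editor's encoding setting.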


Note

This post has been revised in response to comments.

The hex dump is now included as text rather than as an image. This is the most reliable way to share the file contents with anyone who wants to test what I have done.



Solution

  • The problem can be solved easily by reading each byte as a numeric value into an int, not a char, and comparing it with the control codes. The bytes are matched against hexadecimal constants, held as integers in a buffer and written back out as text.

    Note. There is no EOF character. MCL used the word 'END' at the end of the file. Because it has been salvaged from a floppy disk image, the file sometimes has a trailing string of hex E5 characters written on the floppy disk when it was formatted. At other times where the format track is already overwritten the file has a trailing string of zeros.
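The trailing padding described in this note can also be removed after the file has been read, by walking back from the end of the buffer. This is a sketch under the assumption that the padding consists only of E5 and 00 bytes and that neither value occurs as the last byte of real data; `trimmed_length` is a hypothetical helper.

```c
#include <stddef.h>

#define FD_FORMAT 0xE5  /* filler byte left by floppy formatting */

/* Hypothetical helper: return the length of `buf` once any trailing run of
   0xE5 or 0x00 bytes has been ignored. The buffer itself is not modified. */
size_t trimmed_length(const unsigned char *buf, size_t n)
{
    while (n > 0 && (buf[n - 1] == FD_FORMAT || buf[n - 1] == 0x00))
        n--;
    return n;
}
```

Trimming from the end leaves the body of the file untouched, so a stray low control code in the middle of the data would not cut the output short. A mixed tail of E5 and 00 bytes is removed as well, which may or may not be wanted.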

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #define CR          0x0D                        // ASCII Carriage Return
    #define FS          0x1C                        // ASCII File Separator
    #define FD_FORMAT   0xE5                        // floppy disk format track
    
    int main(int argc, const char * argv[]) 
    {   
        char fname[256];
        printf("\n Enter MCL file name : ");
        if (scanf("%255s", fname) != 1) return 1;   // bounded read, avoids overflow
        printf("\n\t%s\n", fname);
    
        int a = 0;                                  // init CR holder
        int b = a;                                  // init File Separator holder
        FILE* fin = fopen(fname, "rb");             // open input as binary
        if (!fin)
        { perror("input error"); return 1;
        }
        FILE* fout = fopen("output.txt", "wb");     // open output as binary
        if (!fout)
        { perror("fout error"); fclose(fin); return 1;
        }
        fseek(fin, 0, SEEK_END);                    // look for end of file
        size_t fsize = ftell(fin);                  // get file size
        fseek(fin, 0, SEEK_SET);                    // go back to the start                                             
        int* buffer = malloc(fsize * sizeof(int));  // allocate buffer                          
        size_t i = 0;
        while (1)
        {
            int c  = fgetc(fin);                    // read one byte at a time
            if (c  < CR)  break;                    // stop at EOF (-1) or any code below CR, including LF (0x0A)
            if (c == FD_FORMAT) break;              // stop at floppy format filler
            
            printf("\t%X", a);
            printf("\t%X", b);  
    
            if ((a != CR) && (b != FS))             // skip save if new line
            {
                printf("\t%0X\n", c);
                buffer[i++] = c;                    // save to buffer
            }
            a = b;
            b = c;
        }   
        size_t count = i;                           // number of values actually saved
        for (i = 0; i < count; i++)                 // write out only the saved values
            fputc(buffer[i], fout);
        free(buffer);
        fclose(fin);
        fclose(fout);
        return 0;
    }
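The reason for using int rather than char is that `fgetc` returns an int, so that `EOF` (a negative value) stays distinct from every possible byte value 0 to 255. A byte such as E5 stored in a signed char would no longer compare equal to 0xE5. The variant below is a sketch, not a replacement for the code above: it writes bytes as they are read instead of using a buffer, and it assumes every 1C is followed by exactly one prefix byte. Unlike the code above, it drops the 1C as well as the byte after it, keeps whatever line terminator the file uses, and treats the first E5 or 00 byte as the end of the data. `filter_stream` is a hypothetical name.

```c
#include <stdio.h>

#define FS        0x1C  /* ASCII File Separator */
#define FD_FORMAT 0xE5  /* filler byte left by floppy formatting */

/* Hypothetical helper: copy bytes from `in` to `out`, dropping each FS and
   the single byte after it, and stopping at EOF or at the first padding
   byte (0xE5 or 0x00). Returns the number of bytes written, or -1 on a
   write error. */
long filter_stream(FILE *in, FILE *out)
{
    long written = 0;
    int c;  /* int, not char: keeps EOF distinct from byte 0xFF */

    while ((c = fgetc(in)) != EOF) {
        if (c == FD_FORMAT || c == 0x00)  /* padding: treat as end of data */
            break;
        if (c == FS) {                    /* drop FS and the byte after it */
            if (fgetc(in) == EOF)
                break;
            continue;
        }
        if (fputc(c, out) == EOF)
            return -1;
        written++;
    }
    return written;
}
```

Both streams should be opened in binary mode ("rb" and "wb") so that no byte is translated on the way through.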