
C: storing a char's decimal value as an array index can't read/store 8-bit characters


I have a question that I'm hoping you can help me with.

I'm trying to read chars from a file that I will perform a frequency analysis on. I decided the easiest way to do this is to have an array with indices 0-255 and to increment the corresponding index (the decimal value of the char that was read) by one every time that char is read. The problem I have is that it seems only the 7-bit chars are counted. See the code below.

int frequency(FILE *freqfilep)
{    
    printf("frequency function called!\n");

    int start = 1;
    int *frqarray = calloc(256,sizeof(int));
    unsigned char tecken;

    FILE *fp;
    fp = fopen("freqfile.txt","r");

    if (fp == NULL) 
    {
        perror("Error in opening file");
        free(frqarray);
        return 1;   /* stop here: calling fgetc() on a NULL stream would crash */
    }
    do
    {
        tecken = fgetc(fp);

        if (feof(fp))
        {
            start = 0;
        }
        else
        {
            frqarray[(int)tecken] ++;
        }
    }
    while (start != 0);

    printf("a%d\n", frqarray[97]);
    printf("b%d\n", frqarray[98]);
    printf("c%d\n", frqarray[99]);
    printf("1%d\n", frqarray[49]);
    printf("2%d\n", frqarray[50]);
    printf("3%d\n", frqarray[51]);
    printf("å%d\n", frqarray[134]);
    printf("ä%d\n", frqarray[132]);
    printf("ö%d\n", frqarray[148]);

    fclose(fp);
    free(frqarray);   /* release the counters allocated with calloc */

    return 0;
}

The file I'm reading from contains the following chars:

aaa bbb ccc 111 222 333 ååå äää ööö

So the printf's in the bottom of my code should say:

a3
b3
c3
13
23
33
å3
ä3
ö3

But the result is

a3
b3
c3
13
23
33
å0
ä0
ö0

So I'm guessing that there is some issue with reading the 8-bit characters. I've looked around a bit on the forum and found some relatively similar posts where the answer was that I need to use a buffer, like this: fread(&buffer, 256, 1, file); but I'm not sure how to implement it.


Solution

  • Those characters are most likely not single byte characters with the high bit set, but multibyte characters.

    In UTF-8, these characters are encoded as the following two-byte sequences:

    • å: 0xc3 0xa5 (decimal 195 165)

    • ä: 0xc3 0xa4 (decimal 195 164)

    • ö: 0xc3 0xb6 (decimal 195 182)

    Add the following to your code:

    printf("195 %d\n", frqarray[195]);
    printf("165 %d\n", frqarray[165]);
    printf("164 %d\n", frqarray[164]);
    printf("182 %d\n", frqarray[182]);
    

    And you'll probably get this output:

    195 9
    165 3
    164 3
    182 3
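
The counts above can be reproduced without the file. The following is a minimal sketch, assuming the file is saved as UTF-8: `sample_utf8` is the question's text written out with explicit byte escapes, and `count_bytes` is a hypothetical helper, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* The question's text, assuming UTF-8 encoding:
   å = C3 A5, ä = C3 A4, ö = C3 B6 (two bytes each). */
const unsigned char sample_utf8[] =
    "aaa bbb ccc 111 222 333 "
    "\xc3\xa5\xc3\xa5\xc3\xa5 "
    "\xc3\xa4\xc3\xa4\xc3\xa4 "
    "\xc3\xb6\xc3\xb6\xc3\xb6";

/* Count how often each byte value (0-255) occurs in data[0..len-1].
   The caller is expected to zero the counts array beforehand. */
void count_bytes(const unsigned char *data, size_t len, unsigned counts[256])
{
    assert(data != NULL && counts != NULL);
    for (size_t i = 0; i < len; i++)
        counts[data[i]]++;   /* unsigned char index, always in 0..255 */
}
```

Calling `count_bytes(sample_utf8, sizeof sample_utf8 - 1, counts)` on a zeroed array should leave 9 in `counts[195]` (the shared lead byte 0xC3) and 3 each in `counts[165]`, `counts[164]` and `counts[182]`, while indices 134, 132 and 148 stay at 0.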
    

    EDIT:

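    If you need a frequency analysis of characters rather than bytes, use fgetwc to read in the characters instead. Call setlocale(LC_ALL, "") first so that the stream is decoded using the environment's encoding; without it the program runs in the "C" locale and multibyte input is not decoded. If you expect all characters to be in the Basic Multilingual Plane (Unicode code points U+0000 to U+FFFF), you can create an array of size 65536 and output that. If you're expecting characters beyond that range, you might want to use a different scheme, such as a hash table keyed by code point.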
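
A minimal sketch of the fgetwc approach is shown here, under several assumptions: the caller has already selected a UTF-8 locale with setlocale, the platform's wchar_t holds Unicode code points (true for glibc, not guaranteed by the C standard), and `count_code_points` and `BMP_SIZE` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

#define BMP_SIZE 65536   /* one counter per code point U+0000..U+FFFF */

/* Read the file at path as wide characters and count each code point in
   the Basic Multilingual Plane. Returns a calloc'ed array of BMP_SIZE
   counters (the caller frees it), or NULL if the file cannot be opened
   or memory runs out. Assumes setlocale() was called beforehand. */
unsigned *count_code_points(const char *path)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return NULL;

    unsigned *counts = calloc(BMP_SIZE, sizeof *counts);
    if (counts == NULL) {
        fclose(fp);
        return NULL;
    }

    wint_t wc;   /* wint_t, not wchar_t, so that WEOF stays distinguishable */
    while ((wc = fgetwc(fp)) != WEOF) {
        if ((unsigned long)wc < BMP_SIZE)
            counts[wc]++;   /* code points outside the BMP are skipped */
    }
    /* fgetwc also returns WEOF on an invalid byte sequence (errno is set
       to EILSEQ), so the loop stops at the first undecodable byte. */

    fclose(fp);
    return counts;
}
```

With a UTF-8 locale active and the question's file as input, `counts[0xE5]` (å), `counts[0xE4]` (ä) and `counts[0xF6]` (ö) would each be expected to reach 3, since each character is now counted once instead of as two separate bytes.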