I have a question that I'm hoping you can help me with.
I'm trying to read characters from a file so that I can run a frequency analysis on them. The easiest approach seemed to be an array with indices 0-255, where I increment the element at the index given by each character's decimal value every time that character is read. The problem is that only the 7-bit characters seem to be counted. The code is below.
#include <stdio.h>
#include <stdlib.h>

int frequency(FILE *freqfilep)
{
    (void)freqfilep; /* unused: the file is opened by name below */
    printf("frequency function called!\n");
    int *frqarray = calloc(256, sizeof(int)); /* one counter per byte value */
    int tecken; /* int, so that EOF is distinguishable from byte 255 */
    FILE *fp;

    if (frqarray == NULL)
    {
        perror("Error allocating memory");
        return 1;
    }
    fp = fopen("freqfile.txt", "r");
    if (fp == NULL)
    {
        perror("Error in opening file");
        free(frqarray);
        return 1; /* do not go on to read from a NULL stream */
    }
    /* Count every byte until end of file or a read error. */
    while ((tecken = fgetc(fp)) != EOF)
    {
        frqarray[tecken]++;
    }
    printf("a%d\n", frqarray[97]);
    printf("b%d\n", frqarray[98]);
    printf("c%d\n", frqarray[99]);
    printf("1%d\n", frqarray[49]);
    printf("2%d\n", frqarray[50]);
    printf("3%d\n", frqarray[51]);
    printf("å%d\n", frqarray[134]);
    printf("ä%d\n", frqarray[132]);
    printf("ö%d\n", frqarray[148]);
    fclose(fp);
    free(frqarray);
    return 0;
}
The file I'm reading from contains the following chars:
aaa bbb ccc 111 222 333 ååå äää ööö
So the printf calls at the bottom of my code should print:
a3
b3
c3
13
23
33
å3
ä3
ö3
But the actual output is:
a3
b3
c3
13
23
33
å0
ä0
ö0
So I'm guessing there is some issue with reading the 8-bit characters. I've looked around the forum and found some fairly similar posts where the answer was that I need to use a buffer, like this: fread(&buffer, 256, 1, file);
but I'm not sure how to implement it.
Those characters are most likely not single-byte characters with the high bit set, but multibyte characters.
In UTF-8, each of these characters is encoded as the following two-byte sequence:
å: 0xc3 0xa5 (decimal 195 165)
ä: 0xc3 0xa4 (decimal 195 164)
ö: 0xc3 0xb6 (decimal 195 182)
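The byte values listed above follow from the UTF-8 encoding rule for code points below U+0800: the top five bits go into a lead byte of the form 110xxxxx and the low six bits into a continuation byte of the form 10xxxxxx. The following is only a minimal sketch of that arithmetic; the function name utf8_encode_2byte is made up for illustration and it deliberately handles nothing above U+07FF.

```c
#include <stddef.h>

/* Hypothetical helper: encode a code point in the range U+0000..U+07FF
   as UTF-8. Writes 1 or 2 bytes to out and returns how many were written,
   or 0 if cp is outside the range this sketch supports. */
size_t utf8_encode_2byte(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;                    /* ASCII: one byte, unchanged */
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 110xxxxx: top 5 bits */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* 10xxxxxx: low 6 bits */
        return 2;
    }
    return 0; /* 3- and 4-byte sequences are not covered here */
}
```

For å (U+00E5), 0xE5 >> 6 is 3 and 0xE5 & 0x3F is 0x25, which gives the bytes 0xc3 0xa5. All three letters share the lead byte 0xc3 because their code points agree in the top bits.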
Add the following to your code:
printf("195 %d\n", frqarray[195]);
printf("165 %d\n", frqarray[165]);
printf("164 %d\n", frqarray[164]);
printf("182 %d\n", frqarray[182]);
And you'll probably get this output:
195 9
165 3
164 3
182 3
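To see where those counts come from without touching the file, here is a small sketch that counts bytes in an in-memory copy of the sample line. It assumes the file is saved as UTF-8; the names count_bytes and sample_utf8 are invented for this example.

```c
#include <stddef.h>

/* The sample line from the question, written out as the bytes it would
   contain if the file is UTF-8 encoded (an assumption). */
const char sample_utf8[] =
    "aaa bbb ccc 111 222 333 "
    "\xc3\xa5\xc3\xa5\xc3\xa5 "   /* three times a-ring */
    "\xc3\xa4\xc3\xa4\xc3\xa4 "   /* three times a-umlaut */
    "\xc3\xb6\xc3\xb6\xc3\xb6";   /* three times o-umlaut */

/* Add one to freq[b] for every byte b in buf[0..len-1].
   The caller is expected to zero freq beforehand. */
void count_bytes(const unsigned char *buf, size_t len, int freq[256])
{
    for (size_t i = 0; i < len; i++)
        freq[buf[i]]++;
}
```

Calling count_bytes((const unsigned char *)sample_utf8, sizeof sample_utf8 - 1, freq) on a zeroed array leaves 9 in freq[195], because the lead byte 0xc3 appears once in each of the nine non-ASCII characters, and 3 in each of freq[165], freq[164] and freq[182].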
EDIT:
If you need to do a frequency analysis of characters rather than bytes, use fgetwc to read in the characters instead. Note that fgetwc decodes the stream according to the current locale, so call setlocale first with a locale whose encoding matches the file. If you expect all characters to be in the Basic Multilingual Plane (Unicode code points U+0000 through U+FFFF), you can create an array of size 65536 and output that. If you're expecting characters beyond that range, you might want to use a different scheme.
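A rough sketch of the fgetwc approach might look like the following. The function name count_wide_chars is hypothetical, and the sketch assumes the caller has already called setlocale with a locale whose encoding matches the file (for example a UTF-8 locale) before the first read from the stream. Characters outside the Basic Multilingual Plane are simply skipped.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#define BMP_SIZE 65536UL

/* Hypothetical helper: read wide characters from fp until end of file and
   add one to freq[c] for every character c below U+10000.
   freq must point to BMP_SIZE zeroed counters.
   Returns the number of characters counted, or -1 if reading stopped
   before end of file (read error or invalid byte sequence). */
long count_wide_chars(FILE *fp, unsigned long *freq)
{
    long total = 0;
    wint_t wc;

    while ((wc = fgetwc(fp)) != WEOF) {
        if ((unsigned long)wc < BMP_SIZE) {
            freq[wc]++;
            total++;
        }
        /* characters beyond the BMP are ignored in this sketch */
    }
    return feof(fp) ? total : -1;
}
```

A caller would typically run setlocale(LC_ALL, "") once at startup, open the file with fopen, and pass the stream together with a zero-initialised array such as static unsigned long freq[65536]. With the sample file and a UTF-8 locale, freq[0xE5], freq[0xE4] and freq[0xF6] should then each hold 3. On platforms where wchar_t is 16 bits wide, characters beyond the BMP arrive as surrogate pairs, so this scheme only makes sense for BMP text there.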