C: parse char array of digits separated by characters and turn digits into integers


I have a problem with some strings created by a sequence alignment program (this is a bioinformatics project). I am trying to add functionality to an existing C program that parses alignment files, but I've run into issues parsing a "mismatch" string that the program creates. For context, here is an example of the alignment string:

example = "28G11AC10T32";

Here is how to interpret the string: the first 28 bases match the sequence, then there is a "G" mismatch (29th base total), the next 11 bases match (40th base total), an "A" mismatch (41st base total), "C" mismatch (42nd base total), and so on...

I need to find the base positions where there are mismatches (i.e., where the string has a letter instead of digits) and store them in an int array so that I can look them up in a later subroutine.

So here is where my issue comes into play. I have written a subroutine that I "thought" could parse this out, but I get a very strange artifact from the output. NOTE: please forgive my terrible and cluttered code! I am not a C programmer by any means and my training is not in computer science!

int errorPosition(char *mis_match, int *positions){
    int i = 0; //iterator for loop
    int pi = 0; //position array iterator
    int in = 0; //makeshift boolean to tell if values are inside the pre array
    int con = 0; //temporary holder for values converted from the pre array
    char pre[5]; //this array will hold the digit values that will be converted to ints
    int pri = 0; //pre array iterator
    pre[0] = '\0';
    for (i = 0; i < strlen(mis_match); i++){
        if(isalpha(mis_match[i]) && in == 1){
            con += atoi(pre);   // this is the part where I get an artifact (see below)
            positions[pi] = con;
            con++;
            pi++;
            in = 0;
            memset(&pre[0], 0, sizeof(pre));
            pri = 0;
        }else if(isalpha(mis_match[i]) && in == 0){
            positions[pi] = con;
            con++;
            pi++;
        }else if(isdigit(mis_match[i])){
            pre[pri] = mis_match[i];
            pri++;
            in = 1;
        }
    }
    if(pri > 0){
        con += atoi(pre);
        positions[pi] = con;
        pi++;
    }

}

So, my issue is that when I reach the line that I have commented above ("this is the part where I get an artifact"), my "pre" string contains the digits times 10. For example, using the example string listed above, the first time the loop reaches that line I would expect pre to contain "28", but instead it contains "280"! When I use atoi to convert the string, the result is therefore ten times higher than I expect.
Is there something that I am missing, or some char array convention in C that I am ignorant of here? Thank you in advance for your replies.


Solution

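  • This may not be the only issue, but you are not null-terminating the string that you pass to atoi. Only pre[0] is initialized before the loop, so after '2' and '8' are written, the third position of the array still holds whatever happened to be in memory. The '0' character in the third position of 280 may be such garbage, because you never wrote to that position of the array.

    To address this issue, you should add this line before each call to atoi: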

    pre[pri] = '\0';
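
    For illustration, here is a minimal sketch of the same loop with that fix applied. The name mismatch_positions and the max_positions parameter are hypothetical and not part of the original program. It also assumes that the trailing match count (the final "32" in the example) does not need to be stored, which differs from the original function.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical rewrite of errorPosition (name and max_positions are assumptions).
   Stores the 0-based offset of every mismatch letter in positions[] and
   returns the number of offsets stored. */
int mismatch_positions(const char *mis_match, int *positions, int max_positions)
{
    char pre[16];     /* digits of the current run of matching bases */
    size_t pri = 0;   /* next free slot in pre */
    int con = 0;      /* number of bases consumed so far */
    int pi = 0;       /* number of mismatches stored so far */

    for (size_t i = 0; mis_match[i] != '\0'; i++) {
        unsigned char c = (unsigned char)mis_match[i];

        if (isdigit(c)) {
            /* Collect the digit, always leaving room for the terminator. */
            if (pri < sizeof(pre) - 1) {
                pre[pri++] = (char)c;
            }
        } else if (isalpha(c)) {
            if (pri > 0) {
                pre[pri] = '\0';   /* the fix: terminate before calling atoi */
                con += atoi(pre);  /* skip over the matching bases */
                pri = 0;           /* start a fresh digit run */
            }
            if (pi < max_positions) {
                positions[pi++] = con;  /* offset of this mismatch */
            }
            con++;                 /* the mismatch itself consumes one base */
        }
    }
    return pi;
}
```

    With the example string "28G11AC10T32", this sketch stores the offsets 28, 40, 41 and 52 (the 29th, 41st, 42nd and 53rd bases) and returns 4. Because the terminator is written at pre[pri] every time, the memset from the original code is no longer needed.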