Search code examples
calgorithmedit-distance

Edit Distance Matrix


I'm trying to build a program that takes two strings and fills in the edit distance matrix for them. The thing that is tripping me up is, for the second string input, it is skipping over the second input. I've tried clearing the buffer with getch(), but it didn't work. I've also tried switching over to scanf(), but that resulted in some crashes as well. Help please!

Code:

#include <stdio.h>
#include <stdlib.h>

int min(int a, int b, int c){
    if(a > b && a > c)
        return a;
    else if(b > a && b > c)
        return b;
    else
        return c;
}

int main(){

    // allocate size for strings
    int i, j;
    char *input1 = (char*)malloc(sizeof(char)*100);
    char *input2 = (char*)malloc(sizeof(char)*100);

    // ask for input
    printf("Enter the first string: ");
    fgets(input1, sizeof(input1), stdin);
    printf("\nEnter the second string: ");
    fgets(input2, sizeof(input2), stdin);

    // make matrix
    int len1 = sizeof(input1), len2 = sizeof(input2);
    int c[len1 + 1][len2 + 1];

    // set up input 2 length
    for(i = 0; i < len2 + 1; i++){
        c[0][i] = i;
    }

    // set up input 1 length
    for(i = 0; i < len1 + 1; i++){
        c[i][0] = i;
    }

    // fill in the rest of the matrix
    for(i = 1; i < len1; i++){
        for(j = 1; j < len2; j++){
            if(input1[i] == input2[j]) // if the first letters are equal make the diagonal equal to the last
                c[i][j] = c[i - 1][j - 1];
            else
                c[i][j] = 1 + min(c[i - 1][j - 1], c[i - 1][j], c[i][j - 1]);
        }
    }

    // print the matrix
    printf("\n");
    for(j = 0; j < len2; j++){
        for(i = 0; i < len1; i++){
            printf("|  %d", c[i][j]);
        }
        printf("\n");
    }

    return 1;
}

Solution

  • Stick with fgets.

    As others have pointed out, use char input1[100] instead of char *input1 = malloc(...)

    But, even with that change, which makes the sizeof inside of the fgets correct, using sizeof when setting up len1 and len2 is wrong. You'll be processing an entire buffer of 100, even if their are only 10 valid characters in it (i.e. the remaining ones are undefined/random).

    What you [probably] want is strlen [and a newline strip] to get the actual useful lengths.

    Here's the modified code [please pardon the gratuitous style cleanup]:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int
    min(int a, int b, int c)
    {
        if (a > b && a > c)
            return a;
        if (b > a && b > c)
            return b;
        return c;
    }
    
    int
    main(void)
    {
    
        // allocate size for strings
        int i;
        int j;
        char input1[100];
        char input2[100];
    
        // ask for input
        printf("Enter the first string: ");
        fgets(input1, sizeof(input1), stdin);
        int len1 = strlen(input1);
        if (input1[len1 - 1] == '\n') {
            input1[len1 - 1] = 0;
            --len1;
        }
    
        printf("\nEnter the second string: ");
        fgets(input2, sizeof(input2), stdin);
        int len2 = strlen(input2);
        if (input2[len2 - 1] == '\n') {
            input2[len2 - 1] = 0;
            --len2;
        }
    
        // make matrix
        int c[len1 + 1][len2 + 1];
    
        // set up input 2 length
        for (i = 0; i < len2 + 1; i++) {
            c[0][i] = i;
        }
    
        // set up input 1 length
        for (i = 0; i < len1 + 1; i++) {
            c[i][0] = i;
        }
    
        // fill in the rest of the matrix
        for (i = 1; i < len1; i++) {
            for (j = 1; j < len2; j++) {
                // if the 1st letters are equal make the diagonal equal to the last
                if (input1[i] == input2[j])
                    c[i][j] = c[i - 1][j - 1];
                else
                    c[i][j] = 1 + min(c[i - 1][j - 1], c[i - 1][j], c[i][j - 1]);
            }
        }
    
        // print the matrix
        printf("\n");
        for (j = 0; j < len2; j++) {
            for (i = 0; i < len1; i++) {
                printf("|  %d", c[i][j]);
            }
            printf("\n");
        }
    
        return 1;
    }
    

    UPDATE:

    Okay sweet I see what you mean! The reason I was trying to use malloc though was to avoid making the matrix that I had to print a size of 100x100 blank spaces.

    With either the fixed size input1 or the malloced one, fgets will only fill it to the input size entered [clipped to the second argument, if necessary]. But, it does not pad/fill the remainder of the buffer with anything (e.g. spaces on the right). What it does do is add an EOS [end-of-string] character [which is a binary 0x00] after the last char read from input [which is usually the newline].

    Thus, if the input string is: abcdef\n, the length [obtainable from strlen] is 7, input[7] will be 0x00, and input1[8] through input1[99] will have undefined/random/unpredictable values and not spaces.

    Since a newline char isn't terribly useful, it is often stripped out before further processing. For example, it isn't terribly relevant when computing edit distance for a small phrase.

    Does using strlen() only count the number of chars inside the array, or does it include all the blank spaces too?

    As I mentioned above, fgets does not pad the string at the end, so, not to worry. It will do what you want/expect.

    strlen only counts chars up to [but not including the EOS terminator character (i.e.) zero]. If some of these chars happen to be spaces, they will be counted by strlen--which is what we want.

    Consider computing the edit distance between any two of the following phrases:

    quick brown fox jumped over the lazy dogs
    the quick brown fox jumped over lazy dogs
    quick brown fox jumps over the lazy dog
    

    In each case, we want strlen to include the [internal/embedded] spaces in the length calculation. That's because it is perfectly valid to compute the edit distance of phrases.


    There is a valid usage for malloc: when the amount of data is too big to fit on the stack. Most systems have a default limit (e.g. under linux, it's 8 MB).

    Suppose we were computing the edit distance for two book chapters [read from files], we'd have (e.g.):

    char input1[50000];
    char input2[50000];
    

    The above would fit, but the c matrix would cause a stack overflow:

    int c[50000][50000];
    

    because the size of this would be 50000 * 50000 * 4 which is approx 9.3 GB.

    So, to fit all this data, we'd need to allocate it on the heap. While it is possible to do a malloc for c and maintain the 2D matrix access, we'd have to create a function and pass off the pointer to c to it.

    So, here's a modified version that takes input of arbitrarily large sizes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <string.h>
    
    #define sysfault(_fmt...) \
        do { \
            fprintf(stderr,_fmt); \
            exit(1); \
        } while (0)
    
    #define C(y,x)      c[((y) * (len2 + 1)) + (x)]
    
    long
    min(long a, long b, long c)
    {
        if (a > b && a > c)
            return a;
        if (b > a && b > c)
            return b;
        return c;
    }
    
    char *
    input(const char *prompt,long *lenp,const char *file)
    {
        FILE *fp;
        char *lhs;
        int chr;
        long siz;
        long len;
    
        if (file != NULL)
            fp = fopen(file,"r");
        else {
            fp = stdin;
            printf("Enter %s string: ",prompt);
            fflush(stdout);
        }
    
        lhs = NULL;
        siz = 0;
        len = 0;
    
        while (1) {
            chr = fgetc(fp);
            if (chr == EOF)
                break;
    
            if ((chr == '\n') && (file == NULL))
                break;
    
            // grow the character array
            if ((len + 1) >= siz) {
                siz += 100;
                lhs = realloc(lhs,siz);
                if (lhs == NULL)
                    sysfault("input: realloc failure -- %s\n",strerror(errno));
            }
    
            lhs[len] = chr;
            len += 1;
        }
    
        if (file != NULL)
            fclose(fp);
    
        if (lhs == NULL)
            sysfault("input: premature EOF\n");
    
        // add the EOS
        lhs[len] = 0;
    
        // return the length to the caller
        *lenp = len;
    
        return lhs;
    }
    
    int
    main(int argc,char **argv)
    {
        long i;
        long j;
        char *input1;
        long len1;
        char *input2;
        long len2;
        long *c;
    
        --argc;
        ++argv;
    
        switch (argc) {
        case 2:
            input1 = input("first",&len1,argv[0]);
            input2 = input("second",&len2,argv[1]);
            break;
        default:
            input1 = input("first",&len1,NULL);
            input2 = input("second",&len2,NULL);
            break;
        }
    
        // make matrix
        c = malloc(sizeof(*c) * (len1 + 1) * (len2 + 1));
        if (c == NULL)
            sysfault("main: malloc failure -- %s\n",strerror(errno));
    
        // set up input 2 length
        for (i = 0; i < len2 + 1; i++) {
            C(0,i) = i;
        }
    
        // set up input 1 length
        for (i = 0; i < len1 + 1; i++) {
            C(i,0) = i;
        }
    
        // fill in the rest of the matrix
        for (i = 1; i < len1; i++) {
            for (j = 1; j < len2; j++) {
                // if the 1st letters are equal make the diagonal equal to the last
                if (input1[i] == input2[j])
                    C(i,j) = C(i - 1,j - 1);
                else
                    C(i,j) = 1 + min(C(i - 1,j - 1), C(i - 1,j), C(i,j - 1));
            }
        }
    
        // print the matrix
        printf("\n");
        for (j = 0; j < len2; j++) {
            for (i = 0; i < len1; i++) {
                printf("|  %ld", C(i,j));
            }
            printf("\n");
        }
    
        return 1;
    }