Search code examples
cstringsubstringunique

How to avoid duplicates when finding all k-length substrings


I want to display all substrings with k letters, one per line, but avoid duplicate substrings. I managed to write to a new string all the k length words with this code:

void subSent(char str[], int k) {
    int MaxLe, i, j, h, z = 0, Length, count;
    char stOu[1000] = {'\0'};
    
    Length = (int)strlen(str);
    MaxLe = maxWordLength(str);

    if((k >= 1) && (k <= MaxLe)) {
        for(i = 0; i < Length; i++) {
            if((int)str[i] == 32) {
                j = i = i + 1;
            } else {
                j = i;
            }

            for(; (j < i + k) && (Length - i) >= k; j++) {
                if((int)str[j] != 32) {
                    stOu[z] = str[j];
                } else {
                    stOu[z] = str[j + 1];
                }
                z++;
            }
            stOu[z] = '\n';
            z++;
        }
    }
}

But I'm struggling with the part that needs to save only one time of a word.

For example, the string HAVE A NICE DAY and k = 1 it should print:

H
A
V
E
N
I
C
D
Y

Solution

  • Your subSent() routine poses a couple of challenges: first, it neither returns nor prints it's result -- you can only see it in the debugger; second it calls maxWordLength() which you didn't supply.

    Although avoiding duplicates can be complicated, in the case of your algorithm, it's not hard to do. Since all your words are fixed length, we can walk the output string with the new word, k letters (plus a newline) at a time, doing strncmp(). In this case the new word is the last word added so we quit when the pointers meet.

    I've reworked your code below and added a duplication elimination routine. I didn't know what maxWordLength() does so I just aliased it to strlen() to get things running:

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>
    
    #define maxWordLength strlen
    
    // does the last (fixed size) word in string appear previously in string
    bool isDuplicate(const char *string, const char *substring, size_t n) {
        for (const char *pointer = string; pointer != substring; pointer += (n + 1)) {
            if (strncmp(pointer, substring, n) == 0) {
                return true;
            }
        }
    
        return false;
    }
    
    void subSent(const char *string, int k, char *output) {
        int z = 0;
    
        size_t length = strlen(string);
        int maxLength = maxWordLength(string);
    
        if (k >= 1 && k <= maxLength) {
            for (int i = 0; i < length - k + 1; i++) {
    
                int start = z;  // where does the newly added word begin
    
                for (int j = i; (z - start) < k; j++) {
    
                    output[z++] = string[j];
                    while (string[j + 1] == ' ') {
                        j++;  // assumes leading spaces already dealt with
                    }
                }
    
                output[z++] = '\n';
    
                if (isDuplicate(output, output + start, k)) {
    
                    z -= k + 1;  // last word added was a duplicate so back it out
    
                }
    
                while (string[i + 1] == ' ') {
                    i++;  // assumes original string doesn't begin with a space
                }
            }
        }
    
        output[z] = '\0';  // properly terminate the string
    }
    
    int main() {
        char result[1024];
    
        subSent("HAVE A NICE DAY", 1, result);
    
        printf("%s", result);
    
        return 0;
    }
    

    I somewhat cleaned up your space avoidance logic but it can be tripped by leading spaces on the input string.

    OUTPUT

    subSent("HAVE A NICE DAY", 1, result);
    
    H
    A
    V
    E
    N
    I
    C
    D
    Y
    
    subSent("HAVE A NICE DAY", 2, result);
    
    HA
    AV
    VE
    EA
    AN
    NI
    IC
    CE
    ED
    DA
    AY
    
    subSent("HAVE A NICE DAY", 3, result);
    
    HAV
    AVE
    VEA
    EAN
    ANI
    NIC
    ICE
    CED
    EDA
    DAY