Search code examples
csortingcomparecharacterlexicographic

lexicographically sort in C


I don't understand how works this sort.Why is it so? And how we compare characters in C , how can I understand which of them less or greater than other? It's from book.

Call the function compareStrings and have it return the value –1 if the first string is lexicographically less than the secondstring, 0 if the two strings are equal, and 1 if the first string is lexicographically greater than the second string. So, the function call

compareStrings ("alpha", "altered")

returns the value –1 because the first string is lexicographically less than the second string (think of this to mean that the first string occurs before the second string in a dictionary).And, the function call

compareStrings ("zioty", "yucca");

returns the value 1 because “zioty” is lexicographically greater than “yucca.”

#include <stdio.h>
#include <stdbool.h>

struct entry
{
    char word[15];
    char definition[50];
};

bool equalStrings(char s1[], char s2[])
{
    bool areEqual = false;
    int i = 0;

    while(s1[i] != '\0' && s2[i] != '\0' && s1[i] == s2[i])
        i++;

    if(s1[i] == '\0' && s2[i] == '\0')
        areEqual = true;
    else
        areEqual = false;

    return areEqual;
}

int lookup (struct entry dictionary[], char search[],int entries)
{
    int low = 0;
    int high = entries - 1;
    int mid, result;

    while (low <= high)
    {
        mid = (low + high) / 2;
        result = compareStrings (dictionary[mid].word, search);

        if(result == -1)
            low = mid + 1;
        else if(result == 1)
            high = mid - 1;
        else 
            return mid;
    }

    return -1;
}

int compareStrings(char s1[], char s2[])
{
    int i = 0, answer;

    while(s1[i] == s2[i] && s1[i] != '\0' && s2[i] != '\0')
        i++;

    if(s1[i] < s2[i])
        answer = -1;
    else if(s1[i] == s2[i])
        answer = 0;
    else 
        answer = 1;

    return answer;
}

int main()
{
    struct entry dictionary[100] = 
    {
        {"aardvark", "a burrowing African mammal" },
        {"acumen", " a bottomless pit  "},
        {"addle", "to become confused "},
        {"agar", "a jelly made from seaweed" }
        {"affix", "to append; attach"}
    };

    char word[15];
    int entries = 10;
    int entry;

    printf("Enter word: ");
    scanf("%14s", word);
    entry = lookup (dictionary, word, entries);

    if(entry != -1)
        printf("%s\n", dictionary[entry].definition);
    else
        printf("Sorry, the word %s is not in my dictionary.\n", word);

    return 0;

}

How understand that one greater than another? Thanks.


Solution

  • Lexicographical order is like dictionary order, a comes before b... with the additional behavior that case matters, uppercase coming before lowercase, and non alphabetic characters also compare based on their encoding.

    This comparison is performed character by character. If all characters in both strings are equal, the strings compare equal, the return value is 0, otherwise the relative order is determined by the comparison of the first mismatched character. If s1[i] < s2[i], string s1 comes before s2, the return value is negative, otherwise s1 comes after s2, the return value is positive.

    Note however that this book implementation is sub-optimal and potentially incorrect:

    • if s1[i] == s2[i] and s1[i] != '\0', comparing s2[i] to '\0' is redundant.
    • characters should be compared as unsigned char so extended characters come after regular characters. strcmp is specified this way.

    Here is a simplified version:

    int compareStrings(const char *s1, const char *s2) {
        size_t i;
    
        for (i = 0; s1[i] == s2[i]; i++) {
            if (s1[i] == '\0')
                return 0;
        }
        if ((unsigned char)s1[i] < (unsigned char)s2[i])
            return -1;
        else
            return 1;
    }
    

    Further notes:

    • the dictionary defined in main is not sorted, agar should come after affix. The lookup function relies on the array of entry structures being sorted. It might not find the correct entry for some input words.

    • the lookup function uses a binary search algorithm. This method is very efficient, but the implementation has a classic bug: mid = (low + high) / 2; can overflow for large arrays, causing undefined behavior. It should be written:

      mid = low + (high - low) / 2;