
Checking and comparing structure elements to a user input in C


I am trying to write a C function that searches a database of client information, compares each record to some user input, and prints the good results, where "good" is decided by a condition based on a Levenshtein distance percentage.

typedef struct Person_t
{
    char Name[32];
    char Email[64];
    char City[96];
    char Country[64];
} Person;  

Here is the problem: the user is asked to enter each piece of information represented by the structure members, but may fill in only the fields they want. For example: Name: Emily, Email: (left empty), City: (left empty), Country: France.

The goal is to compare only the attributes that were filled in (here Name and Country) with the corresponding ones in the database. As I said, the comparison is based on a condition related to the Levenshtein distance: if the percentage doesn't reach a cutoff, say 50, the record is not accepted as a "good result", and that might exclude good cases like, let's say:

Name: Jack, Email: jack.jack@gmail.com , City: Birmingham, Country: United States

This client exists in our "big" database (assuming we have a lot of Jacks and a lot of people from Birmingham), and somebody is searching while knowing only a few things about Jack, so they enter:

Name: Jackk , Email: (empty) , City: (empty) , Country: United Statesss

(I am messing up the spelling intentionally to show how the LD can be useful; it will still score a high percentage, so don't worry about that.)

The percentage condition using LD: percentage = (1 - LD / max(len(string1), len(string2))) * 100, and we're setting the cutoff at 50%.
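For reference, the percentage formula can be sketched in C roughly as follows. This is only a minimal sketch: it uses the textbook two-row dynamic-programming Levenshtein distance, and the names `levenshtein`, `similarity_percent` and `MAX_FIELD` are made up for illustration (`MAX_FIELD` is assumed to cover the largest member of `Person`).

```c
#include <stddef.h>
#include <string.h>

/* Assumed upper bound on field length; the largest Person member is City[96]. */
#define MAX_FIELD 96

static size_t min3(size_t a, size_t b, size_t c) {
  size_t m = a < b ? a : b;
  return m < c ? m : c;
}

/* Levenshtein distance between s and t, keeping only two rows of the DP table. */
size_t levenshtein(const char *s, const char *t) {
  size_t n = strlen(s), m = strlen(t);
  size_t prev[MAX_FIELD + 1], curr[MAX_FIELD + 1];

  if (m > MAX_FIELD) m = MAX_FIELD; /* defensive clamp; Person fields are shorter */

  for (size_t j = 0; j <= m; j++) prev[j] = j; /* "" -> t[0..j) takes j insertions */

  for (size_t i = 1; i <= n; i++) {
    curr[0] = i; /* s[0..i) -> "" takes i deletions */
    for (size_t j = 1; j <= m; j++) {
      size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
      curr[j] = min3(prev[j] + 1,         /* deletion */
                     curr[j - 1] + 1,     /* insertion */
                     prev[j - 1] + cost); /* substitution or match */
    }
    memcpy(prev, curr, (m + 1) * sizeof prev[0]);
  }
  return prev[m];
}

/* percentage = (1 - LD / max(len(a), len(b))) * 100 */
float similarity_percent(const char *a, const char *b) {
  size_t la = strlen(a), lb = strlen(b);
  size_t longest = la > lb ? la : lb;
  if (longest == 0) return 100.0f; /* two empty strings are identical */
  return (1.0f - (float)levenshtein(a, b) / (float)longest) * 100.0f;
}
```

With this sketch, "Jackk" against "Jack" has distance 1 and scores 80%, and "United Statesss" against "United States" has distance 2 and scores about 86.7%, so both misspelled inputs clear the 50% cutoff.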

The problem is that if we compare every structure member with the corresponding user input, including the fields left empty, we reduce the likelihood of a good score: comparing an empty string to an existing one gives a large Levenshtein distance, so the percentage is low, and that throws away a lot of good results.

It's important to note that I don't want to use an OR. I don't want a record to pass just because the user got "Jack" right (percentage = 100); that would not be effective, because there are too many Jacks (we're talking about a big database). So I'm definitely working with an AND, to make sure every field the user entered is as close as possible to what they want, while at the same time minimizing the number of results.
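One possible reading of that AND requirement, sketched in C: every field the user filled in must reach the cutoff on its own, and fields left empty are skipped rather than counted as failures. The function `record_matches` and the `FIELD_SKIPPED` marker are hypothetical names, and the per-field scores are assumed to have been computed already with the percentage formula.

```c
#include <stddef.h>

/* Hypothetical convention: a negative score means the query field was empty. */
#define FIELD_SKIPPED (-1.0f)

/* Returns 1 and stores the average score in *overall when every compared
   field reaches the cutoff (logical AND). Returns 0 when any compared field
   falls below the cutoff, or when there was nothing to compare. */
int record_matches(const float scores[], size_t n_fields, float cutoff,
                   float *overall) {
  float sum = 0.0f;
  size_t used = 0;

  for (size_t i = 0; i < n_fields; i++) {
    if (scores[i] < 0.0f) continue;   /* empty query field: ignore it */
    if (scores[i] < cutoff) return 0; /* one weak field rejects the record */
    sum += scores[i];
    used++;
  }
  if (used == 0) return 0; /* the user entered nothing at all */

  *overall = sum / (float)used; /* usable later as the sort key */
  return 1;
}
```

Under this reading, a record scoring 80 on Name and 90 on Country with Email and City left empty passes with an overall score of 85, while a record scoring 100 on Name but 20 on Country is rejected.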

Bear in mind that beyond that, the search results are going to be sorted by percentage, so the handling of blank strings needs to be consistent with that.
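The sorting step could look something like the sketch below, which uses the standard `qsort` with a descending comparator. The `SearchResult` type and the `sort_results` helper are assumptions for illustration; the idea is just to keep each record's index next to its overall percentage.

```c
#include <stdlib.h>

/* Hypothetical result entry: which record matched, and how well. */
typedef struct {
  int record_index; /* position of the record in the database array */
  float percentage; /* overall match score for that record */
} SearchResult;

/* qsort comparator: higher percentages come first. */
static int by_percentage_desc(const void *a, const void *b) {
  const SearchResult *ra = a;
  const SearchResult *rb = b;
  if (ra->percentage < rb->percentage) return 1;
  if (ra->percentage > rb->percentage) return -1;
  return 0;
}

/* Sorts the results in place, best match first. */
void sort_results(SearchResult *results, size_t count) {
  qsort(results, count, sizeof results[0], by_percentage_desc);
}
```

Because skipped fields never enter the average, a record matched on two fields and a record matched on four fields end up on the same 0 to 100 scale and can be sorted together.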


Solution

  • Here's an example of what I meant with my comment.

    The compare_field function returns 1 if there was a valid match at all and populates the percentage out argument; it returns 0 if that field didn't matter.

    The compare_record function calls that function 4 times, once for each field, and averages the percentages of the fields that were actually compared, so empty fields don't drag the score down.

    #include <stdio.h>
    #include <string.h>
    
    typedef struct Person_t {
      char Name[32];
      char Email[64];
      char City[96];
      char Country[64];
    } Person;
    
    static Person database[] = {
        {.Name = "Emily",
         .Email = "emi@x.ly",
         .City = "Paris",
         .Country = "France"},
        {.Name = "Jack",
         .Email = "jack.jack@gmail.com",
         .City = "Birmingham",
         .Country = "United States"},
        {.Name = "Jank",
         .Email = "jankjank@gmail.com",
         .City = "London",
         .Country = "United Kingdom"},
    };
    
    static int compare_field(const char *record, const char *query,
                             float *percentage) {
      if (strlen(record) == 0 || strlen(query) == 0) {
        return 0; // Ignore this field in the search
      }
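      // TODO: implement real Levenshtein distance here.
      // Placeholder: counts mismatches position by position, stopping at the
      // end of the shorter string.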
      int distance = 0;
      for (int i = 0;; i++) {
        if (record[i] == 0 || query[i] == 0)
          break;
        if (record[i] != query[i])
          distance++;
      }
      *percentage = (1.0f - distance / (float)strlen(record)) * 100.0f;
      return 1;
    }
    
    static float compare_record(const Person *record, const Person *query) {
      float total_match_percentage = 0, temp_percentage;
      int total_match_fields = 0;
    
    
      #define COMPARE_FIELD(field) \
        do { \
          if (compare_field(record->field, query->field, &temp_percentage)) { \
            total_match_percentage += temp_percentage; \
            total_match_fields++; \
          } \
        } while (0)
    
      COMPARE_FIELD(Name);
      COMPARE_FIELD(Email);
      COMPARE_FIELD(City);
      COMPARE_FIELD(Country);
    
      #undef COMPARE_FIELD
    
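      // Avoid dividing by zero when no field could be compared.
      if (total_match_fields == 0) {
        return 0.0f;
      }
      return total_match_percentage / (float)total_match_fields;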
    }
    
    int main() {
      int n_database = sizeof(database) / sizeof(Person);
      Person query = {.Name = "Jamk", .City = "Pondon", .Country = "United K"};
      for (int i = 0; i < n_database; i++) {
        float match = compare_record(&database[i], &query);
        printf("%s: %.2f\n", database[i].Name, match);
      }
    }
    

    With the query in the source, this prints out (since Emily still matches for Paris and I didn't implement a full Levenshtein distance algorithm)...

    Emily: 13.33
    Jack: 72.44
    Jank: 86.11