Search code examples
pseudocodesentence

Algorithm to see if an address has been written in a sentence


I want to make an algorithm which can see if an address is written in a sentence.

For instance, if a user writes:

"Hi, my address is Lincolnstreet 27, Foobarcity. Can you pick up the package there?"

And the user's address is Lincolnstreet 27, Foobarcity, then I want an algorithm that can detect that the address was mentioned in the sentence.

I already know the user's street name and number, zip code and city name.

It also needs to be fuzzy, in that people can make typos or make slight variations of their address that they wrote in the sentence. However, it's not required that the algorithm catches all occurences always no matter how mistyped they are, since that is obviously impossible. It's okay with a semi-naive solution.

I looked into Levehnstein distance, but I can't figure out how to make it work for this exact scenario. I also looked into Longest Common Subsequence, and it's the same problem there.

Any ideas? I don't necessarily care about the programming language.

I am not interested in a neural net solution - I truly believe it should be solvable with a relatively naive algorithm - I just don't know where to start.


Solution

  • Taking the sentence as the larger string, you basically want to see the following:

    • street name is present
    • city name is present
    • street number is present

    You can check for order if you care to, but you want it to be fuzzy, so we'll ignore that for now. It may be prudent to check for overlap, which you can do by looking at the start and end of the substrings and comparing them.

    Your language of choice almost certainly has some sort of .contains() function, and it likely has a fuzzy mode.

    In which case,

    if (sentence.roughly_contains(streetname) and sentence.roughly_contains(cityname) and sentence.contains(streetnumber)) {
        return true;
    }
    

    if you can't find a fuzzy match function, write one! Fuzzy Text Matching C# provides us with https://blogs.msdn.microsoft.com/toub/2006/05/05/generic-levenshtein-edit-distance-with-c/ which gives us a good generic implementation of fuzzy search that you can use to make a .roughly_contains() function.

    Order wise; the check roughly follows the pattern:

    //where all string.[start|end] are integers, locations can be found trivially or with the help of google once you know their presence
    overlap(string1, string2) {
        if (string1.start > string2.end || string1.end < string2.start) {
            return false;
        }
        else {
            return true
        }
    }
    

    (this is assuming that you know the addresses independently of the sentence)