
Searching for product codes and phone numbers in Lucene


I'm looking for general advice on how to search identifiers, product codes or phone numbers in Apache Lucene 8.x. Let's say I'm trying to search lists of product codes (like an ISBN, for example 978-3-86680-192-9). If somebody enters 9783, 978 3 or 978-3, then 978-3-86680-192-9 should appear. The same should happen if an identifier uses any combination of letters, spaces, digits and punctuation (examples: TS 123, 123.abc). How would I do this?

I thought I could solve this with a custom analyzer that removes all the punctuation and whitespace, but the results are mixed:

import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.TrimFilter;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;

public class IdentifierAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Keep the whole value as one token, lower-case it, then strip
        // everything that is not a letter or a digit.
        Tokenizer tokenizer = new KeywordTokenizer();
        TokenStream tokenStream = new LowerCaseFilter(tokenizer);
        tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
        tokenStream = new TrimFilter(tokenStream);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        // Apply the same normalization to query-time terms.
        TokenStream tokenStream = new LowerCaseFilter(in);
        tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
        tokenStream = new TrimFilter(tokenStream);
        return tokenStream;
    }
}

So while I get the desired results when performing a PrefixQuery with TS1*, searching for TS 1* (with whitespace) does not yield satisfactory results. When I look at the parsed query, I see that Lucene splits TS 1* into two clauses: myField:TS myField:1*. WordDelimiterGraphFilter looks interesting, but I couldn't figure out how to apply it here.


Solution

  • This is not a comprehensive answer - but I agree that WordDelimiterGraphFilter may be helpful for this type of data. However, there could still be test cases which need additional handling.

    Here is my custom analyzer, using a WordDelimiterGraphFilter:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilterFactory;
    import java.util.Map;
    import java.util.HashMap;
    
    public class IdentifierAnalyzer extends Analyzer {
    
        private WordDelimiterGraphFilterFactory getWordDelimiter() {
            Map<String, String> settings = new HashMap<>();
            settings.put("generateWordParts", "1");   // e.g. "PowerShot" => "Power" "Shot"
            settings.put("generateNumberParts", "1"); // e.g. "500-42" => "500" "42"
            settings.put("catenateAll", "1");         // e.g. "wi-fi" => "wifi" and "500-42" => "50042"
            settings.put("preserveOriginal", "1");    // e.g. "500-42" => "500" "42" "500-42"
            settings.put("splitOnCaseChange", "1");   // e.g. "fooBar" => "foo" "Bar"
            return new WordDelimiterGraphFilterFactory(settings);
        }
    
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new KeywordTokenizer();
            TokenStream tokenStream = new LowerCaseFilter(tokenizer);
            tokenStream = getWordDelimiter().create(tokenStream);
            return new TokenStreamComponents(tokenizer, tokenStream);
        }
        
        @Override
        protected TokenStream normalize(String fieldName, TokenStream in) {
            TokenStream tokenStream = new LowerCaseFilter(in);
            return tokenStream;
        }
    
    }
    

    It uses the WordDelimiterGraphFilterFactory helper, together with a map of parameters, to control which settings are applied.

    You can see the complete list of available settings in the WordDelimiterGraphFilterFactory JavaDoc. You may want to experiment with setting/unsetting different ones.
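
    Not part of the original answer, but as far as I know the same filter can also be built directly from WordDelimiterGraphFilter's flag constants, which correspond to the factory settings above. A minimal sketch of an equivalent createComponents:

    import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
    
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new KeywordTokenizer();
        TokenStream tokenStream = new LowerCaseFilter(tokenizer);
        // Flags mirroring the factory settings used above.
        int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
                | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS
                | WordDelimiterGraphFilter.CATENATE_ALL
                | WordDelimiterGraphFilter.PRESERVE_ORIGINAL
                | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE;
        tokenStream = new WordDelimiterGraphFilter(tokenStream, flags, null); // null = no protected words
        return new TokenStreamComponents(tokenizer, tokenStream);
    }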

    Here is a test index builder for the following 3 input values:

    978-3-86680-192-9
    TS 123
    123.abc
    
    public static void buildIndex() throws IOException, FileNotFoundException, ParseException {
        final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
        Analyzer analyzer = new IdentifierAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        Document doc;
    
        List<String> identifiers = Arrays.asList("978-3-86680-192-9", "TS 123", "123.abc");
    
        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
            for (String identifier : identifiers) {
                doc = new Document();
                doc.add(new TextField("identifiers", identifier, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }
    

    This creates the following tokens:

    [screenshot of the tokens generated for each of the three indexed values]
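
    If you want to reproduce that token list yourself, a small helper along these lines should print whatever tokens the analyzer emits for a given input (a sketch of mine, not part of the original answer):

    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    
    // Prints every token the given analyzer produces for the given text.
    public static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream("identifiers", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
    }

    For example, printTokens(new IdentifierAnalyzer(), "978-3-86680-192-9") should list the number parts, the catenated form and the preserved original produced by the settings above.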

    For querying the above indexed data I used this:

    public static void doSearch() throws IOException, ParseException {
        Analyzer analyzer = new IdentifierAnalyzer();
        QueryParser parser = new QueryParser("identifiers", analyzer);
    
        List<String> searches = Arrays.asList("9783", "9783*", "978 3", "978-3", "TS1*", "TS 1*");
    
        for (String search : searches) {
            Query query = parser.parse(search);
            printHits(query, search);
        }
    }
    
    private static void printHits(Query query, String search) throws IOException {
        System.out.println("search term: " + search);
        System.out.println("parsed query: " + query.toString());
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs results = searcher.search(query, 100);
            ScoreDoc[] hits = results.scoreDocs;
            System.out.println("hits: " + hits.length);
            for (ScoreDoc hit : hits) {
                System.out.println();
                System.out.println("  doc id: " + hit.doc + "; score: " + hit.score);
                Document doc = searcher.doc(hit.doc);
                System.out.println("  identifier: " + doc.get("identifiers"));
            }
        }
        System.out.println("-----------------------------------------");
    }
    

    This uses the following search terms - all of which I pass into the classic query parser (though you could, of course, use more sophisticated query types via the API):

    9783
    9783*
    978 3
    978-3
    TS1*
    TS 1*
    

    The only query which failed to find any matching documents was the first one:

    search term: 9783
    parsed query: identifiers:9783
    hits: 0
    

    This should not be a surprise, since this is a partial token, without a wildcard. The second query (with the wildcard added) found one document, as expected.
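
    As a side note (not from the original answer): if you wanted such partial input to match without requiring the user to type a wildcard, one option should be to bypass the classic parser for this case and build a PrefixQuery through the API, normalizing the input yourself:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;
    
    // Lower-cases the raw input and strips everything except letters and
    // digits; "9783", "978 3" and "978-3" all normalize to the prefix "9783",
    // which matches the catenated token produced at index time.
    public static Query prefixQueryFor(String userInput) {
        String normalized = userInput.toLowerCase().replaceAll("[^0-9a-z]", "");
        return new PrefixQuery(new Term("identifiers", normalized));
    }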

    The final query I tested, TS 1*, found three hits - but the one we want has the best matching score:

    search term: TS 1*
    parsed query: identifiers:ts identifiers:1*
    hits: 3
    
      doc id: 1; score: 1.590861
      identifier: TS 123
    
      doc id: 0; score: 1.0
      identifier: 978-3-86680-192-9
    
      doc id: 2; score: 1.0
      identifier: 123.abc
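
    If the two extra hits are unwanted, one option (again, not part of the original answer) should be to make the classic parser require every clause, so that both ts and 1* have to match the same document:

    QueryParser parser = new QueryParser("identifiers", new IdentifierAnalyzer());
    // Use AND instead of the default OR semantics; "TS 1*" then parses to
    // +identifiers:ts +identifiers:1*, which only the "TS 123" document should satisfy.
    parser.setDefaultOperator(QueryParser.Operator.AND);
    Query query = parser.parse("TS 1*");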