Searching for product codes and phone numbers in Lucene

I'm looking for general advice on how to search identifiers, product codes, or phone numbers in Apache Lucene 8.x. Let's say I'm trying to search lists of product codes (like an ISBN, for example 978-3-86680-192-9). If somebody enters 9783 or 978 3 or 978-3, then 978-3-86680-192-9 should appear. The same should happen if an identifier uses any combination of letters, spaces, digits, and punctuation (examples: TS 123, 123.abc). How would I do this?

I thought I could solve this with a custom analyzer that removes all the punctuation and whitespace, but the results are mixed:

public class IdentifierAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new KeywordTokenizer();
        TokenStream tokenStream = new LowerCaseFilter(tokenizer);
        tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
        tokenStream = new TrimFilter(tokenStream);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }

    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream tokenStream = new LowerCaseFilter(in);
        tokenStream = new PatternReplaceFilter(tokenStream, Pattern.compile("[^0-9a-z]"), "", true);
        tokenStream = new TrimFilter(tokenStream);
        return tokenStream;
    }
}

So while I get the desired results when performing a PrefixQuery with TS1*, TS 1* (with whitespace) does not yield satisfactory results. When I look at the parsed query, I see that Lucene splits TS 1* into two queries: myField:TS myField:1*. WordDelimiterGraphFilter looks interesting, but I couldn't figure out how to apply it here.

Solution

This is not a comprehensive answer - but I agree that WordDelimiterGraphFilter may be helpful for this type of data. However, there could still be test cases which need additional handling.

Here is my custom analyzer, using a WordDelimiterGraphFilter:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilterFactory;
import java.util.Map;
import java.util.HashMap;

public class IdentifierAnalyzer extends Analyzer {

    private WordDelimiterGraphFilterFactory getWordDelimiter() {
        Map<String, String> settings = new HashMap<>();
        settings.put("generateWordParts", "1");   // e.g. "PowerShot" => "Power" "Shot"
        settings.put("generateNumberParts", "1"); // e.g. "500-42" => "500" "42"
        settings.put("catenateAll", "1");         // e.g. "wi-fi" => "wifi" and "500-42" => "50042"
        settings.put("preserveOriginal", "1");    // e.g. "500-42" => "500" "42" "500-42"
        settings.put("splitOnCaseChange", "1");   // e.g. "fooBar" => "foo" "Bar"
        return new WordDelimiterGraphFilterFactory(settings);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new KeywordTokenizer();
        TokenStream tokenStream = new LowerCaseFilter(tokenizer);
        tokenStream = getWordDelimiter().create(tokenStream);
        return new TokenStreamComponents(tokenizer, tokenStream);
    }
    
    @Override
    protected TokenStream normalize(String fieldName, TokenStream in) {
        TokenStream tokenStream = new LowerCaseFilter(in);
        return tokenStream;
    }

}

It uses the WordDelimiterGraphFilterFactory helper, together with a map of parameters, to control which settings are applied.

You can see the complete list of available settings in the WordDelimiterGraphFilterFactory JavaDoc. You may want to experiment with setting/unsetting different ones.

Here is a test index builder for the following 3 input values:

978-3-86680-192-9
TS 123
123.abc

public static void buildIndex() throws IOException {
    final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
    Analyzer analyzer = new IdentifierAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE); // overwrite any existing index

    List<String> identifiers = Arrays.asList("978-3-86680-192-9", "TS 123", "123.abc");

    // try-with-resources commits and closes the writer
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        for (String identifier : identifiers) {
            Document doc = new Document();
            // TextField is analyzed, so IdentifierAnalyzer runs on each value
            doc.add(new TextField("identifiers", identifier, Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}

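This should create the following tokens for each input (the lowercased original, the individual parts, and the catenated form):

978-3-86680-192-9 => 978-3-86680-192-9, 978, 3, 86680, 192, 9, 9783866801929
TS 123 => ts 123, ts, 123, ts123
123.abc => 123.abc, 123, abc, 123abc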
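To get a feel for how these settings combine, here is a rough plain-Java approximation of the splitting. It is only a sketch and not Lucene's implementation: TokenSketch and approximateTokens are hypothetical names, and it assumes the input is lowercased first, that every non-alphanumeric character acts as a delimiter, and that letter/digit transitions also split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java approximation of the analyzer's output; NOT Lucene code.
class TokenSketch {

    // Returns the lowercased original, each letter/digit run, and the catenation of all runs.
    static Set<String> approximateTokens(String identifier) {
        String original = identifier.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (char c : original.toCharArray()) {
            boolean alnum = Character.isLetterOrDigit(c);
            // A run ends at a delimiter or at a letter/digit transition.
            boolean boundary = run.length() > 0
                    && (!alnum || Character.isDigit(c) != Character.isDigit(run.charAt(run.length() - 1)));
            if (boundary) {
                parts.add(run.toString());
                run.setLength(0);
            }
            if (alnum) {
                run.append(c);
            }
        }
        if (run.length() > 0) {
            parts.add(run.toString());
        }
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(original);               // mirrors preserveOriginal
        tokens.addAll(parts);               // mirrors generateWordParts / generateNumberParts
        tokens.add(String.join("", parts)); // mirrors catenateAll
        return tokens;
    }

    public static void main(String[] args) {
        for (String id : Arrays.asList("978-3-86680-192-9", "TS 123", "123.abc")) {
            System.out.println(id + " => " + approximateTokens(id));
        }
    }
}
```

Running main prints one line per test identifier, which can be compared against what the real analyzer emits.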

For querying the above indexed data I used this:

public static void doSearch() throws IOException, ParseException {
    Analyzer analyzer = new IdentifierAnalyzer();
    QueryParser parser = new QueryParser("identifiers", analyzer);

    List<String> searches = Arrays.asList("9783", "9783*", "978 3", "978-3", "TS1*", "TS 1*");

    for (String search : searches) {
        Query query = parser.parse(search);
        printHits(query, search);
    }
}

private static void printHits(Query query, String search) throws IOException {
    System.out.println("search term: " + search);
    System.out.println("parsed query: " + query.toString());
    // try-with-resources closes the reader once the hits have been printed
    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)))) {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs results = searcher.search(query, 100);
        ScoreDoc[] hits = results.scoreDocs;
        System.out.println("hits: " + hits.length);
        for (ScoreDoc hit : hits) {
            System.out.println("");
            System.out.println("  doc id: " + hit.doc + "; score: " + hit.score);
            Document doc = searcher.doc(hit.doc);
            System.out.println("  identifier: " + doc.get("identifiers"));
        }
    }
    System.out.println("-----------------------------------------");
}

This uses the following search terms - all of which I pass into the classic query parser (though you could, of course, use more sophisticated query types via the API):

9783
9783*
978 3
978-3
TS1*
TS 1*

The only query which failed to find any matching documents was the first one:

search term: 9783
parsed query: identifiers:9783
hits: 0

This should not be a surprise: 9783 is only the beginning of the catenated token 9783866801929, and without a wildcard the query requires an exact token match. The second query (with the wildcard added) found one document, as expected.

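The final query I tested, TS 1*, found three hits, but the one we want has the best matching score. The query parser splits the input on the whitespace and treats the two resulting clauses as optional, so 1* also matches tokens such as 192 and 123 in the other two documents: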

search term: TS 1*
parsed query: identifiers:ts identifiers:1*
hits: 3

  doc id: 1; score: 1.590861
  identifier: TS 123

  doc id: 0; score: 1.0
  identifier: 978-3-86680-192-9

  doc id: 2; score: 1.0
  identifier: 123.abc
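
One possible way to avoid both the missing hit for 9783 and the extra hits for TS 1* is to normalize the raw user input yourself (lowercase it and strip everything except letters and digits) and then do a single prefix match against the catenated form. The plain-Java sketch below only illustrates that idea and does not touch Lucene; PrefixSketch, normalize and prefixMatches are hypothetical names. In Lucene, the normalized string could presumably be passed to new PrefixQuery(new Term("identifiers", normalized)) instead of going through the query parser, though I have not tested that against the index above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Plain-Java illustration only; the class and method names are hypothetical.
class PrefixSketch {

    // Lowercases the input and drops everything except ASCII letters and digits.
    static String normalize(String input) {
        return input.toLowerCase(Locale.ROOT).replaceAll("[^0-9a-z]", "");
    }

    // Returns the identifiers whose normalized form starts with the normalized input.
    static List<String> prefixMatches(String input, List<String> identifiers) {
        String prefix = normalize(input);
        if (prefix.isEmpty()) {
            // An empty prefix would match everything, so treat it as "no search".
            return Collections.emptyList();
        }
        return identifiers.stream()
                .filter(id -> normalize(id).startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> identifiers = Arrays.asList("978-3-86680-192-9", "TS 123", "123.abc");
        for (String search : Arrays.asList("9783", "978 3", "978-3", "TS1", "TS 1")) {
            System.out.println(search + " => " + prefixMatches(search, identifiers));
        }
    }
}
```

With this approach TS 1 and TS1 both normalize to ts1, so only TS 123 matches, and 9783, 978 3 and 978-3 all normalize to 9783, which matches only the ISBN.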