search-engine, information-retrieval, lemur

Get vocabulary list in Galago


I am using the Galago retrieval toolkit (part of the Lemur project) and I need a list of all vocabulary terms in the collection (all unique terms), ideally as a List<String> or Set<String>. How can I obtain such a list?


Solution

  • The `DumpKeysFn` class seems to give all the keys (unique terms) of the collection. The code should be like this:

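    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    // Galago package names as of the 3.x releases; adjust for your version.
    import org.lemurproject.galago.core.index.IndexPartReader;
    import org.lemurproject.galago.core.index.KeyIterator;
    import org.lemurproject.galago.core.index.disk.DiskIndex;

    // fileName is the path to an index part file, e.g. the "postings" file inside the index directory.
    public static Set<String> getAllVocabularyTerms(String fileName) throws IOException {
        Set<String> result = new HashSet<>();
        IndexPartReader reader = DiskIndex.openIndexPart(fileName);
        try {
            // An empty index part has no keys, so return the empty set.
            if (reader.getManifest().get("emptyIndexFile", false)) {
                return result;
            }
            // Walk the keys of the part; each key is one unique term.
            KeyIterator iterator = reader.getIterator();
            while (!iterator.isDone()) {
                result.add(iterator.getKeyString());
                iterator.nextKey();
            }
        } finally {
            reader.close();
        }
        return result;
    }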
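
If you would rather not link against Galago at all, another route is to dump the keys to a text file from the command line and load that file in plain Java. The sketch below assumes you have already run something like `galago dump-keys /path/to/index/postings > vocab.txt` (both paths are placeholders) and that the output has one term per line. The class name `VocabularyFile` and the method `loadTerms` are made up for illustration and are not part of Galago.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

public class VocabularyFile {

    // Reads a text file with one term per line (assumed format of the dump)
    // into a set. LinkedHashSet keeps the order in which terms appear in the file.
    public static Set<String> loadTerms(Path dumpFile) throws IOException {
        Set<String> terms = new LinkedHashSet<>();
        try (BufferedReader in = Files.newBufferedReader(dumpFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.isEmpty()) {   // skip blank lines
                    terms.add(line);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the dumped vocabulary file (hypothetical: vocab.txt)
        Set<String> terms = loadTerms(Paths.get(args[0]));
        System.out.println(terms.size() + " unique terms");
    }
}
```

If you need a List<String> instead, `new ArrayList<>(terms)` gives you one in the same order as the file.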