I am using Whoosh to index and search a variety of texts in various encodings. When I perform a search on my indexed files, though, some of the matching results do not appear in the output produced by the "highlighting" feature. I have a feeling this is related to encoding errors, but I can't figure out what might prevent all results from displaying. I would be very grateful for any light others can shed on this mystery.
Here is the script I am using to create my index, and here are the files I am indexing:
from whoosh.index import create_in
from whoosh.fields import *
import glob, os, chardet

encodings = ['utf-8', 'ISO-8859-2', 'windows-1250', 'windows-1252', 'latin1', 'ascii']

def determine_string_encoding(string):
    result = chardet.detect(string)
    string_encoding = result['encoding']
    return string_encoding

#specify a list of paths that contain all of the texts we wish to index
text_dirs = [
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\hume",
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\complete_pope\clean"
]

#establish the schema to be used when storing texts; storing content allows us to retrieve highlighted extracts from texts in which matches occur
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))

#check to see if we already have an index directory. If we don't, make it
if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

#create writer object we'll use to write each of the documents in text_dirs to the index
writer = ix.writer()

#create file in which we can write the encoding of each file to disk for review
with open("encodings_log.txt", "w") as encodings_out:

    #for each directory in our list
    for i in text_dirs:

        #for each text file in that directory (j is now the path to the current file within the current directory)
        for j in glob.glob(i + "\\*.txt"):

            #first, let's grab j's title. If the title is stored in the text file name, we can use this method:
            text_title = j.split("\\")[-1]

            #now let's read the file
            with open(j, "r") as fileobj:
                text_content = fileobj.read()

            #use method defined above to determine encoding of path and text_content
            path_encoding = determine_string_encoding(j)
            text_content_encoding = determine_string_encoding(text_content)

            #because we know the encoding of the files in this directory, let's override the previous text_content_encoding value and specify that encoding explicitly
            if "clean" in j:
                text_content_encoding = "iso-8859-1"

            #log the detected encodings so we can review them later
            encodings_out.write(j + "\t" + str(path_encoding) + "\t" + str(text_content_encoding) + "\n")

            #decode text_title, path, and text_content to unicode using the encodings we determined for each above
            unicode_text_title = unicode(text_title, path_encoding)
            unicode_text_path = unicode(j, path_encoding)
            unicode_text_content = unicode(text_content, text_content_encoding)

            #use writer method to add document to index
            writer.add_document(title=unicode_text_title, path=unicode_text_path, content=unicode_text_content)

#after you've added all of your documents, commit changes to the index
writer.commit()
That code seems to index the texts without any problems, but when I use the following script to search the index, I get three blank values in the out.txt output file--the first two rows are empty, and row six is empty--though I expect all three of those lines to be non-empty. Here is the script I'm using to search the index:
from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)

    #to enable Levenshtein-based fuzzy matching, use the FuzzyTermPlugin
    parser.add_plugin(FuzzyTermPlugin())

    #using ~2/3 means: allow for an edit distance of two (where insertions, deletions, and substitutions each cost one), but only count matches for which the first three letters match. Increasing this denominator greatly increases speed
    query = parser.parse(u"swallow~2/3")
    results = searcher.search(query)

    #see whoosh.query.Phrase, which describes the "slop" parameter (ie: number of words we can insert between any two words in our search query)

    #write query results to disk or html
    with codecs.open("out.txt", "w") as out:
        for i in results[0:]:
            title = i["title"]
            highlight = i.highlights("content")
            clean_highlight = " ".join(highlight.split())
            out.write(clean_highlight.encode("utf-8") + "\n")
If anyone can suggest reasons why those three rows are empty, I would be eternally grateful.
Holy Moly, I might have figured this out! It seems some of my text files (including both of the files with "hume" in the path) surpassed a character threshold that governs Whoosh's highlighting behavior (roughly 32K characters by default). If a file is larger than that, the highlighter appears to return its result as a plain string value rather than a unicode value. So, assuming one has an index with fields "path" (path to the file), "title" (title of the file), "content" (content of the file), and "encoding" (the encoding of the current file), one can test whether the files in that index are highlighted properly by running a script like the following:
from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")
phrase_to_search = unicode("swallow")

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)
    query = parser.parse(phrase_to_search)
    results = searcher.search(query)

    for hit in results:
        hit_encoding = hit["encoding"]
        with codecs.open(hit["path"], "r", hit_encoding) as fileobj:
            filecontents = fileobj.read()
            hit_highlight = hit.highlights("content", text=filecontents)
            hit_title = hit["title"]
            print type(hit_highlight), hit["title"]
If any of the printed values have type "str", then it seems that the highlighter is treating part of the designated file as type string rather than unicode.
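To help narrow down which files are likely to trip this behavior, one can also do a rough size check against that ~32K threshold before running the script above. This is a minimal sketch, not part of my actual pipeline: it assumes the same text_dirs directories used in the indexing script, and it approximates character counts with byte counts (close enough for the mostly single-byte encodings in my corpus).
import glob, os

#same (hypothetical) directories used in the indexing script above
text_dirs = [
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\hume",
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\complete_pope\clean"
]

#flag any file whose size exceeds the ~32K character threshold discussed below
#(byte counts only approximate character counts for multi-byte encodings)
for i in text_dirs:
    for j in glob.glob(i + "\\*.txt"):
        size_in_bytes = os.path.getsize(j)
        if size_in_bytes > 32768:
            print j, size_in_bytes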
Here are two ways to rectify this problem:

1) Split your large files (anything over 32K characters) into smaller files--all of which should contain < 32K characters--and index those smaller files. This approach requires more curation but ensures reasonable processing speed. (A rough, hypothetical sketch of this splitting step appears at the end of this post.)

2) Increase the maximum number of characters the highlighter will examine in each stored document, so that matches beyond the default limit can be highlighted and thus, in the example above, printed to terminal correctly. To implement this solution in the code above, one can add the following line after the line that defines results:
results.fragmenter.charlimit = 100000
Adding this line allows one to print results from the first 100,000 characters of the designated file to terminal, though it significantly increases processing time. Alternatively, one can remove the character limit altogether with results.fragmenter.charlimit = None, though this really increases processing time when working with large files...
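For the first approach, here is one hypothetical way to carve an oversized file into sub-32K pieces before indexing. It is only a sketch: the function name, the 32,000-character chunk size, the output naming scheme, and the sample path are all assumptions, not part of my actual pipeline.
import codecs, os

def split_text_file(path, encoding, out_dir, max_chars=32000):
    #make sure the output directory exists
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    #read the whole file as unicode using its known encoding
    with codecs.open(path, "r", encoding) as fileobj:
        text = fileobj.read()
    #write consecutive slices of at most max_chars characters to numbered output files
    base_name = os.path.splitext(os.path.basename(path))[0]
    part_number = 0
    for start in range(0, len(text), max_chars):
        chunk = text[start:start + max_chars]
        out_path = os.path.join(out_dir, base_name + ".part" + str(part_number) + ".txt")
        with codecs.open(out_path, "w", encoding) as out:
            out.write(chunk)
        part_number += 1

#hypothetical usage: carve one oversized file into <32K-character chunks before indexing
#split_text_file(r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\hume\some_large_file.txt", "utf-8", "hume_chunks")
Note that this naive slicing can split a word across chunk boundaries; breaking on paragraph boundaries instead would take a little more care, but the idea is the same.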