Tags: performance, text, text-mining

Get length statistics over a very large text file


I have a very large file (1.5M rows) containing a JSON dictionary on each row. Each row contains a parsed Wikipedia article. For example:

{"title": "article title", "summary": "this is a summary of around 500 words", "article": "This is the whole article with more than 3K words"}
{"title": "article2 title", "summary2": "this is another summary of around 500 words", "article": "This is another whole article with more than 3K words"}

Note that the file as a whole is not valid JSON; each line is a separate JSON document.

I want to compute some statistics on these texts, e.g. mean number of sentences, mean number of words, compression ratio, etc. However, everything I have tried takes ages. What is the fastest way to go about this? For reference, at the moment I am using spaCy for word and sentence tokenization, but I am open to more approximate solutions, e.g. using regexes, if they are the only way.


Solution

  • If you want to achieve high performance, then you should probably process the lines in parallel using multiple threads, and the per-line metric extraction should use SIMD-friendly code. It is also probably a good idea to simplify the parsing by using specialized code that handles only this problem rather than a general parsing tool (such as regular expressions, unless the target engine can produce very fast, linear-time, JIT-compiled code).

    The multithreading part is certainly the easiest one, since the computation appears to be mostly embarrassingly parallel. Each thread computes the target metrics on a chunk of lines, and the per-thread results can then be combined with a parallel reduction (i.e. a sum); a sketch of this threaded loop is given at the end of this answer.

    Each line can be parsed relatively quickly using simdjson. Since the JSON documents are small and the structure appears to be simple and always the same, you can use a regular expression such as "article" *: *"((?:[^"\\]+|\\")*)" to capture the article text (note that you may need to escape the backslashes depending on the language used). However, the best strategy is probably to parse the JSON document yourself so as to extract the wanted string much more efficiently, for example by searching for a very specific key pattern/string like " (with a SIMD-friendly loop) followed by article, and then parsing the rest with a more robust (but slower) method; see the extraction sketch at the end of this answer.

    Similar strategies apply to counting words. A fast over-approximation is to count the number of spaces directly followed by a non-space character (see the word-counting sketch at the end of this answer). The string encoding matters for parsing speed, as decoding UTF-8 strings is generally pretty slow. One fast solution is to simply discard non-ASCII characters if the target language is English or mostly uses ASCII characters. If this is not the case, you can use a SIMD-aware UTF-8 decoding library (or a hand-written algorithm in the worst case). Working on small chunks of about 1 KB can help to use the CPU cache more efficiently (and to auto-vectorize your code if you use a compiled native language like C or C++).

    If you are not very familiar with SIMD instructions or low-level parsing strategies/algorithms (like this one), note that there are fast parsing libraries, such as Hyperscan, that perform basic operations efficiently.
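
    Below is a minimal C++ sketch of the threaded loop mentioned above, assuming the lines have already been loaded into memory; the names process_line and total_metric are made up for the illustration, and process_line is only a placeholder for the real per-line work. Each thread accumulates into its own slot, and the per-thread partial sums are reduced at the end.

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <string>
    #include <thread>
    #include <vector>

    // Placeholder per-line metric: in practice this would extract the "article"
    // field and count words/sentences in it (see the sketches below).
    size_t process_line(const std::string& line) {
        return line.size();
    }

    // Sum process_line over all lines using num_threads threads, each working
    // on its own contiguous chunk, then reduce the per-thread partial sums.
    size_t total_metric(const std::vector<std::string>& lines, unsigned num_threads) {
        std::vector<size_t> partial(num_threads, 0);
        std::vector<std::thread> workers;
        const size_t chunk = (lines.size() + num_threads - 1) / num_threads;
        for (unsigned t = 0; t < num_threads; ++t) {
            workers.emplace_back([&, t] {
                const size_t begin = t * chunk;
                const size_t end = std::min(lines.size(), begin + chunk);
                for (size_t i = begin; i < end; ++i)
                    partial[t] += process_line(lines[i]);  // thread-local accumulation
            });
        }
        for (auto& w : workers) w.join();
        return std::accumulate(partial.begin(), partial.end(), size_t{0});
    }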
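
    As an illustration of the hand-written extraction strategy (not a full JSON parser), here is a minimal C++ sketch that scans one line for a given key such as "article" and returns the raw, still-escaped string value. It assumes the simple, fixed structure shown in the question; a real parser like simdjson remains the robust fallback for anything more complicated.

    #include <optional>
    #include <string>
    #include <string_view>

    // Extract the raw (still-escaped) value of a top-level string key from one
    // JSON line, assuming the simple fixed structure shown in the question.
    std::optional<std::string_view> extract_string_value(std::string_view line,
                                                         std::string_view key) {
        const std::string quoted_key = "\"" + std::string(key) + "\"";
        size_t pos = line.find(quoted_key);
        if (pos == std::string_view::npos) return std::nullopt;
        pos = line.find(':', pos + quoted_key.size());
        if (pos == std::string_view::npos) return std::nullopt;
        pos = line.find('"', pos + 1);               // opening quote of the value
        if (pos == std::string_view::npos) return std::nullopt;
        const size_t start = pos + 1;
        for (size_t i = start; i < line.size(); ++i) {
            if (line[i] == '\\') { ++i; continue; }  // skip the escaped character
            if (line[i] == '"')                      // unescaped closing quote
                return line.substr(start, i - start);
        }
        return std::nullopt;                         // malformed line
    }

    In practice you would call extract_string_value(line, "article") from the per-line function of the threaded loop above.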
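
    Finally, a minimal C++ sketch of the over-approximated word count mentioned above: it counts the start of each run of non-space bytes, working byte-wise so that multi-byte UTF-8 sequences are simply treated as non-space bytes (good enough for mostly-ASCII English text).

    #include <cstddef>
    #include <string_view>

    // Over-approximate word count: count the start of each run of non-space
    // bytes; multi-byte UTF-8 characters are just treated as non-space bytes.
    size_t count_words(std::string_view text) {
        size_t words = 0;
        bool in_word = false;
        for (unsigned char c : text) {
            const bool is_space = (c == ' ' || c == '\t' || c == '\n' || c == '\r');
            if (!is_space && !in_word)
                ++words;                 // a new word starts here
            in_word = !is_space;
        }
        return words;
    }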