
How to efficiently read lines of a large text file in Python in different orders: (a) a random line each time, (b) starting in the middle...?


I have a large text file (5 million to 10 million lines). Each line contains 10 to 5,000 alphanumeric items separated from each other by a space or comma, and each line ends with a linebreak. The number of lines is known before runtime, and the text file doesn't change during runtime.

Each time the code is run, it will be passed 50-200 search lists, each of which contains 2-10 items (words). For each of these search lists, I'd like to find x lines in the text file that contain at least one item from that list. The number x ranges from 5 to 10 and is set at runtime. Matching is case-insensitive and must be exact on word boundaries (e.g., "foo" matches ",foo " but not " foobar ").
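
To make the matching rule concrete, here's a minimal illustration with re (the word "foo" is just an example):

import re

pattern = re.compile(r'\bfoo\b', re.IGNORECASE)
print(pattern.search("bar,foo baz") is not None)     # True: exact word, comma-delimited
print(pattern.search("bar foobar baz") is not None)  # False: "foo" buried inside a longer word
print(pattern.search("bar FOO baz") is not None)     # True: case-insensitive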

Each search list has one of three search order strategies:

  • Normal. Start at the first line and search line-by-line in consecutive order until it finds x lines or reaches the last line.

  • Random. Pick lines at random from the text file. Keep going until it finds x lines or has searched every line.

  • Bucket range. First search the highest-priority range of lines, then the next-highest-priority range, and so on (see the sketch after this list). For example, the search range priority might be to first look through lines 1,000,000 to 1,499,999, then lines 0 to 999,999, then lines 1,500,000 to 2,199,999, etc. There can be between 3 and 20 buckets, each covering a range of 100,000 to 5,000,000 lines. The number of buckets and their line-number ranges are given at runtime. Within each range, search consecutively. Search until it finds x lines or reaches the end of the last bucket.
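
For illustration, here's a minimal sketch of the bucket strategy just described (the function and parameter names are made up for this example; buckets is a list of (start, end) line ranges, 0-indexed with end exclusive, in priority order):

from itertools import islice

def bucket_search(path, buckets, regex, x):
    found = []
    for start, end in buckets:  # highest-priority range first
        with open(path) as f:
            # islice skips to the bucket, then yields its lines consecutively;
            # note that it still reads (and discards) every skipped line
            for line in islice(f, start, end):
                if regex.search(line):
                    found.append(line)
                    if len(found) == x:
                        return found
    return found

Re-opening the file for each bucket keeps the sketch simple; a byte-offset index (sketched further below) would avoid re-reading the skipped lines.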

Here's what I wrote for the "normal search" [this test code checks all lines to the end of the file without stopping after x lines; in the final version, after it finds x matches for a search list, it'll stop and move on to the next search list]:

import re

textFile = "bigTextFile.txt"
searchItems = ["foo", "bar", "anotherfoo", "anotherbar", "etc"]
searchStrategy = "normal"  # could be "normal", "random", "bucket"; this code is just for "normal"

# Compile the pattern once, outside the loop; re.escape guards against
# regex metacharacters in the search items.
regex = re.compile('(' + '|'.join(r'\b' + re.escape(item) + r'\b' for item in searchItems) + ')',
                   re.IGNORECASE)
results = []
with open(textFile) as inFile:
    for line in inFile:
        match = regex.search(line)
        if match is not None:
            analysisResult = analyze_the_match(line)
            results.append(analysisResult)

This code for the "normal search" works. Of what I've tried, it seems to be the fastest, but I'm new to Python and I'd guess there must be a better way (speed/efficiency) to do this.
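
Since matches must be exact words separated by spaces or commas, one alternative worth benchmarking (a sketch, not a claim that it's faster on this data) is to skip the regex entirely and intersect a set of lowercased search items with each line's words; textFile, searchItems, and analyze_the_match are the names from the snippet above:

import re

searchSet = set(item.lower() for item in searchItems)
splitter = re.compile(r'[ ,]+')  # items are separated by spaces or commas
results = []
with open(textFile) as inFile:
    for line in inFile:
        words = splitter.split(line.strip().lower())
        if searchSet.intersection(words):
            results.append(analyze_the_match(line))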

[Update in response to comments, to better explain the reason for the different search strategies] The items are highly skewed in how often they occur. Playing with the data, I've found that about half the searches will complete before 10,000 lines, 40% could take a few million lines to find enough matches, and 10% won't find enough matches in the whole text file. The items in each line have no relation to the line they're in, but ranges of lines are similar (e.g., lines 100,000-500,000 are related, lines 1,500,000-1,750,000 are related, etc.). The code could be run many times for a given search list, and for each run, the priority might be to focus on a different range of lines. If the whole text file has only x lines with the items in a particular search list, then the result will always be those x lines. But for many search lists, there are 2x, 10x, or 100,000x lines which match, and at different times, I'd like to pick different lines. At certain times a particular range might be the priority, at other times random sampling is best, and sometimes just finding the first x lines from the beginning is fine; hence the "normal", "random", and "bucket" search strategies.

I'd really appreciate any ideas for "random" and "bucket", as I'm not sure how best to approach them efficiently. I played around with linecache, itertools.islice, and readlines (per @Martin Thoma in this answer, readlines is surprisingly fast), as well as modifying the code above, but my attempts at "random" and "bucket" are all clunky and inefficient, and I know that I don't know enough to know what could be best :).
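
For the random strategy specifically, one option not covered in the solution below is to exploit the fact that the file never changes during runtime: build a byte-offset index in one pass, after which any line can be fetched directly with seek(). A rough sketch (regex and x as in the code above):

import random

# Build the index once; open in binary mode so tell()/seek() are exact byte positions.
offsets = []
with open(textFile, 'rb') as f:
    while True:
        offsets.append(f.tell())
        if not f.readline():
            break
offsets.pop()  # the last recorded offset is end-of-file, not a line

# Visit lines in random order until x matching lines are found.
order = list(range(len(offsets)))
random.shuffle(order)
matches = []
with open(textFile, 'rb') as f:
    for lineno in order:
        f.seek(offsets[lineno])
        line = f.readline()  # bytes on Python 3; decode before matching if needed
        if regex.search(line):
            matches.append(line)
            if len(matches) == x:
                break

Building the index costs one full pass and some memory at the 10-million-line scale (array.array is more compact than a list); whether seeking beats a sequential scan depends on how early matches tend to appear.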

Any suggestions? Thanks.


Solution

  • For both the random and bucket searches, you can do a linear scan of the file and build a list of candidate results, replacing a candidate in the list when the list is full and a better candidate comes along.

    For random selection, you calculate the odds of a candidate belonging in the list and use a random number to decide whether it replaces an existing entry. This is reservoir sampling: after n matching lines have been seen, each one has probability x/n of being in the list.

    For bucket selection, a candidate outranks an existing list member if its bucket rank is better (lower) than that of the existing item; keeping the candidates sorted by rank makes the best x easy to take at the end.

    For random selection:

    import random
    candidates = []
    n = 0  # number of matching lines seen so far
    with open(textFile) as inFile:
        for line in inFile:
            if valid_candidate(line):  # use regex here
                n += 1
                if n <= x:
                    # fill the reservoir with the first x matches
                    candidates.append(line)
                else:
                    # then replace a random entry with probability x/n
                    i = random.randrange(n)
                    if i < x:
                        candidates[i] = line
    results = []
    for line in candidates:
        analysisResult = analyze_the_match(line)
        results.append(analysisResult)
    

    For bucket selection:

    import bisect
    candidates = []
    n = 0  # current line number
    with open(textFile) as inFile:
        for line in inFile:
            n += 1
            if valid_candidate(line):  # use regex here
                # use n to determine the bucket and assign a rank;
                # lower numbers are higher rank
                rank = get_rank(n)
                # keep candidates sorted by (rank, line number)
                bisect.insort(candidates, (rank, n, line))
    results = []
    for rank, n, line in candidates[:x]:  # the x best-ranked matches
        analysisResult = analyze_the_match(line)
        results.append(analysisResult)
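
    One optional refinement (not in the original answer): since only the first x entries of candidates are ever used, you can trim the list after each insertion so it never grows past x entries, which keeps memory bounded when there are many matches:

    bisect.insort(candidates, (rank, n, line))
    del candidates[x:]  # discard anything ranked below the best x seen so far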