python parsing multiprecision gmpy

Is there a more elegant way to read a text file containing mpz values into a list of integers?


I have a text file containing numbers that looks as follows:

[mpz(0), mpz(0), mpz(0), mpz(0), mpz(4), mpz(54357303843626),...]

Is there a simple way to parse it directly into a list of integers? It doesn't matter whether the target data type is an mpz integer or a plain Python integer.

What I have tried so far, and what works, is plain parsing (note: the target array y_val3 needs to be initialized with zeros in advance, since it may be larger than the list in the text file):

import re

text_file = open("../prod_sum_copy.txt", "r")
content = text_file.read()[1:-1]  # drop the surrounding [ and ]
text_file.close()
content_list = content.split(",")
y_val3 = [0]*10000
print(content_list)
for idx, element in enumerate(content_list):  # "element" avoids shadowing the built-in str
    m = re.search(r'mpz\(([0-9]+)\)', element)
    y_val3[idx] = int(m.group(1))
print(y_val3)

Although this approach works, I am not sure whether it is best practice or whether there is a more elegant way than plain parsing.

To facilitate things: Here is the original text file on GitHub. Note: this text file might grow in the future, which brings aspects such as performance and scalability into play.


Solution

  • I tried to look at a more elegant solution from both the human-readable perspective and the performance perspective.

    Caveats:

    • There is a lot going on here
    • I do not have the original file, so the numbers below will not match any numbers you might get on your device
    • Benchmarking every part would be too much work, so I focused on several of the biggest components

    The breakouts and timing below seem to show an order of magnitude difference in several of the approaches, so they may still be of use in gauging level of computational effort.

    My first approach was to try and measure the amount of overhead the file read/write added to the process so that we could explore how much computational effort was focused on just the data processing step.

    To do this, I made a function that included the file read and measured the whole process, end to end to see how long it took with my mini example file. I did this using %timeit in a Jupyter notebook.

    I then broke out the file reading step into its own function and used %timeit on just the data processing step to help show us:

    • how much time was used by file reads vs data processing in the original approach
    • how much time was used by the data processing approach in the improved approach.
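For anyone working outside a Jupyter notebook, the same split between file reading and data processing can be sketched with the standard-library timeit module. This is only a rough illustration under stated assumptions: SAMPLE, read_content and parse_content are hypothetical names made up for this sketch, the sample data is a stand-in for the real file, and the timings it prints will differ from the numbers quoted in this answer.

```python
import os
import re
import tempfile
import timeit

# Hypothetical sample standing in for the real file contents.
SAMPLE = "[mpz(0), mpz(0), mpz(4), mpz(54357303843626)]"


def read_content(path):
    """File I/O only: return the text between the outer brackets."""
    with open(path, "r") as text_file:
        return text_file.read()[1:-1]


def parse_content(content):
    """Data processing only: pull the digits out of each mpz(...) entry."""
    values = []
    for element in content.split(","):
        match = re.search(r"mpz\(([0-9]+)\)", element)
        values.append(int(match.group(1)))
    return values


# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(SAMPLE)
    path = tmp.name

try:
    content = read_content(path)
    values = parse_content(content)
    runs = 1000
    # Time each step on its own so the two costs can be compared.
    read_seconds = timeit.timeit(lambda: read_content(path), number=runs)
    parse_seconds = timeit.timeit(lambda: parse_content(content), number=runs)
finally:
    os.remove(path)

print(f"read:  {read_seconds / runs * 1e6:.2f} us per call")
print(f"parse: {parse_seconds / runs * 1e6:.2f} us per call")
```

Timing the two functions separately is what makes it possible to tell whether a change to the parsing code is actually responsible for any speed-up.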

    Original Approach (in a function)

    import re
    
    def original():
        text_file = open("../prod_sum_copy.txt", "r")
        content = text_file.read()[1:-1]
        text_file.close()
        content_list = content.split(",")
    
        y_val3 = [0]*10000
    
        for idx, element in enumerate(content_list):
            m = re.search(r'mpz\(([0-9]+)\)', element)
            y_val3[idx]=int(m.group(1))
        return y_val3
    

    I am gonna presume that a significant portion of the processing time for my really short example data is just gonna be the time used to open the file on disk, read the data into memory, close the file, etc.

    %timeit original()
    140 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    Separate Readfile from Data Processing Approach

    This approach includes a minor improvement to the file reading process. The timing test does not include the file reading process, so we won't know how much that minor change affects the overall process. For the record, I eliminated the manual call to the .close() method by encapsulating the reading process in a with context manager (which handles closing in the background) as this is a Python best practice for reading in files.

    import re
    
    def read_filea():
        with open("../prod_sum_copy.txt", "r") as text_file:
            content = text_file.read()[1:-1]
            return content
    
    content = read_filea()
    print(content)
    def a():
        y_val3 = [0]*10000
        content_list = content.split(",")
        for idx, element in enumerate(content_list):
            m = re.search(r'mpz\(([0-9]+)\)', element)
            y_val3[idx]=int(m.group(1))
        return y_val3
    

    Timing just the data processing portion suggests that our prediction was right: file reading (I/O) is a big component in this simple test case. It also gives us an idea of how much time to expect for the data processing portion alone. Let's look at another approach to see if we can trim that time down a bit.

    %timeit a()
    21.5 µs ± 185 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    Simplified Data Processing Approach (and Separate Readfile)

    Here we will try to use some Python best practices OR Python tools to cut down on the overall time, including:

    • list comprehension
    • use of the re.findall() function to eliminate the direct, repeated calls to the re.search() function and the m.group() method. (NOTE: findall is likely doing some of that in the background, and I honestly don't know whether avoiding the explicit calls has a benefit.) Still, I find this approach more readable than the original.
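As a small illustration of the second point: when the pattern contains exactly one capture group, re.findall() returns a list of the captured strings, so no match objects or group() calls are needed. The sample string below is made up for this sketch and is not taken from the real file.

```python
import re

# Made-up sample in the same format as the file contents.
sample = "mpz(0), mpz(4), mpz(54357303843626)"

# With a single capture group, findall returns the captured text of each match.
digit_strings = re.findall(r"mpz\(([0-9]+)\)", sample)
print(digit_strings)  # ['0', '4', '54357303843626']

# Python integers have arbitrary precision, so int() handles large values too.
values = [int(digits) for digits in digit_strings]
print(values)  # [0, 4, 54357303843626]
```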

    Let's look at the code:

    import re
    
    def read_fileb():
        with open("../prod_sum_copy.txt", "r") as text_file:
            content = text_file.read()[1:-1]
        return content
    
    content = read_fileb()
    
    def b():
        y_val3 = [int(element) for element in re.findall(r'mpz\(([0-9]+)\)', content)]
        return y_val3
    

    The data processing portion of this approach is roughly 7 times faster than the data processing step in the original approach (2.89 µs versus 21.5 µs in my runs).

    %timeit b()
    2.89 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
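One behavioral difference is worth keeping in mind: the original code fills a pre-allocated list of 10000 zeros, while the list comprehension returns only as many values as the file contains. If the fixed length matters, the result can be padded afterwards. The following is a minimal sketch under that assumption; parse_padded, MPZ_PATTERN and the size parameter are hypothetical names chosen for illustration, and raising ValueError on overflow is a design choice of this sketch (the original loop would fail with an IndexError in the same situation).

```python
import re

# Compiling once avoids re-resolving the pattern on every call.
MPZ_PATTERN = re.compile(r"mpz\(([0-9]+)\)")


def parse_padded(content, size=10000):
    """Parse mpz(...) entries and pad the result with zeros up to `size`."""
    values = [int(digits) for digits in MPZ_PATTERN.findall(content)]
    if len(values) > size:
        raise ValueError(f"found {len(values)} values, but size is {size}")
    return values + [0] * (size - len(values))


# Usage with a made-up sample string and a small size for readability.
print(parse_padded("mpz(0), mpz(4), mpz(54357303843626)", size=5))
# [0, 4, 54357303843626, 0, 0]
```

The padding adds one list concatenation per call, so it should have little effect on the timings above, though that has not been measured here.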