
Python - working with uneven columns in rows


I am working with a dataset of thousands of rows, but the columns are uneven, as shown below:

AB  12   43   54

DM  33   41   45   56   33   77  88

MO  88   55   66   32   34 

KL  10   90   87   47   23  48  56  12

First, I want to read the data into a list or array and find the length of the longest row.
Then I will pad the shorter rows with zeros to match the longest one, so that I can iterate over them as a 2D array.

I have looked at a couple of similar questions, but could not work out the problem.

I believe there is a way in Python to do this. Could anyone please help me out?


Solution

  • I don't see any easier way to find the maximum row length than to do one pass over the data, then build the 2D array in a second pass. Something like:

    from __future__ import print_function
    import numpy as np
    from itertools import chain
    
    data = '''AB 12 43 54
    DM 33 41 45 56 33 77 88
    MO 88 55 66 32 34
    KL 10 90 87 47 23 48 56 12'''
    
    max_row_len = max(len(line.split()) for line in data.splitlines())
    
    def padded_lines():
        for uneven_line in data.splitlines():
            line = uneven_line.split()
            line += ['0']*(max_row_len - len(line))
            yield line
    
    # I will get back to the line below shortly; it unnecessarily creates the
    # array twice in memory:
    array = np.array(list(chain.from_iterable(padded_lines())), np.dtype(object))
    
    array.shape = (-1, max_row_len)
    
    print(array)
    

    This prints:

    [['AB' '12' '43' '54' '0' '0' '0' '0' '0']
     ['DM' '33' '41' '45' '56' '33' '77' '88' '0']
     ['MO' '88' '55' '66' '32' '34' '0' '0' '0']
     ['KL' '10' '90' '87' '47' '23' '48' '56' '12']]
    

    The above code is inefficient in the sense that it creates the array twice in memory. I will get back to it; I think I can fix that.

    However, numpy arrays are supposed to be homogeneous. You want to put strings (the first column) and integers (all the other columns) in the same 2D array. I still think you are on the wrong track here and should rethink the problem and pick another data structure or organize your data differently. I cannot help you with that since I don't know how you want to use the data.
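To illustrate the "pick another data structure" suggestion: if the first column is just a row label, one simple option is a plain dict mapping each label to its list of integers, so each part stays homogeneous and no padding is needed. This is only a sketch of one possibility (the names `rows`, `label`, `values` are illustrative, not from the original code):

```python
data = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''

# Map each row label to its list of integer values; rows may have
# different lengths without any padding.
rows = {}
for line in data.splitlines():
    label, *values = line.split()
    rows[label] = [int(v) for v in values]

print(rows['MO'])  # [88, 55, 66, 32, 34]
```

Whether this fits depends on how the data will be used, which the question does not say.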



    As promised, here is the solution to the efficiency issues. Note that my concerns were about memory consumption.

        import numpy as np
        from itertools import chain

        def main():

            with open('/tmp/input.txt') as f:
                max_row_len = max(len(line.split()) for line in f)

            with open('/tmp/input.txt') as f:
                str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))

            def padded_lines():
                with open('/tmp/input.txt') as f:
                    for uneven_line in f:
                        line = uneven_line.split()
                        # pad short rows with '0' up to the longest row
                        line += ['0']*(max_row_len - len(line))
                        yield line

            # fixed-width byte strings, sized to the longest token
            fmt = '|S%d' % str_len_max
            array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
            array.shape = (-1, max_row_len)
            return array


    This code could be made nicer but I will leave that up to you.
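For instance, the two preliminary passes over the file could be merged into one, tracking both maxima while reading. A sketch of that variant, written against a temporary file so it is self-contained (in the answer the input would be `/tmp/input.txt`):

```python
import os
import tempfile

# Write the sample data to a temporary file just to make this runnable.
sample = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# One pass instead of two: track the longest row and the longest token.
max_row_len = 0
str_len_max = 0
with open(path) as f:
    for line in f:
        fields = line.split()
        max_row_len = max(max_row_len, len(fields))
        if fields:
            str_len_max = max(str_len_max, max(len(s) for s in fields))

os.remove(path)
print(max_row_len, str_len_max)  # 9 2
```

The generator still needs its own pass over the file, so this saves one read, not the overall two-pass structure.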

    The memory consumption, measured with memory_profiler on a randomly generated input file with 1000000 lines and uniformly distributed row lengths between 1 and 20:

    Line #    Mem usage    Increment   Line Contents
    ================================================
         5   23.727 MiB    0.000 MiB   @profile
         6                             def main():
         7                                 
         8   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
         9   23.727 MiB    0.000 MiB           max_row_len = max(len(line.split()) for line in f)
        10                                     
        11   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
        12   23.727 MiB    0.000 MiB           str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))
        13                                 
        14   23.727 MiB    0.000 MiB       def padded_lines():
        15                                     with open('/tmp/input.txt') as f:
        16   62.000 MiB   38.273 MiB               for uneven_line in f:
        17                                             line = uneven_line.split()
        18                                             line += ['0']*(max_row_len - len(line))
        19                                             yield line
        20                                 
        21   23.727 MiB  -38.273 MiB       fmt = '|S%d' % str_len_max
        22                                 array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
        23   62.004 MiB   38.277 MiB       array.shape = (-1, max_row_len)
    

    With the code from eumiro's answer, and the same input file:

    Line #    Mem usage    Increment   Line Contents
    ================================================
         5   23.719 MiB    0.000 MiB   @profile
         6                             def main():
         7   23.719 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
         8  638.207 MiB  614.488 MiB           arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T
    

    Comparing the memory consumption increments: my updated code consumes about 16 times less memory than eumiro's (614.488/38.273 ≈ 16).

    As for speed: my updated code runs in 3.321 s on this input, while eumiro's runs in 5.687 s, so mine is about 1.7x faster on my machine. (Your mileage may vary.)

    If efficiency is your primary concern (as suggested by your comment "Hi eumiro, I suppose this is more efficient." and then changing the accepted answer), then I am afraid you accepted the less efficient solution.

    Don't get me wrong: eumiro's code is really concise, and I certainly learned a lot from it. If efficiency were not my primary concern, I would go with eumiro's solution too.
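    For reference, eumiro's one-liner uses `itertools.izip_longest`, which is Python 2 only. A Python 3 equivalent, sketched here on the in-memory sample rather than a file, would be:

```python
import itertools as it
import numpy as np

data = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''

# zip_longest pads the shorter rows with '0'; the result is columns,
# so .T transposes them back into rows.
arr = np.array(list(it.zip_longest(*[line.split() for line in data.splitlines()],
                                   fillvalue='0'))).T
print(arr.shape)  # (4, 9)
```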