python python-3.x range cut

How to create a copy of a file with multiple ranges of columns removed in Python 3


I want to make a copy of a file of fixed-width records with multiple numeric column ranges removed. For example, a file has fixed-width records 1600 characters long, and I want to keep columns 0-83, 89-1517, and 1526-end. This is for use in a larger problem, so standalone utilities like cut and awk won't help here.

I have the following, which I apply to each line/record. It works okay, but I wonder if there is anything obviously better.

"".join([full[:84], full[89:1518], full[1526:]])

In particular, I'd find it more natural to specify what to cut than what to keep. Is there a standard library function, or an easy-to-read quick function, that works more like this?

 # hypothetical
 cut(line, [[84, 88], [1519, 1525]])

ADDITION

Building on the accepted answer: sort the list of cuts, so the caller can give them in any order. It would be nice to add overlap detection as well.

def cut(line, cuts):
    sorted_cuts = sorted(cuts, key=lambda x: x[0])
    
    return ''.join(line[slice(keep_start, keep_end)]
                   for keep_start, keep_end in zip(
                           [None] + [cut_end for cut_start, cut_end in sorted_cuts],
                           [cut_start for cut_start, cut_end in sorted_cuts] + [None]))


origline = "0123456789"

assert (cut(origline, [[1,2], [3,4]]) ==
        cut(origline, ([3,4], (1,2))) ==
        cut(origline, [[3,4], [1,2]]))

print(cut(origline, [[1,2], [3,4]]))
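
One possible way to add the overlap detection mentioned above, as a minimal sketch. The name cut_checked is hypothetical, and the sketch assumes each cut is a half-open [start, end) pair with non-negative indices, as in the accepted answer. Once the cuts are sorted by start, it is enough to compare each cut with the one before it.

```python
def cut_checked(line, cuts):
    """Remove half-open [start, end) ranges from line; reject overlapping cuts.

    cut_checked is a hypothetical name used for illustration only.
    """
    sorted_cuts = sorted(cuts, key=lambda c: c[0])

    # After sorting by start, two cuts overlap only if some cut begins
    # before the cut immediately preceding it has ended.
    for (prev_start, prev_end), (next_start, next_end) in zip(sorted_cuts, sorted_cuts[1:]):
        if next_start < prev_end:
            raise ValueError(
                f"overlapping cuts: [{prev_start}, {prev_end}) and [{next_start}, {next_end})")

    keep_starts = [None] + [cut_end for cut_start, cut_end in sorted_cuts]
    keep_ends = [cut_start for cut_start, cut_end in sorted_cuts] + [None]
    return ''.join(line[start:end] for start, end in zip(keep_starts, keep_ends))
```

Cuts that merely touch, such as [1, 3] and [3, 5], are accepted here; change the comparison to `<=` if those should be rejected too.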

Solution

  • Here is an implementation of your hypothetical cut function.

    def cut(line, cuts):
        
        return ''.join(line[slice(keep_start, keep_end)]
                       for keep_start, keep_end in zip(
                               [None] + [cut_end for cut_start, cut_end in cuts],
                               [cut_start for cut_start, cut_end in cuts] + [None]))
    
    print(cut('abcdefghijklmnopqrstuvwxyz', [[1,3], [9,10]]))
    

    gives:

    adefghiklmnopqrstuvwxyz
    

    (bc and j were cut)

    So:

    • the first slice to keep goes from the start of the string to the start of the first cut,
    • the next slice to keep goes from the end of the first cut to the start of the second cut,
    • ...
    • the last slice to keep goes from the end of the last cut to the end of the string.

    The [None] + [cut_end for cut_start, cut_end in cuts] is the start of each slice to keep, in this example [None, 3, 10]

    The [cut_start for cut_start, cut_end in cuts] + [None] is the end of each slice to keep, in this example [1, 9, None]

    where None means start/end of string as used by the slice builtin.
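
To make the pairing concrete, the two lists can be built separately and inspected before joining. This is only an illustrative breakdown of the function above; the variable names keep_starts, keep_ends and pieces are not part of the original code.

```python
text = 'abcdefghijklmnopqrstuvwxyz'
cuts = [[1, 3], [9, 10]]

# Start of each slice to keep: beginning of string, then the end of each cut.
keep_starts = [None] + [cut_end for cut_start, cut_end in cuts]
# End of each slice to keep: the start of each cut, then end of string.
keep_ends = [cut_start for cut_start, cut_end in cuts] + [None]

# zip pairs them up as (None, 1), (3, 9), (10, None).
pieces = [text[slice(start, end)] for start, end in zip(keep_starts, keep_ends)]

print(keep_starts)       # [None, 3, 10]
print(keep_ends)         # [1, 9, None]
print(pieces)            # ['a', 'defghi', 'klmnopqrstuvwxyz']
print(''.join(pieces))   # adefghiklmnopqrstuvwxyz
```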



    Note: to implement the cuts given in your example, you would supply the arguments to this cut function as:

     cut(line, [[84, 89], [1519, 1526]])
    

    where the second element of each 2-element list is the index after the end of the cut, in keeping with normal python indexing conventions.

    If you would rather pass the index of the last column to cut (in order to get exactly the cut function that you describe above), then in the above code you would replace:

    [cut_end for cut_start, cut_end in cuts]
    

    with:

    [cut_end + 1 for cut_start, cut_end in cuts]
    

    For convenience, here is the full code of the function in that case, and the calling code that you would use in your example:

    def cut(line, cuts):
        
        return ''.join(line[slice(keep_start, keep_end)]
                       for keep_start, keep_end in zip(
                               [None] + [cut_end + 1 for cut_start, cut_end in cuts],
                               [cut_start for cut_start, cut_end in cuts] + [None]))
    
    print(cut(line, [[84, 88], [1519, 1525]]))
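
For the original goal of copying a whole file, the function can be applied record by record. The following is a rough sketch rather than a definitive implementation: copy_without_columns is a hypothetical helper, it uses the half-open [start, end) form of cut, and io.StringIO objects stand in for real files opened with open(). Because the last kept slice runs to the end of each record, the trailing newline is preserved.

```python
import io


def cut(line, cuts):
    # Half-open convention: each cut is [start, end), end not included.
    keep_starts = [None] + [cut_end for cut_start, cut_end in cuts]
    keep_ends = [cut_start for cut_start, cut_end in cuts] + [None]
    return ''.join(line[start:end] for start, end in zip(keep_starts, keep_ends))


def copy_without_columns(src, dst, cuts):
    """Write every record of src to dst with the given column ranges removed.

    Hypothetical helper; src and dst are any text file-like objects.
    """
    for record in src:
        dst.write(cut(record, cuts))


# Short demo records instead of 1600-character ones.
src = io.StringIO("0123456789\nabcdefghij\n")
dst = io.StringIO()
copy_without_columns(src, dst, [[1, 3], [5, 6]])
result = dst.getvalue()
print(result, end='')
```

With real files, src and dst would be the objects returned by open(path) and open(other_path, 'w'), and the cuts for the example in the question would be [[84, 89], [1519, 1526]].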