
How to delete all lines that contain a number lower than given value from a text file?


I'm trying to remove every line from a file that contains any number below -2000. I'm quite new to Python, and most likely I don't fully understand the re module; I'm also not sure about the method I'm using.

Here is the sample file:

{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 }
{ "Position": { "X": -494.0083, "Y": 57.33647, "Z": 56.59263 }
{ "Position": { "X": -1039.662, "Y": -2641.444, "Z": 36.96656 }

And here is what I've got:

 with open('file.json','r') as input:
    with open("temp.json", 'w') as output:  
        for line in input:
            match = re.search(r'('-'\d+)', line)
            my_number = float(match.group())
            if my_number < -2000:
                output.write(line.strip())

Right now I'm fairly sure that the '-' in re.search(r'('-'\d+)', line) is wrong. I'm also not sure I'm using match.group() properly.

If anyone would guide me in the right direction or suggest a different method I would be grateful.


Solution

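  • You can use a regex like r'-[2-9]\d{3,}', which matches most numbers at -2000 or lower.

    Explanation: it matches a - sign, then one digit from 2 to 9, then three or more further digits. Because the first digit must be 2-9, it misses values such as -10000; r'-(?:[1-9]\d{4,}|[2-9]\d{3})' covers those as well.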

    Why use regex? It's actually faster than an approach with json.loads (see below).

    The downside, however, is that it also matches exactly -2000, which is not below -2000.

    import re
    
    # please note: not valid JSON here
    line = '{ "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 }'
    
    match = re.search(r'-[2-9]\d{3,}', line)
    
    print(bool(match))
    print(match.group(0))
    

    Output:

    True
    -3107
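
To see where the pattern's boundaries lie, a quick check like the following may help. It is only a sketch: the sample values and the `PATTERN` and `results` names are made up for illustration and do not come from the question's file.

```python
import re

# The pattern from above; PATTERN is just an illustrative name.
PATTERN = re.compile(r'-[2-9]\d{3,}')

# Hypothetical sample values chosen to probe the edges of the pattern.
samples = ['-1999.9', '-2000', '-2000.5', '-3107.795', '-10000']

# Map each sample to whether the pattern finds a match in it.
results = {s: bool(PATTERN.search(s)) for s in samples}
for value, matched in results.items():
    print(f'{value:>10}  matched={matched}')
```

Here -2000 matches even though it is not below -2000, and -10000 does not match even though it is, because the first digit must be 2-9. Both cases suggest that comparing the matched value numerically is safer than relying on the pattern alone.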
    

    To handle the edge case of a number that is exactly -2000 (which should not count as below -2000), you can use re.findall to collect every number at -2000 or below, cast each one to float, and do a comparison similar to how you had it above:

    import re
    
    
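    # Candidates at or below -2000: five or more integer digits, or four digits starting with 2-9.
    NEG_TARGET_RE = re.compile(r'-(?:[1-9]\d{4,}|[2-9]\d{3})(?:\.\d+)?')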
    
    
    def has_any_num_less_than_target(line, target=-2000) -> bool:
        for m in NEG_TARGET_RE.findall(line):
            if float(m) < target:
                return True
    
        return False
    
    
    line = '{ "Position": { "X": -1660.313, "Y": -2000.795, "Z": 12.85458 }'
    print(has_any_num_less_than_target(line))  # True
    
    line = '{ "Position": { "X": -1660.313, "Y": -2000, "Z": -2000.000 }'
    print(has_any_num_less_than_target(line))  # False
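
Putting this together for the original task, here is a minimal sketch that copies a file while dropping every line that contains a number below a threshold. Rather than hard-coding the threshold into the pattern, it extracts every number and compares it as a float, so any threshold should work. The names `NUMBER_RE`, `line_has_number_below`, and `filter_file` are assumptions for illustration, and the pattern deliberately ignores exponent notation.

```python
import re

# Matches integers and simple decimals, optionally negative.
# Assumption: the file contains no exponent notation such as 1e-5.
NUMBER_RE = re.compile(r'-?\d+(?:\.\d+)?')


def line_has_number_below(line: str, threshold: float = -2000) -> bool:
    """Return True if any number found in the line is strictly below threshold."""
    return any(float(m) < threshold for m in NUMBER_RE.findall(line))


def filter_file(src_path: str, dst_path: str, threshold: float = -2000) -> int:
    """Copy src_path to dst_path, skipping lines with a number below threshold.

    Returns the number of lines that were dropped.
    """
    dropped = 0
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            if line_has_number_below(line, threshold):
                dropped += 1
            else:
                dst.write(line)  # write the line unchanged, newline included
    return dropped
```

With the sample file from the question, `filter_file('file.json', 'temp.json')` would keep only the second line. Note that this writes the lines that pass the check, which is the opposite of the condition in the question's loop, and it does not strip the trailing newline.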
    

    Performance Comparison

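    A quick benchmark on three sample lines (with the closing brace added so they are valid JSON) shows the regex approach running roughly 4x faster than an approach with json.loads. Exact timings will vary by machine.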

    import json
    from timeit import timeit
    
    lines = """
    { "Position": { "X": -1660.313, "Y": -3107.795, "Z": 12.85458 } }
    { "Position": { "X": -494.0083, "Y": 57.33647, "Z": 56.59263 } }
    { "Position": { "X": -1039.662, "Y": -2641.444, "Z": 36.96656 } }
    """
    
    print('json:  ', timeit(r"""
    for line in lines.strip().split('\n'):
        d = json.loads(line)
        if all(x >= -2000 for x in d['Position'].values()):
            ...
    """, globals=globals(), number=1000))
    
    print('re:    ', timeit(r"""
    for line in lines.strip().split('\n'):
        if not has_any_num_less_than_target(line):
            ...
    """, globals=globals(), number=1000))
    

    Result:

    json:   0.004367416957393289
    re:     0.0011514590587466955