Remove certain rows without iterating the whole file line by line in Python


I have a dataset like the one below:

Category,Date,Id,Amount
Risk A,11/12/2020,1,-10
Risk A,11/13/2020,2,10
Risk A,11/14/2020,3,22
Risk A,11/15/2020,4,32
Total Risk A : 4  ----- needs to be removed
Risk C,11/9/2020,5,43
Risk C,11/10/2020,6,22
Risk C,11/11/2020,7,11
Risk C,11/12/2020,8,-50
Total Risk C : 4   ----- needs to be removed
Risk D,11/12/2020,9,3
Risk D,11/13/2020,10,1
Risk D,11/14/2020,11,3
Risk D,11/15/2020,12,4
Risk D,11/9/2020,13,55
Risk D,11/10/2020,14,32
Total Risk C : 6      ----- needs to be removed

In between the data rows there are some total (summary) rows that I need to remove from the file. I am looking for a better way to remove these rows without iterating over the file line by line in Python, since I have a few thousand rows and removing the summary lines one at a time is slow. Any suggestions?


Solution

  • You can use a regular expression to perform the string substitution:

    import re
    t = """Category,Date,Id,Amount
    Risk A,11/12/2020,1,-10
    Risk A,11/13/2020,2,10
    Risk A,11/14/2020,3,22
    Risk A,11/15/2020,4,32
    Total Risk A : 4  ----- needs to be removed
    Risk C,11/9/2020,5,43
    Risk C,11/10/2020,6,22
    Risk C,11/11/2020,7,11
    Risk C,11/12/2020,8,-50
    Total Risk C : 4   ----- needs to be removed
    Risk D,11/12/2020,9,3
    Risk D,11/13/2020,10,1
    Risk D,11/14/2020,11,3
    Risk D,11/15/2020,12,4
    Risk D,11/9/2020,13,55
    Risk D,11/10/2020,14,32
    Total Risk C : 6      ----- needs to be removed"""
    
    print(re.sub(r'\nTotal.*','', t))
    

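    re.sub finds every part of the text that matches the pattern r'\nTotal.*' (a newline, followed by the word "Total", followed by any characters up to the end of the line) and replaces it with ''. Because the pattern starts with a newline, a total row on the very first line of the text would not be matched.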
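The snippet above works on a string literal. To apply the same substitution to a file on disk, something like the following sketch could be used. The names remove_total_rows and clean_file are hypothetical helpers introduced here for illustration, and the sketch assumes the file is small enough to read into memory in one go (a few thousand rows easily is).

```python
import re
from pathlib import Path

# Same pattern as in the answer: a newline, the word "Total",
# then everything up to the end of that line.
TOTAL_ROW = re.compile(r'\nTotal.*')


def remove_total_rows(text):
    """Return text with every line that starts with 'Total' removed.

    A 'Total' row on the very first line is not matched, because the
    pattern requires a preceding newline.
    """
    return TOTAL_ROW.sub('', text)


def clean_file(src, dst):
    """Read src as a whole, drop the total rows, and write the result to dst."""
    text = Path(src).read_text()
    Path(dst).write_text(remove_total_rows(text))
```

For example, clean_file('input.csv', 'output.csv') would write a copy of the file without the summary rows (both file names are placeholders). Reading the whole file at once keeps the Python-level code free of an explicit loop; the regex engine still scans the text internally.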