
Strip text with custom regex in Python


(2, 43) 0.74670222994
(3, 15) 0.74132892839
(3, 31) 0.671141877647
(4, 19) 0.699490245832
(4, 47) 0.422715095257
(4, 48) 0.433278265941
(4, 0)  0.379862196713
(5, 19) 0.653731227092
(5, 72) 0.756726821729

Above is a tf-idf matrix that has been written to a file. I want to read only the tf-idf values, such as 0.74132892839, and append them to a list.

Is there a way to do f.read() and then strip the indices off?


Solution

  • A simple solution using the re.sub() function:

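    import re
    
    # specify your actual file name
    with open('lines.txt', 'r') as fh:
        # remove each "(row, col)" index and the whitespace after it, then split
        # on whitespace so a trailing newline does not leave an empty string
        result = re.sub(r'\([^)]+\)\s*', '', fh.read()).split()
    
    print(result)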
    

    The output:

    ['0.74670222994', '0.74132892839', '0.671141877647', '0.699490245832', '0.422715095257', '0.433278265941', '0.379862196713', '0.653731227092', '0.756726821729']
    

    \([^)]+\) - matches a sequence between parentheses, such as (2, 43)

    \s* - matches any whitespace that follows the closing parenthesis
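
If the values are needed as numbers rather than strings, one possible variant (a sketch, not part of the original answer) captures the token that follows each index pair with re.findall() and converts it with float(). The sample text and the helper name extract_values are assumptions made for illustration; in practice the text would come from fh.read().

```python
import re

# Hypothetical sample that mirrors the file layout shown in the question
sample = """(2, 43) 0.74670222994
(3, 15) 0.74132892839
(4, 0)  0.379862196713
"""

def extract_values(text):
    """Return the number that follows each "(row, col)" index as a float."""
    # \([^)]+\) matches the index pair, \s* skips the gap,
    # and (\S+) captures the value that comes next
    return [float(v) for v in re.findall(r'\([^)]+\)\s*(\S+)', text)]

values = extract_values(sample)
print(values)
```

Capturing the value directly avoids a separate split step, and float() raises a ValueError if a captured token is not a valid number, which makes malformed lines easy to notice.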