Search code examples
pythonpython-3.xtext-filesdata-analysisdata-extraction

How to extract specific lines from a text file and then from these extracted line, extract the values between parantheses and put them in another file


infile = open('results1', 'r')
lines = infile.readlines()
import re
for line in lines:
if re.match("track: 1,", line):
    print(line) 

question solved by using python regex below


Solution

  • I suggest you use Regular Expressions library (re) which gives you all you need to extract the data from text files. I ran a simple code to solve your current problem:

    import re
    # Customize path as the file's address on your system
    text_file = open('path/sample.txt','r')
    # Read the file line by line using .readlines(), so that each line will be a continuous long string in the "file_lines" list
    file_lines = text_file.readlines()
    

    Depending on how your target is located in each line, detailed process from here on could be a little different but the overall approach is the same in every scenario. I have assumed your only condition is that the line starts with "Id of the track" and we are looking to extract all the values between parentheses all in one place.

    # A list to append extracted data
    list_extracted_data = []
    for line in list_lines:
    # Flag is True if the line starts (special character for start: \A) with 'Id of the track'
        flag = re.search('\AId of the track',line)
        if flag:
            searched_phrase = re.search(r'\B\(.*',line)
            start_index, end_index = searched_phrase.start(), searched_phrase.end()
    # Select the indices from each line as it contains our extracted data
            list_extracted_data.append(line[start_index:end_index])
            
    print(list_extracted_data)
    

    ['(0.8835006455995176, -0.07697617837544447)', '(0.8835006455995176, -0.07697617837544447)', '(0.8835006455995176, -0.07697617837544447)', '(0.8835006455995176, -0.07697617837544447)', '(0.8755597308669424, -0.23473345870373538)', '(0.8835006455995176, -0.07697617837544447)', '(0.8755597308669424, -0.23473345870373538)', '(6.4057079727806485, -0.6819141582566414)', '(1.1815888836384334, -0.35535274681454954)']

    you can do all sorts of things after you've selected the data from each line, including convert it to numerical type or separating the two numbers inside the parentheses. I assume your intention was to add each of the numbers inside into a different column in a dataFrame:

    final_df = pd.DataFrame(columns=['id','X','Y'])
    for K, pair in enumerate(list_extracted_data):
        # split by comma, select the left part, exclude the '(' at the start
        this_X = float(pair.split(',')[0][1:])
        # split by comma, select the right part, exclude the ')' at the end
        this_Y = float(pair.split(',')[1][:-1])
        final_df = final_df.append({'id':K,'X':this_X,'Y':this_Y},ignore_index=True)
    

    enter image description here