Search code examples
pythoncsvtext

Text file to CSV with a special condition on the text


I have a text file that looks like this

******************** extractive sentences ********************

[gold] This is gold sentence

[raw ] This is raw sentence

[rev ] This is rev sentence

[ref ] This is ref sentence

******************** extractive sentences ********************

[gold] This is gold sentence

[raw ] This is raw sentence

[rev ] This is rev sentence

[ref ] This is ref sentence

I want to create a CSV, such that all the [gold] sentences come in 1 column and the corresponding [rev] sentences come in the second column

I tried reading the file using pd.read_csv but it gives error as the sentences can contain ',' and it gives an error like this:

'Expected 1 fields in line 11, saw 2'

Can anyone help me in this?


Solution

  • Providing that the data in your .txt file is structured consistently as outlined in your original question, the below solution will pull content from all "[gold]" and "[rev ]" lines and each pair of these to a .csv file with the appropriate headers.

    It achieves this by:

    • Reading each line from input.txt into a list of strings
    • Using re to filter this list for lines starting [gold], stripping the leading [gold] and trailing newline character(\n) into a new list containing content from all "[gold]" lines in the order in which they appear in input.txt
    • Repeating the above step for lines starting [rev ], filtering these into their own list
    • Creating a pandas.DataFrame using the lists generated above to populate columns under the corresponding headers (gold and rev) - with gold being the first column and rev the second as per your original question
    • Writing the DataFrame out to output.csv by way of pandas.DataFrame.to_csv()
    import pandas as pd
    import re
    
    with open('input.txt', 'r') as f:
        lines = f.readlines()
        
    gold = [ re.sub(r'(\[gold\]\s)|\n', '', l) for l in lines if re.search(r'^\[gold\]', l) ]
    
    rev = [ re.sub(r'(\[rev\s\]\s)|\n', '', l) for l in lines if re.search(r'^\[rev\s\]', l) ]
    
    data = {'gold': gold, 'rev': rev}
    
    df = pd.DataFrame(data)
    
    df.to_csv('output.csv', index=False)
    

    Important Consideration

    The above solution makes the assumption that for each "[gold]" there will be a corresponding "[rev ]" - as in the example you provided.

    If there are "[gold]" lines that do not have corresponding "[rev ]" lines - or vice-versa - within each "section" of the .txt file you are reading, the ordering/pairing of rows will be incorrect.

    However, if the data in your .txt files is reliably consistent in terms of being structured uniformly as per your original example, this will not be an issue.