I have a text file that looks like this
******************** extractive sentences ********************
[gold] This is gold sentence
[raw ] This is raw sentence
[rev ] This is rev sentence
[ref ] This is ref sentence
******************** extractive sentences ********************
[gold] This is gold sentence
[raw ] This is raw sentence
[rev ] This is rev sentence
[ref ] This is ref sentence
I want to create a CSV, such that all the [gold]
sentences come
in 1 column and the corresponding [rev]
sentences come in the second column
I tried reading the file using pd.read_csv
but it gives error as the sentences can contain ',' and it gives an error like this:
'Expected 1 fields in line 11, saw 2'
Can anyone help me in this?
Providing that the data in your .txt file is structured consistently as outlined in your original question, the below solution will pull content from all "[gold]
" and "[rev ]
" lines and each pair of these to a .csv file with the appropriate headers.
It achieves this by:
input.txt
into a list of stringsre
to filter this list for lines starting [gold]
, stripping the leading [gold]
and trailing newline character(\n
) into a new list containing content from all "[gold]
" lines in the order in which they appear in input.txt
[rev ]
, filtering these into their own listpandas.DataFrame
using the lists generated above to populate columns under the corresponding headers (gold
and rev
) - with gold
being the first column and rev
the second as per your original questionDataFrame
out to output.csv
by way of pandas.DataFrame.to_csv()
import pandas as pd
import re
with open('input.txt', 'r') as f:
lines = f.readlines()
gold = [ re.sub(r'(\[gold\]\s)|\n', '', l) for l in lines if re.search(r'^\[gold\]', l) ]
rev = [ re.sub(r'(\[rev\s\]\s)|\n', '', l) for l in lines if re.search(r'^\[rev\s\]', l) ]
data = {'gold': gold, 'rev': rev}
df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)
Important Consideration
The above solution makes the assumption that for each "[gold]
" there will be a corresponding "[rev ]
" - as in the example you provided.
If there are "[gold]
" lines that do not have corresponding "[rev ]
" lines - or vice-versa - within each "section" of the .txt file you are reading, the ordering/pairing of rows will be incorrect.
However, if the data in your .txt files is reliably consistent in terms of being structured uniformly as per your original example, this will not be an issue.