I have a very dirty, text-heavy dataset. The column values are tab-separated, but there are many line breaks inside what should be a single row of data. Each data entry is terminated by a hard '\n' marker.
I tried setting the lineterminator argument to '\n', but it still reads the line breaks as new rows.
Any regex or similar preprocessing I've tried tends to destroy the tab separations, which I need in order to load the data into a dataframe. A word-wise or line-wise pass over the file is also not feasible given the size of the dataset.
Is there a way to get pandas not to treat these line breaks as new rows, and to start a new row only when it sees a '\n'?
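For concreteness, here is a minimal reproduction of the failure with made-up data (the values and column layout are hypothetical, not from the real file):

```python
import io

import pandas as pd

# Two logical records, but the first one contains an embedded line break;
# each record ends with a literal "\N" marker (made-up sample data).
raw = "1\tfirst part\nsecond part\\N\n2\tanother entry\\N\n"

df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)
print(len(df))  # 3 physical rows instead of the 2 actual records
```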
Snapshot of my data: [screenshot: the unprocessed dataset]
A quick look at the current state: [screenshot: current output]
The highlighted red box should be one entry.
You could preprocess the file into a proper TSV and then read that. Use itertools.groupby to find the "\N" endings. If the file has other problems, such as unescaped internal tabs, all bets are off.
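As a toy illustration of what groupby does here (the sample lines are made up): it splits an iterable of lines into maximal runs of consecutive lines that do, or do not, match the terminator regex.

```python
import itertools
import re

# Runs of consecutive lines are grouped by whether they end with "\N".
sep = re.compile(r"\\N$")
lines = ["a", "b\\N", "c", "d", "e\\N"]

groups = [(bool(key), list(run))
          for key, run in itertools.groupby(lines, sep.search)]
print(groups)
# [(False, ['a']), (True, ['b\\N']), (False, ['c', 'd']), (True, ['e\\N'])]
```

The same partitioning drives the script below: terminator lines close a record, and the runs in between are continuation lines to be glued together.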
import itertools
import re

# A record terminator: optional whitespace, a literal "\N", then end of line.
separator_re = re.compile(r"\s*\\N\s*$", re.MULTILINE)

with open('other.csv') as infp, open('other-conv.csv', 'w') as outfp:
    # groupby partitions the file into runs of consecutive lines that
    # either do or do not end with the "\N" terminator.
    for hassep, subiter in itertools.groupby(infp, separator_re.search):
        if hassep:
            # Terminating lines: strip the "\N" and end the record with '\n'.
            outfp.writelines("{}\n".format(separator_re.sub("", line))
                             for line in subiter)
        else:
            # Continuation lines: glue them onto the current record.
            for line in subiter:
                if line.endswith("\\\n"):
                    # Backslash-continued line: replace the break with a space.
                    line = line[:-2] + " "
                else:
                    # Otherwise drop the stray line break entirely.
                    line = line.strip()
                outfp.write(line)
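Once the preprocessing has run, the converted file is a regular one-record-per-line TSV and loads cleanly. A sketch of that final step, using in-memory data standing in for other-conv.csv (the column names are hypothetical):

```python
import io

import pandas as pd

# Stand-in for the contents of other-conv.csv after preprocessing:
# tabs intact, one record per line, "\N" terminators removed.
converted = "1\tfirst part second part\n2\tanother entry\n"

df = pd.read_csv(io.StringIO(converted), sep="\t", header=None,
                 names=["id", "text"])  # hypothetical column names
print(df.shape)  # (2, 2)
```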