Tags: python, pandas, csv, data-analysis, data-cleaning

Can I bypass the hard-coded behaviour in Pandas/Python and set a line terminator of my choice?


I have a super dirty, text-heavy dataset. The column values are tab-separated, but there are many line breaks inside what should be a single row of data. Complete data entries are separated by a hard '\n' marker.

I tried setting the lineterminator argument to '\n', but pandas still reads the embedded line breaks as new rows. Any regex or similar preprocessing I attempt tends to lose the tab separations, which I need in order to load the data into a dataframe, and a word-by-word or line-by-line pass is not really feasible given the size of the dataset.
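
Roughly what I tried (the file name here is just a placeholder); note that pandas' C parser only accepts a single-character lineterminator, so the marker in my file cannot be handed to it directly:

    import pandas as pd

    # What I tried, more or less: a plain '\n' terminator just splits on
    # every embedded line break again, and a longer terminator string is
    # rejected by the parser.
    df = pd.read_csv('other.csv', sep='\t', lineterminator='\n')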

Is there a way to get pandas not to treat every line break as a new row, and to start a new record only when it sees a '\n'?

Snapshot of my data: [screenshot: the unprocessed dataset]

Below is a quick look at the current state: [screenshot: current output]

The highlighted red box should be one entry.


Solution

  • You could preprocess the file into a proper TSV and then read that instead. Use itertools.groupby to find the lines with "\N" endings, as in the script below. If there are other problems with this file, such as internal tabs not being escaped, all bets are off.

    import itertools
    import re

    # A record is complete only when the line ends with a literal "\N",
    # optionally surrounded by whitespace.
    separator_re = re.compile(r"\s*\\N\s*$", re.MULTILINE)

    with open('other.csv') as infp:
        with open('other-conv.csv', 'w') as outfp:
            # groupby splits the input into alternating runs of lines that
            # end with the "\N" marker and lines that do not.
            for hassep, subiter in itertools.groupby(infp, separator_re.search):
                if hassep:
                    # Complete records: strip the "\N" marker and emit one
                    # record per line.
                    outfp.writelines("{}\n".format(separator_re.sub("", line))
                        for line in subiter)
                else:
                    # Continuation lines: join them onto the current record.
                    for line in subiter:
                        if line.endswith("\\\n"):
                            # An escaped newline (trailing backslash) becomes
                            # a single space.
                            line = line[:-2] + " "
                        else:
                            # Otherwise drop the embedded newline entirely.
                            line = line.strip()
                        outfp.write(line)
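
    Once the preprocessing pass has written other-conv.csv, it should load as an ordinary TSV. A minimal sketch, assuming the converted file has no header row (adjust header/names to match your columns):

    import pandas as pd

    # Read the cleaned, one-record-per-line file. header=None is an
    # assumption; adjust it to the actual layout of your data.
    df = pd.read_csv('other-conv.csv', sep='\t', header=None)
    print(df.head())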