Search code examples
pythonpandascsvtext

Python pandas read file with delimiters inside the data?


I regularly have to read text files as such:

Column1; Column2; Column3

data1;dat;a2;data3

Meaning that the data in Column 2 uses the same character used as the delimiter, therefore pandas incorrectly reads this data as two columns. This is just an illustrative example, this is not the real data.

How do I efficiently overcome this? I don't know a way besides manually editing the text file, but this becomes highly inefficient very fast.

Edit: Best preemptive knowledge I can have is predicting which columns are likely to have delimiters. For example, I could realistically guess beforehand that column 2 will have delimiters in the text.


Solution

  • I solved the issue with the following code. Only prerequisite is that you have an idea of the columns that will be malformed:

    all_corrected = []  # list to hold all corrected text lines
    
    bad_col = 2  # index of bad column
    prev_col = bad_col - 1
    
    with open('malformed.txt') as file:
        header = file.readline().strip().split(';')
    
        for line in file:
            temp_line = line.strip().split(';')
            correct_line = temp_line[:prev_col] + [';'.join(temp_line[prev_col:bad_col+1])] + temp_line[bad_col+1:]
    
            all_corrected.append(correct_line)
        
    df = pd.DataFrame(all_corrected, columns=header)
    

    This code is based on Serge Ballesta's answer on Issue with delimiter inside column data. Thanks to @wjandrea and @JonSG for the support.