I regularly have to read text files as such:
Column1; Column2; Column3
data1;dat;a2;data3
Meaning that the data in Column 2 uses the same character used as the delimiter, therefore pandas incorrectly reads this data as two columns. This is just an illustrative example, this is not the real data.
How do I efficiently overcome this? I don't know a way besides manually editing the text file, but this becomes highly inefficient very fast.
Edit: Best preemptive knowledge I can have is predicting which columns are likely to have delimiters. For example, I could realistically guess beforehand that column 2 will have delimiters in the text.
I solved the issue with the following code. Only prerequisite is that you have an idea of the columns that will be malformed:
all_corrected = [] # list to hold all corrected text lines
bad_col = 2 # index of bad column
prev_col = bad_col - 1
with open('malformed.txt') as file:
header = file.readline().strip().split(';')
for line in file:
temp_line = line.strip().split(';')
correct_line = temp_line[:prev_col] + [';'.join(temp_line[prev_col:bad_col+1])] + temp_line[bad_col+1:]
all_corrected.append(correct_line)
df = pd.DataFrame(all_corrected, columns=header)
This code is based on Serge Ballesta's answer on Issue with delimiter inside column data. Thanks to @wjandrea and @JonSG for the support.