I have a csv file. The third column can contain odd separators like this:
Name,Last Name,Job,ID
John,Smith,Architect,ID2020
Taylor,Swift,Singer,Songwriter,ID2020-123
I know that the third column can some times be faulty, also next column always starts with ID. The odd comma is between singer and songwriter. How can I replace the odd comma with lets say tilda, so it would be possible to read the file using pandas without errors? The actual file has 30 columns, so regular expressions are probably way to go. Thanks for your time.
Try the following approach:
import pandas as pd
import csv
data = []
with open('input.csv') as f_input:
csv_input = csv.reader(f_input)
header = next(csv_input)
for row in csv_input:
data.append([*row[:2], ' '.join(row[2:-1]), row[-1]])
df = pd.DataFrame(data, columns=header)
print(df)
For your example, this gives:
Name Last Name Job ID
0 John Smith Architect ID2020
1 Taylor Swift Singer Songwriter ID2020-123
This assumes the unwanted commas are only in the Job column. It takes the Name
and Last Names
fields, then combines all fields until the last ID
field. So in effect the Job
field can have any number of commas.
This will need to be tweaked depending where all the other columns go.