Search code examples
pythonpandascsvpython-re

How to replace odd separators from a csv file to something else using python regex?


I have a csv file. The third column can contain odd separators like this:

Name,Last Name,Job,ID
John,Smith,Architect,ID2020
Taylor,Swift,Singer,Songwriter,ID2020-123

I know that the third column can some times be faulty, also next column always starts with ID. The odd comma is between singer and songwriter. How can I replace the odd comma with lets say tilda, so it would be possible to read the file using pandas without errors? The actual file has 30 columns, so regular expressions are probably way to go. Thanks for your time.


Solution

  • Try the following approach:

    import pandas as pd
    import csv
    
    data = []
    
    with open('input.csv') as f_input:
        csv_input = csv.reader(f_input)
        header = next(csv_input)
        
        for row in csv_input:
            data.append([*row[:2], ' '.join(row[2:-1]), row[-1]])
    
    df = pd.DataFrame(data, columns=header)
    print(df)
    

    For your example, this gives:

         Name Last Name                    Job          ID
    0    John     Smith              Architect      ID2020
    1  Taylor     Swift      Singer Songwriter  ID2020-123
    

    This assumes the unwanted commas are only in the Job column. It takes the Name and Last Names fields, then combines all fields until the last ID field. So in effect the Job field can have any number of commas.

    This will need to be tweaked depending where all the other columns go.