Search code examples
pythoncsvparsingsplit

Parsing a string with text within it


I'm trying to read in and parse a CSV output file from PostgreSQL, but there are some oddities in the data. The data contains vehicle details. One problem is with old Land Rovers which have models measured in inches, e.g. 100". PostgreSQL outputs this as "100'"" - it uses a single quote as the escape character. REGXXXX,LAND ROVER,"100'"",1973-11-16,Diesel,2500,Silver I've tried the csv.reader but this throws an error when it gets to "100'"" even if I specify encoding="utf8".

I can just open it as a text file (using encoding="utf8"), take a line at a time and use split(","), which handles the inch though it still has speech marks at each end: ['LAND ROVER', '"100'""']

However, I now have a problem that some of the models have a comma in their description - e.g. "R1, STREETFIGHTER" incorrectly gets split up into '"R1' and ' STREETFIGHTER"' (I haven't checked if csv.reader does the same). REGXXXX,YAMAHA,"R1, STREETFIGHTER",2002-04-02,Petrol,998,White

How can I parse the line, without splitting the text within speech mark? Or do I just need to put together my own procedure?

in_file = open(in_folder + in_file, "r", encoding="utf8")
for line in in_file:    
    row_list = line.split(",")

Solution

  • You can use the built-in csv module for this. You just need to specify the escapechar when constructing the reader as follows:

    from pathlib import Path
    import csv
    
    FILE = Path("~/Python/cars.csv").expanduser()
    
    with FILE.open(mode="r", newline="") as f:
        for row in csv.reader(f, escapechar="'"):
            print(row)
    

    Output for the content shown in the OP:

    ['REGXXXX', 'LAND ROVER', '100"', '1973-11-16', 'Diesel', '2500', 'Silver']