
Pushing CSV to PostgreSQL via Python (special characters)


I am attempting something conceptually very simple, but it is proving difficult in detail. I want to pull in generic CSV files (with fields that may contain special characters) and push them to PostgreSQL via the Python PyGreSQL library.

The trouble is that fields containing special characters such as the apostrophe (') end up wrapped in double quotation marks, as with T'Challa below:

INSERT INTO tester VALUES ('08/09/2018 11:13', 'EMP944274823', "T'Challa", '0', "T'Challa ", 'CPY703661946', 'Prostetnic Vogon Jeltz', 'BUY', 'BK18266', 'P514605140', 'Tele Pyschic Helmets', '74', '475', '35150')

However the below works fine when done manually:

INSERT INTO tester VALUES ('08/09/2018 11:13', 'EMP944274823', 'T''Challa', '0', 'T''Challa ', 'CPY703661946', 'Prostetnic Vogon Jeltz', 'BUY', 'BK18266', 'P514605140', 'Tele Pyschic Helmets', '74', '475', '35150')

PostgreSQL requires string literals to be wrapped in single quotes, with a doubled apostrophe ('') standing for a literal apostrophe inside the string.

Ie "T'Challa" in the above needs to be converted to 'T''Challa'.

As it stands, it doesn't recognise the format.
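
To make the rule concrete, here is a minimal sketch of that conversion (the helper name sql_quote is made up for illustration, not part of PyGreSQL):

    def sql_quote(value):
        # Double any embedded apostrophes, then wrap the whole value in
        # single quotes, per the SQL string-literal rule described above.
        return "'" + str(value).replace("'", "''") + "'"

    print(sql_quote("T'Challa"))  # -> 'T''Challa'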

Any help would be appreciated!

I have tried search-and-replace on the quotation marks (no amount of searching within the strings seems to even find them, so I am not sure how to go about it). Among other things I tried the below:

query = query.replace('"', "''")   # str.replace returns a new string,
query = query.replace('"', "\'")   # so the result has to be assigned back
query = query.replace('"', "'")

etc.

I have also played with the csv import options quotechar and quoting.

I have also attempted a couple of mapping steps:

    for row in reader:
        row_new = map(lambda x: x.replace("'", "''"), row)
        row_new2 = map(lambda y: y.replace('"', "'"), row_new)
        query = 'INSERT INTO %s VALUES %r' % (table_name, tuple(row_new2))

This gets me halfway there, in that the doubled apostrophes are in place, but the second map doesn't seem to change anything:

INSERT INTO tester VALUES ('10/09/2018 10:10', 'EMP944274823', "T''Challa", '0', "T''Challa ",...
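
For what it's worth, the second map finds nothing to replace because the double quotes are never in the row data at all: they are introduced afterwards by the %r formatting, since Python's repr picks double-quote delimiters for strings that contain an apostrophe. A quick demonstration of that behaviour:

    >>> print('INSERT INTO tester VALUES %r' % (("T''Challa", '0'),))
    INSERT INTO tester VALUES ("T''Challa", '0')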

I have also tried importing into a dataframe via Pandas but the result was the same.

As mentioned, I am using the PyGreSQL library to push the individual rows of the CSVs. It runs fine until it hits the offending lines shown above.


import csv
from pgdb import connect

dbxl = connect(dbname=user_db, user="postgres", password="MY_SECRET", host="XXX.XX.XX.XXX", port=6667)
cursor = dbxl.cursor()

with open('/home/postgres/PYSTARTERENV/example.csv', 'rt', encoding="utf8") as csvfile:
    reader = csv.reader(csvfile, quotechar="'")

    for row in reader:
        query = 'INSERT INTO %s VALUES %r' % (table_name, tuple(row))
        cursor.execute(query)

The error spat back is below.

Traceback (most recent call last):
  File "iteration_test.py", line 57, in <module>
    cursor.execute(query)
  File "/home/postgres/PYSTARTERENV/lib/python3.6/site-packages/pgdb.py", line 1037, in execute
    return self.executemany(operation, [parameters])
  File "/home/postgres/PYSTARTERENV/lib/python3.6/site-packages/pgdb.py", line 1062, in executemany
    rows = self._src.execute(sql)
pg.ProgrammingError: ERROR:  column "T'Challa" does not exist
LINE 1: ...ester VALUES ('08/09/2018 08:56', 'EMP944274823', "T'Challa"...
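
As an aside, the manual quoting could be sidestepped entirely by letting the driver bind the values: pgdb implements DB-API 2.0 parameter binding (its paramstyle is pyformat, and PyGreSQL's docs show positional %s placeholders passed with a sequence). A minimal sketch of that route, reusing the reader loop above:

    for row in reader:
        # One %s placeholder per column; pgdb quotes and escapes each value.
        placeholders = ', '.join(['%s'] * len(row))
        query = 'INSERT INTO {} VALUES ({})'.format(table_name, placeholders)
        cursor.execute(query, tuple(row))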


Solution

  • You can import the CSV directly into the table using a query and Python. Assuming you have established a connection between the DB and the py script, the following should work (see the note after these commands for a client-side alternative).

    cursor.execute("COPY table_name FROM ‘/path_to_csv_file.csv’ WITH FORMAT csv")
    db.commit()
    

    or

    cursor.execute("COPY table_name FROM '/path_to_csv_file.csv' DELIMITERS ',' CSV")
    db.commit()
    

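Note that server-side COPY ... FROM 'file' makes the PostgreSQL server process read the file, so the path must exist on the database server machine and the command typically needs superuser rights. If the CSV lives on the client machine instead, newer PyGreSQL versions expose a copy_from method on pgdb cursors; a sketch under that assumption (PyGreSQL ≥ 5.0, same connection objects as above):

    with open('/path_to_csv_file.csv', 'rt', encoding='utf8') as f:
        # Stream the file from the client; the server needs no file access.
        cursor.copy_from(f, 'table_name', format='csv')
    db.commit()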