Search code examples
pythonpandasdouble-quotes

pandas data with double quote


I am trying to read a large dataset in .csv format which is update automatically using the pandas library. The problem is that in my data, the first row is a string without double quotation marks, and the other colums are strings with double quotation marks. It is not possible for me to adjust the .csv file manually.

A simplified dataset would look like this

  1. A,"B","C","D"
  2. comp_a,"tree","house","door"
  3. comp_b,"truck","red","blue"

I need the data to be stored as separate columns without the quotation marks like this:

  1. A B C D
  2. comp_a tree house door
  3. comp_b truck red blue

I tried using

import pandas as pd
df_csv = pd.read(path_to_file,delimiter=',')

which gives me the complete header as a single variable for the last column

  1. A,"B","C","D"
  2. comp_a "tree" "house" "door"
  3. comp_b "truck" "red" "blue"

The closest result to the one i need was by using the following

df_csv = pd.read(path_to_file,delimiter=',',quoting=3)

which correctly recognizes each column, but adds in a bunch of extra double quotes.

  1. "A ""B"" ""C"" ""D"""
  2. "comp_a ""tree"" ""house"" ""door"""
  3. "comp_b ""truck"" ""red"" ""blue"""

Setting quoting to a value from 0 to 2 just reads an entire row as a single column.

Does anyone know how I can remove all quotation marks when reading the .csv file?


Solution

  • Just load the data with pd.read_csv() and then use .replace('"','', regex=True)

    In one line it would be:

    df = pd.read_csv(filename, sep=',').replace('"','', regex=True)
    

    To set the columns names:

    df.columns = df.iloc[0]
    

    And drop row 0:

    df = df.drop(index=0).reset_index(drop=True)