Search code examples
pythonparsingpandas

How can I remove extra whitespace from strings when parsing a csv file in Pandas?


I have the following file named 'data.csv':

1997,Ford,E350
1997, Ford , E350
1997,Ford,E350,"Super, luxurious truck"
1997,Ford,E350,"Super ""luxurious"" truck"
1997,Ford,E350," Super luxurious truck "
"1997",Ford,E350
1997,Ford,E350
2000,Mercury,Cougar

And I would like to parse it into a pandas DataFrame so that the DataFrame looks as follows:

   Year     Make   Model              Description
0  1997     Ford    E350                     None
1  1997     Ford    E350                     None
2  1997     Ford    E350   Super, luxurious truck
3  1997     Ford    E350  Super "luxurious" truck
4  1997     Ford    E350    Super luxurious truck
5  1997     Ford    E350                     None
6  1997     Ford    E350                     None
7  2000  Mercury  Cougar                     None

The best I could do was:

pd.read_table("data.csv", sep=r',', names=["Year", "Make", "Model", "Description"])

Which gets me:

   Year     Make   Model              Description
0  1997     Ford    E350                     None
1  1997    Ford     E350                     None
2  1997     Ford    E350   Super, luxurious truck
3  1997     Ford    E350  Super "luxurious" truck
4  1997     Ford    E350   Super luxurious truck 
5  1997     Ford    E350                     None
6  1997     Ford    E350                     None
7  2000  Mercury  Cougar                     None

How can I get the DataFrame without those whitespaces?


Solution

  • You could use converters:

    import pandas as pd
    
    def strip(text):
        try:
            return text.strip()
        except AttributeError:
            return text
    
    def make_int(text):
        return int(text.strip('" '))
    
    table = pd.read_table("data.csv", sep=r',',
                          names=["Year", "Make", "Model", "Description"],
                          converters = {'Description' : strip,
                                        'Model' : strip,
                                        'Make' : strip,
                                        'Year' : make_int})
    print(table)
    

    yields

       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997     Ford    E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350    Super luxurious truck
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None