python pandas csv gdelt

Pandas creates DataFrame with first header column in its own row


I am working with the GDELT dataset and am having issues creating a pandas DataFrame using pd.DataFrame.from_csv(path_to_data, sep=","). It seems to load the data fine, except that the first header column is shifted to row 1, like so:

[Screenshot of the resulting DataFrame, not reproduced here]

The arrow indicates where Source should be. Here is a snippet of the raw data in CSV format:

Source,Actor1Type1Code,Actor1Type2Code,Actor1Geo_CountryCode,Target,Actor2Type1Code,Actor2Type2Code,Actor2Geo_CountryCode,EventCode,f0_
PRINCE,GOV,,CA,CITIZEN,CVL,,CA,051,61
MEDIA,MED,,CA,MINIST,GOV,,CA,090,39
SUPREME COURT,JUD,,CA,DOCTOR,HLH,,CA,060,31
POLICE,COP,,CA,TORONTO,,,CA,173,31
PUBLISHER,MED,,CA,BUSINESS,BUS,,CA,010,29
HOSPITAL,HLH,,CA,POLICE,COP,,CA,043,28
HOSPITAL,HLH,,CA,TORONTO,,,CA,043,26
POLICE,COP,,CA,HOSPITAL,HLH,,CA,042,26
PRIME MINISTER,GOV,,CA,GERMANY,,,FR,042,22

Thanks!

Calvin


Solution

  • Don't use from_csv; it's no longer maintained (it was deprecated in pandas 0.21 and removed in 1.0). Use read_csv instead:

    In [244]:
    
    import io
    import pandas as pd
    
    t="""Source,Actor1Type1Code,Actor1Type2Code,Actor1Geo_CountryCode,Target,Actor2Type1Code,Actor2Type2Code,Actor2Geo_CountryCode,EventCode,f0_
    PRINCE,GOV,,CA,CITIZEN,CVL,,CA,051,61
    MEDIA,MED,,CA,MINIST,GOV,,CA,090,39
    SUPREME COURT,JUD,,CA,DOCTOR,HLH,,CA,060,31
    POLICE,COP,,CA,TORONTO,,,CA,173,31
    PUBLISHER,MED,,CA,BUSINESS,BUS,,CA,010,29
    HOSPITAL,HLH,,CA,POLICE,COP,,CA,043,28
    HOSPITAL,HLH,,CA,TORONTO,,,CA,043,26
    POLICE,COP,,CA,HOSPITAL,HLH,,CA,042,26
    PRIME MINISTER,GOV,,CA,GERMANY,,,FR,042,22"""
    df = pd.read_csv(io.StringIO(t))
    df
    Out[244]:
               Source Actor1Type1Code  Actor1Type2Code Actor1Geo_CountryCode  \
    0          PRINCE             GOV              NaN                    CA   
    1           MEDIA             MED              NaN                    CA   
    2   SUPREME COURT             JUD              NaN                    CA   
    3          POLICE             COP              NaN                    CA   
    4       PUBLISHER             MED              NaN                    CA   
    5        HOSPITAL             HLH              NaN                    CA   
    6        HOSPITAL             HLH              NaN                    CA   
    7          POLICE             COP              NaN                    CA   
    8  PRIME MINISTER             GOV              NaN                    CA   
    
         Target Actor2Type1Code  Actor2Type2Code Actor2Geo_CountryCode  EventCode  \
    0   CITIZEN             CVL              NaN                    CA         51   
    1    MINIST             GOV              NaN                    CA         90   
    2    DOCTOR             HLH              NaN                    CA         60   
    3   TORONTO             NaN              NaN                    CA        173   
    4  BUSINESS             BUS              NaN                    CA         10   
    5    POLICE             COP              NaN                    CA         43   
    6   TORONTO             NaN              NaN                    CA         43   
    7  HOSPITAL             HLH              NaN                    CA         42   
    8   GERMANY             NaN              NaN                    FR         42   
    
       f0_  
    0   61  
    1   39  
    2   31  
    3   31  
    4   29  
    5   28  
    6   26  
    7   26  
    8   22  
    

    Or, on a pandas version that still has from_csv, pass index_col=None:

    df = pd.DataFrame.from_csv(io.StringIO(t), index_col=None)
    

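    so that it doesn't treat the first column as the index. from_csv defaults to index_col=0, which turns Source into the index, and pandas displays the index name on its own row below the column headers.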
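
To make the cause concrete, here is a minimal sketch that reproduces the shifted header with read_csv by passing index_col=0 (the old from_csv default), then loads the same text with read_csv's own default. The names csv_text, shifted and flat are made up for illustration, and the sample is a shortened copy of the rows in the question. The dtype argument is an optional extra, not part of the fix: it assumes you want EventCode kept as text so that codes such as 051 keep their leading zero (the output above shows them parsed as 51).

```python
import io

import pandas as pd

# Shortened copy of the sample rows from the question.
csv_text = """Source,Actor1Type1Code,Actor1Type2Code,Actor1Geo_CountryCode,Target,Actor2Type1Code,Actor2Type2Code,Actor2Geo_CountryCode,EventCode,f0_
PRINCE,GOV,,CA,CITIZEN,CVL,,CA,051,61
MEDIA,MED,,CA,MINIST,GOV,,CA,090,39
SUPREME COURT,JUD,,CA,DOCTOR,HLH,,CA,060,31
"""

# index_col=0 mimics the old from_csv default: Source becomes the index,
# so its name is displayed on a separate row beneath the column headers.
shifted = pd.read_csv(io.StringIO(csv_text), index_col=0)

# read_csv's default (index_col=None) keeps Source as a regular column.
# Assumption: EventCode should stay a string so leading zeros survive.
flat = pd.read_csv(io.StringIO(csv_text), dtype={"EventCode": str})

print(shifted.index.name)          # Source
print("Source" in shifted.columns) # False
print(flat.columns[0])             # Source
print(flat["EventCode"].tolist())  # ['051', '090', '060']
```

If the shifted frame has already been built, calling reset_index() on it should also move Source back into the regular columns.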