I am working with the GDELT dataset and am having issues creating a pandas DataFrame using
pd.DataFrame.from_csv(path_to_data, sep=",")
It seems to load the data fine, except that the header of the first column is shifted down to row 1, like so:
The arrow indicates where Source should be. Here is a snippet of the raw data in CSV format:
Source,Actor1Type1Code,Actor1Type2Code,Actor1Geo_CountryCode,Target,Actor2Type1Code,Actor2Type2Code,Actor2Geo_CountryCode,EventCode,f0_
PRINCE,GOV,,CA,CITIZEN,CVL,,CA,051,61
MEDIA,MED,,CA,MINIST,GOV,,CA,090,39
SUPREME COURT,JUD,,CA,DOCTOR,HLH,,CA,060,31
POLICE,COP,,CA,TORONTO,,,CA,173,31
PUBLISHER,MED,,CA,BUSINESS,BUS,,CA,010,29
HOSPITAL,HLH,,CA,POLICE,COP,,CA,043,28
HOSPITAL,HLH,,CA,TORONTO,,,CA,043,26
POLICE,COP,,CA,HOSPITAL,HLH,,CA,042,26
PRIME MINISTER,GOV,,CA,GERMANY,,,FR,042,22
Thanks!
Calvin
Don't use from_csv; it's no longer maintained. Use read_csv instead:
In [244]:
import io
import pandas as pd

t = """Source,Actor1Type1Code,Actor1Type2Code,Actor1Geo_CountryCode,Target,Actor2Type1Code,Actor2Type2Code,Actor2Geo_CountryCode,EventCode,f0_
PRINCE,GOV,,CA,CITIZEN,CVL,,CA,051,61
MEDIA,MED,,CA,MINIST,GOV,,CA,090,39
SUPREME COURT,JUD,,CA,DOCTOR,HLH,,CA,060,31
POLICE,COP,,CA,TORONTO,,,CA,173,31
PUBLISHER,MED,,CA,BUSINESS,BUS,,CA,010,29
HOSPITAL,HLH,,CA,POLICE,COP,,CA,043,28
HOSPITAL,HLH,,CA,TORONTO,,,CA,043,26
POLICE,COP,,CA,HOSPITAL,HLH,,CA,042,26
PRIME MINISTER,GOV,,CA,GERMANY,,,FR,042,22"""
df = pd.read_csv(io.StringIO(t))
df
Out[244]:
           Source  Actor1Type1Code  Actor1Type2Code  Actor1Geo_CountryCode  \
0          PRINCE              GOV              NaN                     CA
1           MEDIA              MED              NaN                     CA
2   SUPREME COURT              JUD              NaN                     CA
3          POLICE              COP              NaN                     CA
4       PUBLISHER              MED              NaN                     CA
5        HOSPITAL              HLH              NaN                     CA
6        HOSPITAL              HLH              NaN                     CA
7          POLICE              COP              NaN                     CA
8  PRIME MINISTER              GOV              NaN                     CA

     Target  Actor2Type1Code  Actor2Type2Code  Actor2Geo_CountryCode  EventCode  \
0   CITIZEN              CVL              NaN                     CA         51
1    MINIST              GOV              NaN                     CA         90
2    DOCTOR              HLH              NaN                     CA         60
3   TORONTO              NaN              NaN                     CA        173
4  BUSINESS              BUS              NaN                     CA         10
5    POLICE              COP              NaN                     CA         43
6   TORONTO              NaN              NaN                     CA         43
7  HOSPITAL              HLH              NaN                     CA         42
8   GERMANY              NaN              NaN                     FR         42

   f0_
0   61
1   39
2   31
3   31
4   29
5   28
6   26
7   26
8   22
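One side effect is visible in the output above: EventCode values such as 051 are parsed as integers, so the leading zeros are dropped. If the codes should be kept as text (an assumption about how you intend to use them), a minimal sketch is to pass a dtype for that column. The variable name csv_text is just for illustration, and only the first two data rows from the question are used:

```python
import io

import pandas as pd

# Header plus the first two data rows from the question.
csv_text = """Source,Actor1Type1Code,Actor1Type2Code,Actor1Geo_CountryCode,Target,Actor2Type1Code,Actor2Type2Code,Actor2Geo_CountryCode,EventCode,f0_
PRINCE,GOV,,CA,CITIZEN,CVL,,CA,051,61
MEDIA,MED,,CA,MINIST,GOV,,CA,090,39"""

# Read EventCode as strings so that "051" is not converted to the integer 51.
df = pd.read_csv(io.StringIO(csv_text), dtype={"EventCode": str})

print(df["EventCode"].tolist())  # ['051', '090']
```

The other columns are still inferred as usual, so f0_ remains an integer column.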
Or pass index_col=None so that from_csv doesn't interpret the first column as the index (its index_col defaults to 0):
df = pd.DataFrame.from_csv(io.StringIO(t), index_col=None)
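To see why the header looked shifted, the difference can be reproduced with read_csv alone, which is useful because from_csv was removed in pandas 1.0. The sketch below uses the first two data rows from the question; the names csv_text, with_index, and default are only illustrative. Passing index_col=0 mimics the old from_csv default: Source becomes the name of the index rather than a column, and pandas prints an index name on its own line below the column headers.

```python
import io

import pandas as pd

# Header plus the first two data rows from the question.
csv_text = """Source,Actor1Type1Code,Actor1Type2Code,Actor1Geo_CountryCode,Target,Actor2Type1Code,Actor2Type2Code,Actor2Geo_CountryCode,EventCode,f0_
PRINCE,GOV,,CA,CITIZEN,CVL,,CA,051,61
MEDIA,MED,,CA,MINIST,GOV,,CA,090,39"""

# index_col=0 reproduces what from_csv did by default:
# the first column is consumed as the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# read_csv's own default (index_col=None) keeps Source as a regular column
# and generates a 0..n-1 integer index.
default = pd.read_csv(io.StringIO(csv_text))

print(with_index.index.name)            # Source
print("Source" in with_index.columns)   # False
print(default.columns[0])               # Source
```

If you already have a frame loaded with Source as the index, calling reset_index() on it moves Source back into the columns.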