I downloaded one of the country datasets from GeoNames and used this line to parse the dataset into columns:
data = pd.read_csv("C:/Users/Documents/TR.txt", sep="\t", header=None)
But for some reason this doesn't parse all of the rows correctly. Most of the rows parse correctly, but about 2K do not. I used this line to see that the parsing was wrong:
data.to_csv("C:/Users/Documents/output.csv")
I then opened output.csv in Excel and saw that some of the rows are not parsed. But when I open the original TR.txt dataset in Excel and use the tab delimiter, all the rows show up correctly parsed. So I am doing something wrong in my Python code, but I can't figure out what. Am I writing the dataset out wrong? Thank you
Always read the readme.txt file.
In this particular case, there are two noteworthy issues going on.
The elevation (column 15) is expected to be an int, but contains blanks. If you specify int as the datatype, this will generate an error because ints have no NaN representation. The workaround is to cast it as a float. If you really want an int, then create a sentinel value for missing fields (e.g. -99999), fillna() with this value, and then cast to int.
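A minimal sketch of that sentinel workaround, using a three-column stand-in for the GeoNames layout (the column names and the -99999 value are arbitrary choices for illustration):

```python
import io
import pandas as pd

# Two tab-separated sample rows; the elevation field is blank in the second.
sample = "1\tAnkara\t938\n2\tIstanbul\t\n"
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                 names=["geonameid", "name", "elevation"])

# Blank fields come in as NaN, which forces the column to float.
# Replace them with a sentinel, then cast to int.
df["elevation"] = df["elevation"].fillna(-99999).astype(int)
print(df["elevation"].tolist())  # [938, -99999]
```

The same pattern applies to any other numeric column that may contain blanks, such as population.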
Some columns contain comma-separated lists (e.g. column 3, alternatenames). When you used data.to_csv("C:/Users/Documents/output.csv"), the default comma separator collided with those embedded commas and destroyed the tab-delimited parsing. You need to specify sep='\t' on the write as well.
dtypes_dict = {
0: int, # geonameid
1: str, # name
2: str, # asciiname
3: str, # alternatenames
4: float, # latitude
5: float, # longitude
6: str, # feature class
7: str, # feature code
8: str, # country code
9: str, # cc2
10: str, # admin1 code
11: str, # admin2 code
12: str, # admin3 code
13: str, # admin4 code
14: int, # population
15: float, # elevation
16: int, # dem (digital elevation model)
17: str, # timezone
18: str # modification date yyyy-MM-dd
}
data = pd.read_csv("TR.txt", sep="\t", header=None, dtype=dtypes_dict)
data.to_csv('output.txt', sep='\t')
I didn't parse the final column as a date (it is read as a plain string) because it is probably not relevant here.
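If you do want that column as a real datetime, read_csv can parse it at load time. A hedged sketch with a two-row stand-in (the column names here are illustrative, not part of the GeoNames schema):

```python
import io
import pandas as pd

sample = "100\tplace\t2023-05-01\n101\tother\t2023-06-15\n"
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                 names=["geonameid", "name", "moddate"],
                 parse_dates=["moddate"])
print(df["moddate"].dtype)  # datetime64[ns]
```

Note that parse_dates and a str dtype for the same column conflict, so drop that column from dtypes_dict if you go this route.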