python geocoding

Python dataprep clean_lat_long low performance on my dataset


I have latitude and longitude data in a dataframe with the following format:

Longitude   Latitude
055.25.30E  21.19.15S
075.26.27W  40.39.08N
085.02.00W  29.44.00N

I run the code below, which uses clean_lat_long:

from dataprep.clean import clean_lat_long
dfa['lat_long'] = dfa['Latitude'] + ' ' + dfa['Longitude']
clean_lat_long(dfa, "lat_long", split=True)

The success rate is very low, with only 0.09% of my data cleaned:

Latitude and Longitude Cleaning Report:
    13 values cleaned (0.09%)
    15169 values unable to be parsed (99.91%), set to NaN
Result contains 13 (0.09%) values in the correct format and 15169 null values (99.91%)

How can I improve these results?


Solution

  • I obtained much better results by replacing the first period (.), the one between degrees and minutes, with a space:

    # regex=False treats '.' as a literal period rather than the regex "any character"
    dfa['lat_long'] = dfa['Latitude'].str.replace('.', ' ', n=1, regex=False) + ' ' + dfa['Longitude'].str.replace('.', ' ', n=1, regex=False)
    

    This transformed the dataset into:

    Longitude   Latitude
    055 25.30E  21 19.15S
    075 26.27W  40 39.08N
    085 02.00W  29 44.00N
    

    The results are much better. clean_lat_long is not magic, and the data has to be put into a format it recognizes before it can be parsed:

    Latitude and Longitude Cleaning Report:
        15159 values cleaned (99.85%)
        23 values unable to be parsed (0.15%), set to NaN
    Result contains 15159 (99.85%) values in the correct format and 23 null values (0.15%)
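
One caveat worth checking: after the replacement, a value such as 21 19.15S could be read as degrees and decimal minutes. If the two trailing groups in the raw data are actually minutes and seconds (an assumption, the question does not say so), then 19.15 minutes is not the same as 19 minutes 15 seconds. The sketch below is a plain-Python cross-check under that assumption. The names dms_to_decimal and DMS_PATTERN are hypothetical and are not part of dataprep.

```python
import re

# Assumed layout: degrees.minutes.seconds followed by a hemisphere letter,
# e.g. "21.19.15S" or "055.25.30E". This is an assumption about the data,
# not something stated in the question or by dataprep.
DMS_PATTERN = re.compile(r"^\s*(\d{1,3})\.(\d{2})\.(\d{2})([NSEW])\s*$")


def dms_to_decimal(value):
    """Return signed decimal degrees, or None if the value does not match."""
    if not isinstance(value, str):
        return None  # e.g. NaN or other missing values
    match = DMS_PATTERN.match(value)
    if match is None:
        return None
    degrees, minutes, seconds, hemisphere = match.groups()
    decimal = int(degrees) + int(minutes) / 60 + int(seconds) / 3600
    # South and West are negative by convention
    return -decimal if hemisphere in "SW" else decimal


if __name__ == "__main__":
    for raw in ["21.19.15S", "055.25.30E", "not a coordinate"]:
        print(raw, "->", dms_to_decimal(raw))
```

With pandas, dfa['Latitude'].map(dms_to_decimal) would apply it to a whole column. Values that come back as None do not match the assumed layout and are worth inspecting by hand, which may also help explain the handful of rows that still fail to parse.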