I have latitude and longitude data in a dataframe with the following format:
Longitude Latitude
055.25.30E 21.19.15S
075.26.27W 40.39.08N
085.02.00W 29.44.00N
I run the code below, based on clean_lat_long:
from dataprep.clean import clean_lat_long

# Combine both columns into a single "lat long" string column
dfa['lat_long'] = dfa['Latitude'] + ' ' + dfa['Longitude']

# Parse the combined column; split=True returns separate latitude
# and longitude columns, and a cleaning report is printed
clean_lat_long(dfa, "lat_long", split=True)
The results are very poor, with only 0.09% of my data cleaned:
Latitude and Longitude Cleaning Report:
13 values cleaned (0.09%)
15169 values unable to be parsed (99.91%), set to NaN
Result contains 13 (0.09%) values in the correct format and 15169 null values (99.91%)
How can I improve these results?
I obtained much better results by replacing the first period (.) between degrees and minutes with a space, using the following instruction (note regex=False, so the period is treated as a literal character rather than as a regex wildcard; a self-contained sketch follows the transformed table below):

dfa['lat_long'] = dfa['Latitude'].str.replace('.', ' ', n=1, regex=False) + ' ' + dfa['Longitude'].str.replace('.', ' ', n=1, regex=False)
Which transformed the dataset into:
Longitude Latitude
055 25.30E 21 19.15S
075 26.27W 40 39.08N
085 02.00W 29 44.00N
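
For reference, here is a minimal, self-contained sketch of the full pipeline on the three sample rows. The DataFrame construction and the result variable name are my own additions for illustration; the clean_lat_long call itself is exactly the one used above:

import pandas as pd
from dataprep.clean import clean_lat_long

# Reconstruction of the sample data shown above
dfa = pd.DataFrame({
    "Longitude": ["055.25.30E", "075.26.27W", "085.02.00W"],
    "Latitude": ["21.19.15S", "40.39.08N", "29.44.00N"],
})

# Replace only the first '.' (literal, not a regex wildcard) so that
# degrees and minutes become space-separated
dfa["lat_long"] = (
    dfa["Latitude"].str.replace(".", " ", n=1, regex=False)
    + " "
    + dfa["Longitude"].str.replace(".", " ", n=1, regex=False)
)

# split=True returns separate latitude/longitude columns
# (decimal degrees by default)
result = clean_lat_long(dfa, "lat_long", split=True)
print(result)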
The results are indeed much better, which demonstrates that clean_lat_long is not magic: the data must be prepared upstream for the tool to work:
Latitude and Longitude Cleaning Report:
15159 values cleaned (99.85%)
23 values unable to be parsed (0.15%), set to NaN
Result contains 15159 (99.85%) values in the correct format and 23 null values (0.15%)
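
One caveat worth noting: blindly replacing the first period would corrupt any values that are already in decimal degrees (e.g. '21.5N' would become '21 5N'). A more defensive variant, which is my own suggestion rather than part of the workflow above, only splits values that actually contain a second period in the DD.MM.SS shape:

# Split "DDD.MM.SS<hemisphere>" into "DDD MM.SS<hemisphere>", but only when a
# second period follows, so already-clean values like "21.5N" stay untouched
pattern = r"^(\d{1,3})\.(?=\d{1,2}\.)"
for col in ("Latitude", "Longitude"):
    dfa[col] = dfa[col].str.replace(pattern, r"\1 ", regex=True)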