I am converting a text file into netCDF format. I read the data from the text file into a pandas DataFrame, in which two of the columns are latitude_GPS and longitude_GPS. The input looks like:
latitude_GPS, longitude_GPS
7537.4536, 3558.4985
7672.1534, 3214.9532
They are measured in ddmm.ssss units, which means that in a value like 7537.4536, '75' is degrees, '37' is minutes and '4536' is seconds. I want to convert them into decimal degrees, except for the missing values, which are coded as 999.0.
My current code looks like this:
header_rows = 1
df = pd.read_csv(args.input_file, delim_whitespace=True, skiprows=header_rows, skip_blank_lines=True, names = column_names)
num_rows = sum(1 for line in open(args.input_file) if len(line.strip()) != 0) - header_rows
def lat_lon_gps(col_index):
    # Floor division keeps the degrees term a whole number on Python 3 as well
    return ((int(col_index)//100) + round((int(col_index%100))/60, 4) + round(round(col_index%1, 4)/3600, 4))
check_na = 999.0
i = 0
while i < num_rows:
    if df['latitude_GPS'][i] != check_na:
        df['latitude_GPS'][i] = lat_lon_gps(df['latitude_GPS'][i])
    if df['longitude_GPS'][i] != check_na:
        df['longitude_GPS'][i] = lat_lon_gps(df['longitude_GPS'][i])
    i += 1
The return part calculates (75 + 37/60 + 4536/3600). The above code returns what I want, but this part takes around 50 minutes to run for a file with 10000 rows. Is there a faster way to do it? Any thoughts would be appreciated.
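The problem is that you are iterating over every row in a Python loop and assigning one element at a time. You should take advantage of the vectorisation provided by pandas and numpy, which applies the arithmetic to a whole column at once.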
For example:
import numpy as np
import pandas as pd
df = pd.read_csv(args.input_file,
                 names=['latitude_GPS', 'longitude_GPS'],
                 skiprows=1)
check_na = 999.0
def lat_lon_gps(coords):
    deg = np.floor(coords / 100)
    minutes = np.floor(((coords / 100) - deg) * 100)
    seconds = (((coords / 100) - deg) * 100 - minutes) * 100
    return deg + minutes / 60 + seconds / 3600
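# Convert only the valid entries; values equal to check_na are left unchanged
lat_ok = df.latitude_GPS != check_na
lon_ok = df.longitude_GPS != check_na
df.loc[lat_ok, 'latitude_GPS'] = lat_lon_gps(df.loc[lat_ok, 'latitude_GPS'])
df.loc[lon_ok, 'longitude_GPS'] = lat_lon_gps(df.loc[lon_ok, 'longitude_GPS'])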
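
If the rows with missing values have to stay in the frame, as the question asks, one option is to convert the whole column and then restore the 999.0 sentinel with a boolean mask. The following is only a minimal, self-contained sketch under a few assumptions that are not in the original code: the fractional part is read the same way as in lat_lon_gps above (.4536 becomes 45.36 seconds), all coordinates are non-negative, and the value is split with the modulo operator rather than by repeated division, which should be less sensitive to floating-point rounding when the minutes are a whole number. The names ddmm_to_decimal and convert_column are illustrative, and the small DataFrame stands in for the file.

```python
import numpy as np
import pandas as pd

CHECK_NA = 999.0  # sentinel for missing values, as in the question


def ddmm_to_decimal(coords):
    """Convert non-negative ddmm.ssss values to decimal degrees (vectorised)."""
    deg = np.floor(coords / 100)      # 7537.4536 -> 75
    minutes = np.floor(coords % 100)  # 7537.4536 -> 37
    seconds = (coords % 1) * 100      # 7537.4536 -> about 45.36 (assumed scaling)
    return deg + minutes / 60 + seconds / 3600


def convert_column(series, na_value=CHECK_NA):
    """Convert valid entries and put the sentinel back everywhere else."""
    valid = series != na_value
    # Series.where keeps the converted value where `valid` is True
    # and substitutes na_value where it is False.
    return ddmm_to_decimal(series).where(valid, na_value)


# Stand-in data; in practice this would come from pd.read_csv
df = pd.DataFrame({
    "latitude_GPS": [7537.4536, 7672.1534, 999.0],
    "longitude_GPS": [3558.4985, 999.0, 3214.9532],
})

for col in ("latitude_GPS", "longitude_GPS"):
    df[col] = convert_column(df[col])

print(df)
```

Because every operation works on a whole column, the run time should no longer depend on a Python-level loop over the rows, and each column is masked independently, so a missing latitude does not prevent the longitude on the same row from being converted.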