Search code examples
pythonpandasdataframeperformancelatitude-longitude

How to speed up for loops in dataframe


I want to convert a data frame to the format I want by scanning each latitude and longitude in the for loop, but this process takes too long. Is there a way to make the following script faster, such as using multi threads or processing? Can you show me how?

p=0
for i in tqdm(df_wind_monthly["lat"]):
    for j in df_wind_monthly["lon"]:
        print("lat: " + str(i) + " lon: " + str(j))
        for k in range(1948,2017): 
            rslt_df_wind = df_wind_monthly.loc[(df_wind_monthly['lat'] == i) \
                                          & (df_wind_monthly['lon'] == j) \
                                              & (df_wind_monthly['year'] == k)]; rslt_df_wind = rslt_df_wind.reset_index()
            month_columns.loc[p,"lat"]=rslt_df_wind.loc[0,"lat"]
            month_columns.loc[p,"lon"]=rslt_df_wind.loc[0,"lon"]
            month_columns.loc[p,"years"]=rslt_df_wind.loc[0,"year"]
            month_columns.loc[p,"wind_January"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="January"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_February"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="February"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_March"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="March"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_April"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="April"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_May"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="May"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_June"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="June"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_July"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="July"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_August"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="August"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_September"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="September"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_October"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="October"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_November"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="November"].index.tolist()[0],"wind"]
            month_columns.loc[p,"wind_December"]=rslt_df_wind.loc[rslt_df_wind['month'].loc[lambda x: x=="December"].index.tolist()[0],"wind"]
            p+=1

These are the inputs and expected output samples:

            #df_wind_monthly (INPUT):
                         time     lat     lon      wind
              0       1948-01-16  15.125  15.125  6.509021
              1       1948-01-16  15.125  15.375  6.485108
              2       1948-01-16  15.125  15.625  6.472615
              3       1948-01-16  15.125  15.875  6.472596
              4       1948-01-16  15.125  16.125  6.486597

             #Expected dataframe columns (OUTPUT):
             month_columns=pd.DataFrame(columns=['lat',"lon","years","wind_January","wind_February","wind_March","wind_April","wind_May","wind_June","wind_July","wind_August","wind_September","wind_October","wind_November","wind_December"])

Solution

  • Looping is definetly NOT the way to go. If you type for ... in while using pandas DataFrame, you're almost always doing it wrong.

    What you want is to switch your data from long format (1 row = 1 observation) to wide format (1 row = 12 observations). It is a fairly common usecase, so pandas provides a method to do just that: DataFrame.pivot.

    Starting from your input dataframe, you need to:

    • Add a year and month column from the time column
    • Select only the years you want (1948-2016)
    • Optionaly, if there is more than one measurement by month, compute the average, using groupby and mean
    • Switch from wide to long format with pivot.
    • Rename the columns

    Input

    >>> df_wind_monthly
    
                 time     lat     lon   wind
    0      1948-01-31  15.125  15.125  5.963
    1      1948-01-31  15.125  15.375  6.404
    2      1948-01-31  15.125  15.625  6.207
    3      1948-01-31  15.125  15.875  6.972
    4      1948-01-31  15.125  16.125  6.299
    ...           ...     ...     ...    ...
    86395  2019-12-31  17.375  16.375  6.514
    86396  2019-12-31  17.375  16.625  6.593
    86397  2019-12-31  17.375  16.875  6.438
    86398  2019-12-31  17.375  17.125  6.394
    86399  2019-12-31  17.375  17.375  6.232
    

    Processing

    out_df = df_wind_monthly
    
    # Create a DateTime index from the date column to easily extract year / month
    date_index = pd.DatetimeIndex(out_df["time"])
    
    # Create a year and month columns to select the years and perform the groupby
    out_df = out_df.assign(year=date_index.year, month=date_index.month)
    
    # Select the years you want
    out_df = out_df[(out_df["year"] >= 1948) & (out_df["year"] < 2017)]
    
    # If there are multiple measurement for a given year / month, compute the average
    # Skip this if you know there is only one measurement per month
    out_df = out_df.groupby(["lat", "lon", "year", "month"]).mean().reset_index()
    
    # Switch from long to wide format
    out_df = out_df.pivot(index=["lat", "lon", "year"], columns="month", values="wind")
    
    # Rename the columns
    out_df = out_df.rename(
        columns={
            1: "wind_January",
            2: "wind_February",
            3: "wind_March",
            4: "wind_April",
            5: "wind_May",
            6: "wind_June",
            7: "wind_July",
            8: "wind_August",
            9: "wind_September",
            10: "wind_October",
            11: "wind_November",
            12: "wind_December",
        }
    )
    
    # Reset Index if you prefer to have the data in columns
    out_df = out_df.reset_index()
    
    

    Output:

    >>> out_df
    
    month     lat     lon  year  wind_January  ...  wind_September  wind_October  wind_November  wind_December      
    0      15.125  15.125  1948         5.963  ...           6.885         6.814          6.131          6.063      
    1      15.125  15.125  1949         6.304  ...           6.178         6.536          6.426          6.090      
    2      15.125  15.125  1950         6.207  ...           6.890         6.719          6.875          5.925      
    3      15.125  15.125  1951         6.100  ...           6.153         6.905          6.034          6.470      
    4      15.125  15.125  1952         5.951  ...           6.638         6.294          6.434          5.936      
    ...       ...     ...   ...           ...  ...             ...           ...            ...            ...      
    6895   17.375  17.375  2012         6.674  ...           6.841         6.383          6.685          6.616      
    6896   17.375  17.375  2013         6.674  ...           6.469         6.940          6.842          6.154      
    6897   17.375  17.375  2014         6.794  ...           6.251         6.267          6.258          5.942      
    6898   17.375  17.375  2015         6.760  ...           6.933         6.848          6.765          6.446      
    6899   17.375  17.375  2016         6.253  ...           6.986         6.490          6.421          6.338