Tags: python · pandas · dataframe · concatenation · pvlib

How to incorporate pd.concat() into my for-loops for much faster computing?


[START OF BACKGROUND INFO]

I am conducting research on the optimization of fixed solar panels. I have real irradiance data gathered at 15-minute intervals over the course of several years.

I am running a program in python that is supposed to test different cases of tilt & azimuth of the solar panel in order to maximize irradiance from the Sun. If you're curious or don't know PV terminology:

  • you can think of tilt as the angle of "pitch" of the panel (The panel facing straight upwards would be 0 degrees, and directly facing the horizon would be 90 degrees)
  • and you can think of azimuth as the angle the solar panel "turns" ranging from 0-360 degrees (North=0/360, East=90, South=180, West=270).
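
For intuition, tilt and azimuth together define the panel's unit normal vector, and the beam irradiance a panel intercepts scales with the dot product of that normal and the sun's direction. A minimal sketch of that geometry (this helper is illustrative only, not part of my actual code):

```python
import numpy as np

def panel_normal(tilt_deg, azimuth_deg):
    """Unit normal of a panel, in (east, north, up) coordinates.

    tilt: 0 = facing straight up, 90 = facing the horizon.
    azimuth: 0 = North, 90 = East, 180 = South, 270 = West.
    """
    tilt = np.radians(tilt_deg)
    az = np.radians(azimuth_deg)
    return np.array([
        np.sin(tilt) * np.sin(az),  # east component
        np.sin(tilt) * np.cos(az),  # north component
        np.cos(tilt),               # up component
    ])

# A flat panel points straight up regardless of azimuth: ~[0, 0, 1]
print(panel_normal(0, 180))
# A 90-degree tilt facing East points due east: ~[1, 0, 0]
print(panel_normal(90, 90))
```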

Below is a simple case where I test the % Energy Gain from only a single day's readings at different azimuths only. In this simple case, I test azimuths at 45-degree intervals: 90 (East), 135 (Southeast), 180 (South), 225 (Southwest), and 270 (West) degrees. This code isn't perfect, but it works.

Code:

# For the AFTERNOON THUNDERSTORM dataset from above:
# Now, I will model the gain in energy due to transposition from GHI --> POA
# NOTE: changing only AZIMUTH ANGLE at a fixed tilt

df_hyp_gain_az = pd.DataFrame()

def calculate_poa_hyp(rawdata, solar_position, surface_tilt, surface_azimuth):
    poa = pvlib.irradiance.get_total_irradiance(
        surface_tilt=surface_tilt,
        surface_azimuth=surface_azimuth,
        dni=dirint_dni_hyp, # calculated from before
        ghi=rawdata['Solar Radiation(W/m^2)'], # this is the raw data
        dhi=calculated_dhi_hyp, # calculated from before
        dni_extra=dni_et_hyp, # calculated from before
        solar_zenith=solar_position['apparent_zenith'], # calculated from before
        solar_azimuth=solar_position['azimuth'], # calculated from before
        surface_type='grass',
        model='haydavies')
    return poa['poa_global'] # returns the total in-plane irradiance

for azimuth in range(90,271,45): # scans from east(90) to west(270)
    # NOTE: Hardcoding tilt = site latitude (alma.latitude) for all cases
    poa_irradiance_hyp_az = calculate_poa_hyp(
        rawdata=df_hyp,
        solar_position=solpos_hyp,
        surface_tilt=alma.latitude,
        surface_azimuth=azimuth)
    column_name_hyp_az = f"AZ-{azimuth}"
    df_hyp_gain_az[column_name_hyp_az] = poa_irradiance_hyp_az

# calculate the % difference from GHI
ghi_hyp = df_hyp['Solar Radiation(W/m^2)']
df_hyp_gain_az = 100 * (df_hyp_gain_az.divide(ghi_hyp, axis=0)-1)

df_hyp_gain_az.plot().get_figure().set_facecolor('white')
plt.xlabel('Hour of Day')
plt.ylabel('Hourly Transposition Gain [%]')
plt.title('Aftn. Thunderstorm - Energy Gain From Changing Surface Azimuth',size='x-large',weight='demibold');
plt.xlim('1997-07-07 06:00:00-04:00','1997-07-07 21:00:00-04:00');

Output: [Plot: Energy Gain From Changing Surface Azimuth]

I also tested both tilt and azimuth simultaneously, with tilt at 10-degree intervals and azimuth again at 45-degree intervals. To find which orientation of tilt & azimuth would return the highest amount of irradiance, I used `from scipy import integrate` successfully. The two nested for-loops and the integration for each case work great; no problems here:

Code:

# Integrating over both change in AZIMUTH and TILT

df_hyp_gain_both = pd.DataFrame() # this tests % Energy Gain

for tilt in range(0,91,10): 
    for azimuth in range(0,316,45):
        poa_irradiance_hyp_both = calculate_poa_hyp(
            rawdata=df_hyp,
            solar_position=solpos_hyp,
            surface_tilt=tilt,
            surface_azimuth=azimuth)
        column_name_hyp_both = f"AZ={azimuth}|FT={tilt}"
        df_hyp_gain_both[column_name_hyp_both] = poa_irradiance_hyp_both

df_hyp_gain_both = 100 * (df_hyp_gain_both.divide(ghi_hyp, axis=0)-1)

df_hyp_gain_both_sec = df_hyp_gain_both.copy() # copy so the original index is untouched
df_hyp_gain_both_sec.index = df_hyp_gain_both_sec.index.astype(np.int64)//10**9
df_hyp_gain_both_sec = df_hyp_gain_both_sec.fillna(0)
df_hyp_integral = df_hyp_gain_both_sec.iloc[:,1:].apply(lambda x: integrate.trapz(x,dx=900))

df_hyp_poa = pd.DataFrame() # this tests the raw irradiance readings
for tilt in range(0,91,10):
    for azimuth in range(0,316,45):
        poa_hyp = calculate_poa_hyp(df_hyp,solpos_hyp,tilt,azimuth)
        column_name_poa_hyp = f"AZ={azimuth}|FT={tilt}"
        df_hyp_poa[column_name_poa_hyp] = poa_hyp

df_hyp_poa_sec = df_hyp_poa.copy() # copy so the original index is untouched
df_hyp_poa_sec.index = df_hyp_poa_sec.index.astype(np.int64)//10**9
df_hyp_poa_sec = df_hyp_poa_sec.fillna(0)
df_hyp_poa_integral = df_hyp_poa_sec.iloc[:,1:].apply(lambda y: integrate.trapz(y,dx=900))
print('Integrating the flat POA Irradiance [W/m^2]:')
display(df_hyp_poa_integral.sort_values(ascending=False))
print('--------------------------------------------\nIntegrating the Energy Gain [%]:')
display(df_hyp_integral.sort_values(ascending=False))
print('--------------------------------------------\nAzimuth facing East (90 degrees) and Fixed Tilt between 30-50 degrees will maximize the energy produced from a solar panel.')

Output:

Integrating the flat POA Irradiance [W/m^2]:
AZ=90|FT=40     2.075620e+07
AZ=90|FT=50     2.055457e+07
AZ=90|FT=30     2.048603e+07
AZ=90|FT=60     1.988685e+07
AZ=90|FT=20     1.975235e+07
                    ...     
AZ=270|FT=80    5.838635e+06
AZ=315|FT=80    5.648596e+06
AZ=225|FT=90    5.409111e+06
AZ=270|FT=90    5.291405e+06
AZ=315|FT=90    5.225395e+06
Length: 79, dtype: float64
--------------------------------------------
Integrating the Energy Gain [%]:
AZ=90|FT=50     1.218214e+06
AZ=90|FT=40     1.206950e+06
AZ=90|FT=60     1.107202e+06
AZ=90|FT=30     1.074007e+06
AZ=45|FT=40     9.597149e+05
                    ...     
AZ=315|FT=80   -2.598201e+06
AZ=180|FT=90   -2.720371e+06
AZ=225|FT=90   -2.777403e+06
AZ=270|FT=90   -2.790663e+06
AZ=315|FT=90   -2.802186e+06
Length: 79, dtype: float64
--------------------------------------------
Azimuth facing East (90 degrees) and Fixed Tilt between 30-50 degrees will maximize the energy produced from a solar panel.

The above codes took less than 10 seconds to execute properly, as they were for a single day's worth of data.
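
As an aside, the trapezoidal integration with `dx=900` treats each column as evenly spaced 15-minute (900-second) power samples in W/m^2, so the integral comes out in J/m^2. A self-contained sketch of that step with made-up values (note that `integrate.trapz` was renamed `integrate.trapezoid` in newer SciPy):

```python
import numpy as np
from scipy import integrate

# Hypothetical POA power readings at 15-minute (900 s) spacing, in W/m^2
poa = np.array([0.0, 250.0, 500.0, 250.0, 0.0])

# Trapezoidal rule with dx=900 s gives energy in J/m^2:
# 900 * (125 + 375 + 375 + 125) = 900000.0
energy = integrate.trapezoid(poa, dx=900)
print(energy)
```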

[END OF BACKGROUND INFO]

Now, onto my problem: because I only analyzed tilt angles at 10-degree intervals and azimuth angles at 45-degree intervals, I wanted more accurate results. So I lowered both the tilt & azimuth intervals to analyze every 1 degree.

Code (only difference from previous code is changing the range() parameters):

df_hyp_gain_both = pd.DataFrame() # this tests % Energy Gain

for tilt in range(0,91,1): 
    for azimuth in range(0,360,1):
        poa_irradiance_hyp_both = calculate_poa_hyp(
            rawdata=df_hyp,
            solar_position=solpos_hyp,
            surface_tilt=tilt,
            surface_azimuth=azimuth)
        column_name_hyp_both = f"AZ={azimuth}|FT={tilt}"
        df_hyp_gain_both[column_name_hyp_both] = poa_irradiance_hyp_both

df_hyp_gain_both = 100 * (df_hyp_gain_both.divide(ghi_hyp, axis=0)-1)

df_hyp_gain_both_sec = df_hyp_gain_both.copy() # copy so the original index is untouched
df_hyp_gain_both_sec.index = df_hyp_gain_both_sec.index.astype(np.int64)//10**9
df_hyp_gain_both_sec = df_hyp_gain_both_sec.fillna(0)
df_hyp_integral = df_hyp_gain_both_sec.iloc[:,1:].apply(lambda x: integrate.trapz(x,dx=900))

df_hyp_poa = pd.DataFrame() # this tests the raw irradiance readings
for tilt in range(0,91,1):
    for azimuth in range(0,360,1):
        poa_hyp = calculate_poa_hyp(df_hyp,solpos_hyp,tilt,azimuth)
        column_name_poa_hyp = f"AZ={azimuth}|FT={tilt}"
        df_hyp_poa[column_name_poa_hyp] = poa_hyp

df_hyp_poa_sec = df_hyp_poa.copy() # copy so the original index is untouched
df_hyp_poa_sec.index = df_hyp_poa_sec.index.astype(np.int64)//10**9
df_hyp_poa_sec = df_hyp_poa_sec.fillna(0)
df_hyp_poa_integral = df_hyp_poa_sec.iloc[:,1:].apply(lambda y: integrate.trapz(y,dx=900))
print('Integrating the flat POA Irradiance [W/m^2]:')
display(df_hyp_poa_integral.sort_values(ascending=False))
print('--------------------------------------------\nIntegrating the Energy Gain [%]:')
display(df_hyp_integral.sort_values(ascending=False))
print('--------------------------------------------\nAzimuth facing East (90 degrees) and Fixed Tilt between 30-50 degrees will maximize the energy produced from a solar panel.')

Output:

C:\Users\jmand\AppData\Local\Temp\ipykernel_10576\4287560089.py:13: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  df_hyp_gain_both[column_name_hyp_both] = poa_irradiance_hyp_both

This warning repeats over and over for a very long time with no results. I've tried searching for a solution on this website (found PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance and many similar ones). I know I need to use pd.concat() in some form or fashion for my df_hyp_gain_both DataFrame, but I'm having trouble even setting it up. I need to somehow use column_name_hyp_both AND poa_irradiance_hyp_both.

With correct syntax, how would I incorporate pd.concat() into my for-loop in order to avoid this PerformanceWarning: DataFrame is highly fragmented warning?


Solution

  • After removing all the solar stuff, the problem essentially boils down to this:

    import pandas as pd
    
    df = pd.DataFrame()
    dummy_data = pd.Series(0, index=pd.date_range('2019-01-01', freq='h', periods=8760))
    
    for i in range(200):  # 200 just as an example
        df[f'col_{i}'] = dummy_data.copy()
    

    An alternative that is one or two orders of magnitude faster, on my computer anyway, is to accumulate the columns into a dictionary and only convert to a DataFrame after the loop:

    results = {}
    for i in range(200):
        results[f'col_{i}'] = dummy_data.copy()
    
    df = pd.DataFrame(results)
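
    Concretely, the question's nested loop can keep its structure; only the assignment target changes from a DataFrame column to a dict entry. Here is a sketch under the same column-naming scheme, with a dummy stand-in for calculate_poa_hyp since the pvlib inputs aren't reproduced here:

```python
import numpy as np
import pandas as pd

index = pd.date_range('1997-07-07 06:00', periods=5, freq='15min')

def calculate_poa_dummy(tilt, azimuth):
    # stand-in for calculate_poa_hyp(); returns a Series like poa['poa_global']
    return pd.Series(np.full(len(index), float(tilt + azimuth)), index=index)

results = {}
for tilt in range(0, 91, 10):
    for azimuth in range(0, 360, 45):
        # accumulate in a dict instead of inserting a column each iteration
        results[f"AZ={azimuth}|FT={tilt}"] = calculate_poa_dummy(tilt, azimuth)

# one allocation at the end instead of one insert per column -> no PerformanceWarning
df_gain = pd.DataFrame(results)
print(df_gain.shape)  # (5, 80): 10 tilts x 8 azimuths
```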
    

    A similar approach is useful when building a DataFrame by rows (instead of columns like above): rather than appending rows onto a DataFrame, which forces unnecessary reallocations and memory copies, it is better to accumulate row info in a list and convert to a DataFrame all at once. For example, consider these three ways of building a DataFrame from chunks of rows:

    empty_df = pd.DataFrame({i: [0]*10 for i in range(20)})
    
    def pandas_concat(N):
        df = empty_df
        for i in range(1, N):
            df = pd.concat([df, empty_df])
        return df
    
    def pandas_append(N):
        # note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0
        df = empty_df
        for i in range(1, N):
            df = df.append(empty_df)
        return df
    
    def list_append(N):
        lis = []
        for i in range(N):
            lis.append(empty_df)
        df = pd.concat(lis)
        return df
    

    Here's how the timings compare as a function of N (dots indicate timings, lines are estimated asymptotic behavior). So there's a clear speed up by accumulating things in normal python data structures and only building the final DataFrame once at the end.

    [Plot: wall-clock time vs. N for the three functions; the list-accumulate approach scales best]
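
    To reproduce a comparison like this yourself, a rough timing harness along these lines works (sizes kept small so it runs quickly; `pandas_append` is omitted here because `DataFrame.append` no longer exists in pandas 2.x):

```python
import time
import pandas as pd

empty_df = pd.DataFrame({i: [0] * 10 for i in range(20)})

def pandas_concat(N):
    df = empty_df
    for _ in range(1, N):
        df = pd.concat([df, empty_df])  # re-copies the growing frame every iteration
    return df

def list_append(N):
    lis = [empty_df for _ in range(N)]
    return pd.concat(lis)  # single concatenation at the end

def timed(fn, N):
    t0 = time.perf_counter()
    fn(N)
    return time.perf_counter() - t0

N = 300
print(f"pandas_concat: {timed(pandas_concat, N):.3f}s")
print(f"list_append:   {timed(list_append, N):.3f}s")
```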