python arrays pandas for-loop nested-loops

Fill in a 2D array using a conditional

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'year': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    'TC_number': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    'maximum_wind_speed': [20.37199783, 21.2, 21.7, 14.626, 18.108, 21.4, 25.3, 25.3, 22.9, 18.108, 20.2, 22.1, 24.3, 25.5, 27.7, 29.8, 33.6, 36.7, 36.6, 35, 33, 29.7, 29, 20]})

Hi All,

I've tried to find solutions by searching online, but none seem to be what I am looking for.

I know what I want to do, but I am getting stuck on how to implement the code.

I first initialize a (1000, 240) array. I then want to write a loop that fills in each row of the array. Each row represents the recorded maximum wind speed values of a single Tropical Cyclone (TC), and 240 is the maximum number of values that a TC could have. However, each TC will have a different number of recorded maximum wind speed values. I want the loop to jump to the next row when the current TC number does not equal the previous TC number.

This is what I have so far:

output_array = np.full((1000, 240), np.nan)

#Shape of vmaxsyn is (337079,)

for i in range(1000):
    #print("i = ", i)
    for j in range(241):
        #print("j = ", j)
        name_id1 = df.iloc[j]['TC_number']
        name_id2 = df.iloc[j-1]['TC_number']
        
        if name_id1 == name_id2:
            output_array[i, j] = vmaxsyn[j]
            #print(output_array[j,i])
            #print([i,j])
        else: 
            #print("breaking out of inner loop")
            break 
#print("breaking out of outer loop.")

I was expecting something like this:

data = [
[20.372, 21.2, 21.7, 14.62, np.nan, np.nan],
[18.108, 21.4, 25.3, 25.3, 22.9, np.nan],
[18.108, 20.2, 22.1, 24.3, np.nan, np.nan],
[25.5, 27.7, 29.8, 33.6, np.nan, np.nan],
[36.7, 36.6, 35, np.nan, np.nan, np.nan],
[33, 29.7, 29, 20, np.nan, np.nan]]

The problem is that none of the vmaxsyn values are being recorded in my output array. I am also trying to deal with a broadcast error in my other approach. Any help is greatly appreciated. I'm specifically trying to accomplish this with pandas.

Solution

You don't need a for loop here at all. First, add an id column to your data that increments whenever TC_number changes. Then group the data by this new id and use groupby(...).apply(list) to collect each group's wind speeds into a list.

data['tc_id'] = data['TC_number'].ne(data['TC_number'].shift()).cumsum()-1

array = data.groupby('tc_id')['maximum_wind_speed'].apply(list)
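To see why the id expression increments only when TC_number changes, it may help to break it apart. The sketch below uses a small made-up Series (the values and the names `tc`, `changed`, and `steps` are illustrative only, not from the question) and prints each intermediate step.

```python
import pandas as pd

# Hypothetical toy column, used only to show the intermediate steps.
tc = pd.Series([0, 0, 1, 1, 1, 0, 0], name="TC_number")

# True wherever the value differs from the previous row.
# The first row is compared with NaN, so it is always True.
changed = tc.ne(tc.shift())

# The running count of changes gives 1, 1, 2, ...; subtract 1 to start at 0.
tc_id = changed.cumsum() - 1

steps = pd.DataFrame({"TC_number": tc, "changed": changed, "tc_id": tc_id})
print(steps)
```

Note that the second run of TC_number 0 gets its own id (2) and is not merged with the first run, which is what makes the later groupby split on consecutive runs and not on the raw TC_number values.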

The result looks like this:

print(array)

tc_id
0    [20.37199783, 21.2, 21.7, 14.626]
1     [18.108, 21.4, 25.3, 25.3, 22.9]
2           [18.108, 20.2, 22.1, 24.3]
3             [25.5, 27.7, 29.8, 33.6]
4                   [36.7, 36.6, 35.0]
5             [33.0, 29.7, 29.0, 20.0]
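If you need the NaN-padded 2D array from the question and not a Series of lists, one possible approach is to compute each value's position within its run and assign everything in one indexing step. This is a minimal sketch, not part of the original answer: `pos` and `max_len` are names introduced here, and `max_len = 6` is an assumption chosen to match the expected output in the question (the real data would use 240). It also assumes no run is longer than `max_len`; otherwise the assignment raises an IndexError.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "TC_number": [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2,
                  0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    "maximum_wind_speed": [20.37199783, 21.2, 21.7, 14.626, 18.108, 21.4,
                           25.3, 25.3, 22.9, 18.108, 20.2, 22.1, 24.3, 25.5,
                           27.7, 29.8, 33.6, 36.7, 36.6, 35, 33, 29.7, 29, 20],
})

# Row index: one id per consecutive run of TC_number (same trick as above).
tc_id = data["TC_number"].ne(data["TC_number"].shift()).cumsum() - 1

# Column index: position of each value within its run (0, 1, 2, ...).
pos = data.groupby(tc_id).cumcount()

max_len = 6  # assumed width for this small example; the question uses 240
output_array = np.full((tc_id.max() + 1, max_len), np.nan)

# Place every wind speed at (run id, position in run); the rest stays NaN.
output_array[tc_id.to_numpy(), pos.to_numpy()] = data["maximum_wind_speed"].to_numpy()

print(output_array)
```

To keep the preallocated shape from the question, the same assignment should work with `np.full((1000, 240), np.nan)`, provided there are at most 1000 runs and no run exceeds 240 values.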