
np.where: "ValueError: operands could not be broadcast together with shapes (38658637,) (9456,)"


I have two dataframes with two different shapes:

  1. df_rts_1 #Shape: (38658637, 7)
  2. df_crsh_rts #Shape: (9456, 6)

I am trying to use np.where to update a column value (df_rts_1['crash']) to equal 1 based on the following conditions:

  1. df_rts_1['tmc_code'] == df_crsh_rts['tmc']
  2. df_rts_1['measurement_tstamp'] is between df_crsh_rts['Start_time'] and df_crsh_rts['Closed_time']

My code:

df_rts_1['crash'] = np.where(
    (df_rts_1['tmc_code'].values == df_crsh_rts['tmc'].values) &
    ((df_rts_1['measurement_tstamp'].values > df_crsh_rts['Start_time'].values) &
     (df_rts_1['measurement_tstamp'].values > df_crsh_rts['Closed_time'].values)),
    1, df_rts_1['crash'])

I am getting the error in the title. I am very new to Python/data science.


Solution
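
The error itself comes from NumPy broadcasting: elementwise comparison requires the two arrays to have broadcast-compatible shapes, and your columns have 38,658,637 and 9,456 elements. A minimal sketch reproducing it (the small array sizes are hypothetical stand-ins for yours):

    import numpy as np

    a = np.zeros(7)     # stands in for the (38658637,) column
    b = np.zeros(5)     # stands in for the (9456,) column
    np.equal(a, b)      # ValueError: operands could not be broadcast
                        # together with shapes (7,) (5,)

So the row-by-row comparison has to be restructured, which is what the solution below does.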

  • Assume that both of your DataFrames contain:

    1. df_rts_1:

          tmc_code  measurement_tstamp  crash
       0         1 2020-01-03 10:05:00      0
       1         1 2020-01-03 11:00:00      0
       2         1 2020-01-03 12:10:00      0
       3         2 2020-01-03 10:10:00      0
       4         3 2020-01-03 10:05:00      0
      
    2. df_crsh_rts:

          tmc          Start_time         Closed_time
       0    1 2020-01-03 10:00:00 2020-01-03 11:00:00
       1    2 2020-01-03 14:00:00 2020-01-03 15:00:00
       2    4 2020-01-03 16:00:00 2020-01-03 18:00:00
      

    To facilitate evaluation of the "between" condition, let's create the following IntervalIndex:

    interv = pd.IntervalIndex.from_arrays(df_crsh_rts.Start_time,
        df_crsh_rts.Closed_time, closed='both')
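
    For the sample df_crsh_rts above, interv holds three closed intervals, one per row; its repr looks roughly like this (exact formatting depends on your pandas version):

    IntervalIndex([[2020-01-03 10:00:00, 2020-01-03 11:00:00],
                   [2020-01-03 14:00:00, 2020-01-03 15:00:00],
                   [2020-01-03 16:00:00, 2020-01-03 18:00:00]],
                  dtype='interval[datetime64[ns], both]')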
    

    And now, assuming that we have a current row from df_rts_1, let's construct your condition:

    • the "between" condition can be expressed as interv.contains(row.measurement_tstamp),
    • the equality of tmc / tmc_code values can be expressed as df_crsh_rts.tmc.eq(row.tmc_code).

    To check how they work, save the first row from df_rts_1 as the row variable:

    row = df_rts_1.iloc[0]
    

    and execute both conditions.
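
    For reference, with the sample data, row itself contains roughly:

    tmc_code                                1
    measurement_tstamp    2020-01-03 10:05:00
    crash                                   0
    Name: 0, dtype: object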

    The first condition generates a Numpy array of bool type:

    array([ True, False, False])
    

    and the second - a Series (also of bool type):

    0     True
    1    False
    2    False
    Name: tmc, dtype: bool
    

    So, to construct the final (single) bool value - whether this row should have its crash column updated - the condition is:

    (interv.contains(row.measurement_tstamp) & df_crsh_rts.tmc.eq(row.tmc_code)).any()
    

    i.e. the logical AND of both conditions above, followed by any() - checking whether any element of this conjunction is True.

    And the last change, compared to your code:

    • Instead of where, use loc[...], where the first argument (the row selector) is the above composite condition, computed for each row of df_rts_1 using a list comprehension - determining which rows to update.
    • The second argument is just crash - the name of the column to update.
    • Save 1 in the indicated cells.

    The code to do it is:

    df_rts_1.loc[[ (interv.contains(row.measurement_tstamp) &
        df_crsh_rts.tmc.eq(row.tmc_code)).any()
        for row in df_rts_1.itertuples()], 'crash'] = 1
    

    For my sample data, the result is:

       tmc_code  measurement_tstamp  crash
    0         1 2020-01-03 10:05:00      1
    1         1 2020-01-03 11:00:00      1
    2         1 2020-01-03 12:10:00      0
    3         2 2020-01-03 10:10:00      0
    4         3 2020-01-03 10:05:00      0
    

    Edit, following questions in a comment

    Q1. Are we using the indices from both conditions to access and update the df_rts_1 dataframe?

    Actually, no. Note that:

    [ (interv.contains(row.measurement_tstamp) &
        df_crsh_rts.tmc.eq(row.tmc_code)).any() for row in df_rts_1.itertuples() ]
    

    yields a list of bools, which does not even contain the original indices. It is then used in .loc[...], so this is a case of boolean indexing. Consecutive True / False elements of this list correspond to consecutive rows of df_rts_1 and state whether a particular row is to be selected.
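
    A minimal, self-contained sketch of this kind of boolean indexing (toy data, not yours):

    import pandas as pd

    df = pd.DataFrame({'x': [10, 20, 30]})
    # The list of bools selects rows by position: here rows 0 and 2
    df.loc[[True, False, True], 'x'] = 0
    # df.x is now: 0, 20, 0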

    Q2 and Q3. What does any() do here? What does any() help us achieve?

    Look at the example for the initial row:

    • the first condition (alone) states, for consecutive rows of df_crsh_rts, whether measurement_tstamp of the current row falls between their Start_time and Closed_time,
    • the second condition (alone) states, for consecutive rows of df_crsh_rts, whether their tmc matches tmc_code of the current row.

    We require that both these conditions be met for the same row of df_crsh_rts, hence the & joining them.

    Note however that:

    • condition_1 & condition_2 yields a Series of bool type - whether consecutive rows meet both partial conditions:

      0    True
      1    False
      2    False
      Name: tmc, dtype: bool
      
    • But we don't need a Series of bools; we need a single bool, stating whether the above result contains at least one True value.

    And it is exactly this transformation (from a bool Series to a single bool) that any() performs.
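
    A quick illustration on a toy Series (not your data):

    import pandas as pd

    s = pd.Series([True, False, False])
    s.any()                            # True - at least one element is True
    pd.Series([False, False]).any()    # False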

    Edit 2

    As your data volume is huge, I came up with another, probably faster solution. The idea is to:

    • group df_rts_1 by tmc_code,
    • check only the "between" condition, against the intervals for the current tmc,
    • to select the proper intervals faster, use an auxiliary Series of intervals indexed by tmc (lookup by the index is faster).

    To do it, define the following function, to be applied to each group:

    def newCrash(grp):
        # Intervals for the current tmc_code; grp.iloc[0, 0] assumes
        # that tmc_code is the first column of df_rts_1
        wrk = intrv[intrv.index == grp.iloc[0, 0]]
        if wrk.empty:          # No intervals for this tmc - no change
            return grp.crash
        wrkInd = pd.IntervalIndex(wrk)
        # Set crash to 1 wherever the timestamp falls into any interval
        return grp.crash.mask([wrkInd.contains(ts).any()
            for ts in grp.measurement_tstamp], 1)
    

    Then create the auxiliary Series:

    intrv = pd.Series(pd.IntervalIndex.from_arrays(df_crsh_rts.Start_time,
        df_crsh_rts.Closed_time, closed='both'), index=df_crsh_rts.tmc)
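
    For the sample df_crsh_rts, intrv looks roughly like this (repr varies by pandas version):

    tmc
    1    [2020-01-03 10:00:00, 2020-01-03 11:00:00]
    2    [2020-01-03 14:00:00, 2020-01-03 15:00:00]
    4    [2020-01-03 16:00:00, 2020-01-03 18:00:00]
    dtype: interval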
    

    And finally run the update of crash column:

    df_rts_1.crash = df_rts_1.groupby('tmc_code', sort=False).\
        apply(newCrash).reset_index(level=0, drop=True)
    

    For your (very small) sample data, this solution runs slower.

    But after I increased the size of df_rts_1 to 40 rows, this solution ran a bit faster.

    If you further increase the size of df_rts_1, the difference in speed should grow (in favour of the second solution).

    Check both solutions (the original and this one) on a sample of e.g. 100,000 rows from df_rts_1 and report how long each took to execute.
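
    A minimal sketch of such a benchmark, assuming time.perf_counter for timing (the sample size and random_state are arbitrary choices):

    import time

    # Reproducible sample of 100,000 rows
    sample = df_rts_1.sample(n=100_000, random_state=0).reset_index(drop=True)

    t0 = time.perf_counter()
    sample.loc[[(interv.contains(row.measurement_tstamp) &
        df_crsh_rts.tmc.eq(row.tmc_code)).any()
        for row in sample.itertuples()], 'crash'] = 1
    print(f'solution 1: {time.perf_counter() - t0:.2f} s')

    t0 = time.perf_counter()
    sample.crash = sample.groupby('tmc_code', sort=False).\
        apply(newCrash).reset_index(level=0, drop=True)
    print(f'solution 2: {time.perf_counter() - t0:.2f} s')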