Tags: python, pandas, dataframe, numpy, design-patterns

How do I find a continuous pattern in a dataframe and apply the count to a new column?


I have a large dataframe, around 5000 rows. I need to find out how many times in a row a pattern occurs in a column and add a new column with the count. I am able to use np.where to flag the pattern as 1, but I don't know how to count consecutive occurrences of the pattern and write the count to a new column. I searched online and tried to use a loop, but I can't figure out how to use a loop with a dataframe.

df['P'] = np.where((df['val2'] > df['val1']) & (df['val2'] > df['val1'].shift(1)), 1, 0)

         Date  val1  val2  P     [new column]
0  2015-02-24   294    68  0         0
1  2015-02-25   155    31  0         0
2  2015-02-26   168   290  1         1            pattern occurs 1 time
3  2015-02-27   273   108  0         0
4  2015-02-28    55     9  0         0
5  2015-03-01   273   123  0         0
6  2015-03-02   200    46  0         0
7  2015-03-03    80    83  0         0
8  2015-03-04   181   208  1         1            pattern occurs 1 time
9  2015-03-05   195    41  0         0
10 2015-03-06    50   261  1         1            pattern occurs 1 time
11 2015-03-07    50   177  0         0
12 2015-03-08   215    60  1         0
13 2015-03-09    13   290  1         2            pattern occurs 2 times
14 2015-03-10   208    41  0         0
15 2015-03-11    49   263  1         0
16 2015-03-12   171   244  1         0
17 2015-03-13   218   266  1         0
18 2015-03-14   188   219  1         4            pattern occurs 4 times
19 2015-03-15   232   171  0         0
20 2015-03-16   116   196  0         0
21 2015-03-17   262   102  0         0
22 2015-03-18   263   159  0         0
23 2015-03-19   227   160  0         0
24 2015-03-20   103   236  1         0
25 2015-03-21    55   104  1         0
26 2015-03-22    97   109  1         0
27 2015-03-23    38   118  1         4            pattern occurs 4 times
28 2015-03-24   163   116  0         0
29 2015-03-25   256    16  0         0
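
For anyone who wants to run the snippets on this page, here is a minimal sketch that rebuilds the first rows of the frame; the values are copied from the table above, and only pandas/numpy are assumed:

import numpy as np
import pandas as pd

# First rows of the sample data, copied from the table above
df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-02-24', '2015-02-25', '2015-02-26',
                            '2015-02-27', '2015-02-28']),
    'val1': [294, 155, 168, 273, 55],
    'val2': [68, 31, 290, 108, 9],
})

# Flag rows where val2 exceeds both the current and the previous val1
df['P'] = np.where((df['val2'] > df['val1']) & (df['val2'] > df['val1'].shift(1)), 1, 0)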

Solution

  • You can use:

    df['new_column'] = (df.P != df.P.shift()).cumsum()  # give every run of equal P values its own id
    mask = df.groupby('new_column')['P'].sum()  # total P per run, i.e. the length of each run of 1s

    duplicated = df.duplicated('new_column', keep='last')
    df.loc[~duplicated, ['new_column']] = np.nan  # set NaN on the last row of each run; the NaNs will be filled from mask

    df['new_column'] = df['new_column'].astype(str).replace(r'\d+', 0, regex=True).replace('nan', np.nan)  # zero out everything that is not NaN
    mask.index = df[df['new_column'].isnull()].index.to_list()
    # To fill NaN values from a Series, the indexes must match, so the index of
    # mask (the run ids) is swapped for the row labels of the NaN rows in df.

    df['new_column'] = df['new_column'].fillna(mask).astype(int)
    df
    '''
              Date  val1  val2  P  new_column
    0   2015-02-24   294    68  0           0
    1   2015-02-25   155    31  0           0
    2   2015-02-26   168   290  1           1
    3   2015-02-27   273   108  0           0
    4   2015-02-28    55     9  0           0
    5   2015-03-01   273   123  0           0
    6   2015-03-02   200    46  0           0
    7   2015-03-03    80    83  0           0
    8   2015-03-04   181   208  1           1
    9   2015-03-05   195    41  0           0
    10  2015-03-06    50   261  1           1
    11  2015-03-07    50   177  0           0
    12  2015-03-08   215    60  1           0
    13  2015-03-09    13   290  1           2
    14  2015-03-10   208    41  0           0
    15  2015-03-11    49   263  1           0
    16  2015-03-12   171   244  1           0
    17  2015-03-13   218   266  1           0
    18  2015-03-14   188   219  1           4
    19  2015-03-15   232   171  0           0
    20  2015-03-16   116   196  0           0
    21  2015-03-17   262   102  0           0
    22  2015-03-18   263   159  0           0
    23  2015-03-19   227   160  0           0
    24  2015-03-20   103   236  1           0
    25  2015-03-21    55   104  1           0
    26  2015-03-22    97   109  1           0
    27  2015-03-23    38   118  1           4
    28  2015-03-24   163   116  0           0
    29  2015-03-25   256    16  0           0
    
    '''
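
For comparison, the same result can be reached more directly with groupby/transform. This is a sketch of an alternative, not part of the original answer; alt_column is an illustrative name:

    # Give every run of equal P values its own id
    runs = (df['P'] != df['P'].shift()).cumsum()

    # Length of the run each row belongs to (the sum of P equals the run length for runs of 1s)
    run_len = df.groupby(runs)['P'].transform('sum')

    # A run ends where the next row belongs to a different run
    is_run_end = runs != runs.shift(-1)

    # Write the run length on the last row of each run of 1s, 0 everywhere else
    df['alt_column'] = np.where(df['P'].eq(1) & is_run_end, run_len, 0)

This avoids the string round-trip and the index surgery: the run length is computed once per row with transform, and a boolean mask picks out the row where each run ends.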