python regex pandas time fillna

In pandas, how to fill NaN with a pattern extracted from another column?


I'm working with the data below and would like to fill the NaN values in Begin and End with dates taken from the Subscription Period column. All the columns are strings.

I have several formats:

  • for 05/03/2020 to 04/03/2021, I use:
    # extract the period when both begin and end dates are in Subscription Period
    # and create 3 new columns
    df_period = df['Subscription Period'] \
        .str.extractall(r'(?P<Period>(?P<Begin>(0[1-9]|[12][0-9]|3[01])[/](0[1-9]|1[012])[/](19|20)?\d\d).+(?P<End>(0[1-9]|[12][0-9]|3[01])[/](0[1-9]|1[012])[/](19|20)?\d\d))')
    df['Period'] = df_period['Period'].unstack()
    df['Begin'] = df_period['Begin'].unstack()
    df['End'] = df_period['End'].unstack()
  • for the other formats in Subscription Period:

Subscription Hospital Sept-Dec 2018: I would like to extract Sept as 01/09/2018 in Begin and Dec as 31/12/2018 in End.
Yearly Subscription Hospital (effective 17/04/2019)
Yearly Subscription Hospital (effective 01 octobre 2018)
For these two, I would like to put the date in Begin and the date one year later in End.

I tried these solutions:

  • with mask()
mask = df['Subscription Period'].str.contains(r'(\d{2}/\d{2}/\d{2,4})[)]?$')
df.loc[mask, 'Begin'] = df['Subscription Period'].str.contains(r'(\d{2}/\d{2}/\d{2,4})[)]?$')
  • with loc(): this works for the constant 'B' but not for a regex with extract.
df.loc[(df['Begin'].isnull()) , 'Period']= 'B'

Here is the data:

import numpy as np
import pandas as pd

data = {'Date': {0: '2020-05-05',
  1: '2018-09-12',
  2: '2020-04-22',
  3: '2020-01-01',
  4: '2019-04-17',
  5: '2018-09-07',
  6: '2018-11-20',
  7: '2018-11-28'},
 'Subscription Period': {0: 'Subscription Hospital : from 01/05/2020 to 30/04/2021',
  1: 'Subscription Hospital Sept-Dec 2018',
  2: 'Yearly Subscription Hospital from 05/03/2020 to 04/03/2021',
  3: 'Subscription Hospital from 01/01/2020 to 31/12/2020',
  4: 'Yearly Subscription Hospital (effective 17/04/2019)',
  5: 'Yearly Subscription Hospital (effective 01 octobre 2018)',
  6: 'Subscription : Hospital',
  7: 'Yearly Subscription Hospital'},
 'Period': {0: '01/05/2020 to 30/04/2021',
  1: np.nan,
  2: '05/03/2020 to 04/03/2021',
  3: '01/01/2020 to 31/12/2020',
  4: np.nan,
  5: np.nan,
  6: np.nan,
  7: np.nan},
 'Begin': {0: '01/05/2020',
  1: np.nan,
  2: '05/03/2020',
  3: '01/01/2020',
  4: np.nan,
  5: np.nan,
  6: np.nan,
  7: np.nan},
 'End': {0: '30/04/2021',
  1: np.nan,
  2: '04/03/2021',
  3: '31/12/2020',
  4: np.nan,
  5: np.nan,
  6: np.nan,
  7: np.nan}}

df = pd.DataFrame.from_dict(data)

Thank you for the help and any tips.


Solution

  • Regarding your mask example: if you use str.extract, there's no need to index with a mask, because the resulting dataframe is already aligned on the original index (str.extractall, by contrast, returns a MultiIndex with an extra match level). Instead, you can use concat to join on the index and combine_first to fill only the rows where Begin is null:

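    # extract a trailing dd/mm/yy(yy) date, optionally followed by ")"
    begin2 = df['Subscription Period'].str.extract(r'(\d{2}/\d{2}/\d{2,4})[)]?$').rename({0:'Begin2'}, axis=1)
    df = pd.concat([df, begin2], axis=1)
    # keep existing Begin values, fill the nulls from Begin2
    df.Begin = df.Begin.combine_first(df.Begin2)
    df = df.drop('Begin2', axis=1)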
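
The same idea can be carried over to End. The snippet below is only a sketch on a two-row toy frame copied from the question's data. It assumes that "one more year" means the same calendar day one year later and that every year has four digits (a two-digit year would not parse with `%Y`). Passing `expand=False` makes `str.extract` return a Series, so the concat step can be skipped.

```python
import pandas as pd

# Toy frame with two rows copied from the question's data.
df = pd.DataFrame({
    'Subscription Period': [
        'Subscription Hospital from 01/01/2020 to 31/12/2020',
        'Yearly Subscription Hospital (effective 17/04/2019)',
    ],
    'Begin': ['01/01/2020', None],
    'End': ['31/12/2020', None],
})

# expand=False returns a Series aligned on df's index.
begin2 = df['Subscription Period'].str.extract(
    r'(\d{2}/\d{2}/\d{2,4})[)]?$', expand=False)

# Existing Begin values win; only the nulls are filled from begin2.
df['Begin'] = df['Begin'].combine_first(begin2)

# Assumption: End is the same day one year after Begin.
begin_dt = pd.to_datetime(df['Begin'], format='%d/%m/%Y')
end2 = (begin_dt + pd.DateOffset(years=1)).dt.strftime('%d/%m/%Y')

# Existing End values win; only the nulls are filled from end2.
df['End'] = df['End'].combine_first(end2)
```

Note that the regex also matches the trailing 31/12/2020 in the first row, which is why combine_first (rather than a plain assignment) matters here.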
    

    Hopefully you can take it from here? Otherwise you might have to clarify where exactly you're having trouble.

    And by the way, those regexes are pretty hairy. I'd suggest defining a custom function and using df.apply instead.
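
One possible shape for such a custom function is sketched below. Everything in it is an assumption rather than a tested solution: `parse_period`, `month_number` and the `MONTHS` lookup are hypothetical names, months are recognized only by their first three letters (enough for "Sept", "Dec" and "octobre" in the sample, but not for every French month name), and "effective" rows get an End exactly one year after Begin.

```python
import re
import pandas as pd

# Hypothetical lookup keyed on the first three letters of a month name.
MONTHS = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
DATE = r'\d{2}/\d{2}/\d{4}'
FMT = '%d/%m/%Y'

def month_number(name):
    """Return 1-12 for a month name, or None if it is not recognized."""
    return MONTHS.get(name[:3].lower())

def parse_period(text):
    """Return (begin, end) as dd/mm/yyyy strings, or (None, None)."""
    # 1) explicit range, e.g. "from 05/03/2020 to 04/03/2021"
    m = re.search(rf'({DATE}).+?({DATE})', text)
    if m:
        return m.group(1), m.group(2)
    # 2) month range within one year, e.g. "Sept-Dec 2018"
    m = re.search(r'([A-Za-z]+)-([A-Za-z]+) (\d{4})', text)
    if m:
        first, last = month_number(m.group(1)), month_number(m.group(2))
        if first and last:
            year = int(m.group(3))
            begin = pd.Timestamp(year=year, month=first, day=1)
            # MonthEnd(0) rolls the first of the month forward to its last day.
            end = pd.Timestamp(year=year, month=last, day=1) + pd.offsets.MonthEnd(0)
            return begin.strftime(FMT), end.strftime(FMT)
    # 3) single start date, e.g. "effective 17/04/2019" or "effective 01 octobre 2018"
    m = re.search(r'effective (\d{2})[/ ](\w+)[/ ](\d{4})', text)
    if m:
        raw = m.group(2)
        month = int(raw) if raw.isdigit() else month_number(raw)
        if month:
            begin = pd.Timestamp(year=int(m.group(3)), month=month, day=int(m.group(1)))
            end = begin + pd.DateOffset(years=1)  # assumption: same day, next year
            return begin.strftime(FMT), end.strftime(FMT)
    return None, None

# Rows copied from the question's data.
df = pd.DataFrame({
    'Subscription Period': [
        'Yearly Subscription Hospital from 05/03/2020 to 04/03/2021',
        'Subscription Hospital Sept-Dec 2018',
        'Yearly Subscription Hospital (effective 17/04/2019)',
        'Yearly Subscription Hospital (effective 01 octobre 2018)',
        'Subscription : Hospital',
    ],
    'Begin': ['05/03/2020', None, None, None, None],
    'End': ['04/03/2021', None, None, None, None],
})

# Apply row by row, then fill only the missing Begin/End values.
parsed = pd.DataFrame(df['Subscription Period'].apply(parse_period).tolist(),
                      index=df.index, columns=['Begin', 'End'])
df['Begin'] = df['Begin'].combine_first(parsed['Begin'])
df['End'] = df['End'].combine_first(parsed['End'])
```

Keeping each format in its own branch makes it easier to add or adjust a pattern than extending one large regex, and rows that match nothing simply stay null.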