I'm working on the data below, and I would like to fill the NaN values in Begin and End with dates taken from the Subscription Period column.
All the columns are strings.
The Subscription Period column uses several formats. For values like 05/03/2020 to 04/03/2021, I use:
# extract the begin and end dates when both appear in Subscription Period
# creates 3 new columns
df_period = df['Subscription Period'] \
.str.extractall(r'(?P<Period>(?P<Begin>(0[1-9]|[12][0-9]|3[01])[/](0[1-9]|1[012])[/](19|20)?\d\d).+(?P<End>(0[1-9]|[12][0-9]|3[01])[/](0[1-9]|1[012])[/](19|20)?\d\d))')
df['Period'] = df_period['Period'].unstack()
df['Begin'] = df_period['Begin'].unstack()
df['End'] = df_period['End'].unstack()
For the Subscription Period value Subscription Hospital Sept-Dec 2018, I would like to extract Sept as 01/09/2018 in Begin and Dec as 31/12/2018 in End.
Yearly Subscription Hospital (effective 17/04/2019)
Yearly Subscription Hospital (effective 01 octobre 2018)
For these two, I would like to put the date in Begin and the date one year later in End.
I tried this:
mask = df['Subscription Period'].str.contains(r'(\d{2}/\d{2}/\d{2,4})[)]?$')
df.loc[mask, 'Begin'] = df['Subscription Period'].str.contains(r'(\d{2}/\d{2}/\d{2,4})[)]?$')
df.loc[(df['Begin'].isnull()) , 'Period']= 'B'
Here is the data:
import numpy as np
import pandas as pd

data = {'Date': {0: '2020-05-05',
                 1: '2018-09-12',
                 2: '2020-04-22',
                 3: '2020-01-01',
                 4: '2019-04-17',
                 5: '2018-09-07',
                 6: '2018-11-20',
                 7: '2018-11-28'},
        'Subscription Period': {0: 'Subscription Hospital : from 01/05/2020 to 30/04/2021',
                                1: 'Subscription Hospital Sept-Dec 2018',
                                2: 'Yearly Subscription Hospital from 05/03/2020 to 04/03/2021',
                                3: 'Subscription Hospital from 01/01/2020 to 31/12/2020',
                                4: 'Yearly Subscription Hospital (effective 17/04/2019)',
                                5: 'Yearly Subscription Hospital (effective 01 octobre 2018)',
                                6: 'Subscription : Hospital',
                                7: 'Yearly Subscription Hospital'},
        'Period': {0: '01/05/2020 to 30/04/2021',
                   1: np.nan,
                   2: '05/03/2020 to 04/03/2021',
                   3: '01/01/2020 to 31/12/2020',
                   4: np.nan,
                   5: np.nan,
                   6: np.nan,
                   7: np.nan},
        'Begin': {0: '01/05/2020',
                  1: np.nan,
                  2: '05/03/2020',
                  3: '01/01/2020',
                  4: np.nan,
                  5: np.nan,
                  6: np.nan,
                  7: np.nan},
        'End': {0: '30/04/2021',
                1: np.nan,
                2: '04/03/2021',
                3: '31/12/2020',
                4: np.nan,
                5: np.nan,
                6: np.nan,
                7: np.nan}}
df = pd.DataFrame.from_dict(data)
Thank you for the help and any tips.
Regarding your mask example: if you're using str.extract or str.extractall, there's no need to index with a mask, since the resulting dataframe is already indexed. Instead, you can use concat to join on the index and combine_first to fill in values only where Begin is null:
begin2 = df['Subscription Period'].str.extract(r'(\d{2}/\d{2}/\d{2,4})[)]?$').rename({0:'Begin2'}, axis=1)
df = pd.concat([df, begin2], axis=1)
df.Begin = df.Begin.combine_first(df.Begin2)
df = df.drop('Begin2', axis=1)
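To see why no mask is needed, here is a minimal sketch of combine_first on two toy Series. The values and the names begin, extracted and filled are made up for illustration; the point is only that values already present on the left are kept and nulls are filled from the right, aligned on the index.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df.Begin and the freshly extracted column
begin = pd.Series(['01/05/2020', np.nan, np.nan])
extracted = pd.Series(['99/99/9999', '17/04/2019', np.nan])

# Existing values in `begin` win; only its nulls are taken from `extracted`
filled = begin.combine_first(extracted)
print(filled.tolist())
```

Row 0 keeps its original value, row 1 is filled from the extracted column, and row 2 stays null because neither side has a value.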
Hopefully you can take it from here? Otherwise you might have to clarify where exactly you're having trouble.
And by the way, those regexes are pretty hairy. I'd suggest defining a custom function and using df.apply.
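As a rough sketch of what such a function could look like, the code below handles the formats shown in the question. Everything in it is an assumption for illustration: the names MONTHS, parse_period, fmt and one_year_later are hypothetical, the month table only covers English abbreviations plus the one French name in the sample, years are assumed to have four digits, and "one year later" is taken to mean the same day in the following year.

```python
import calendar
import re
from datetime import date

import pandas as pd

# Assumption: English abbreviations plus the one French month name in the
# sample data; extend this table for other spellings.
MONTHS = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'sept': 9, 'oct': 10, 'nov': 11,
          'dec': 12, 'octobre': 10}

NUMERIC = re.compile(r'(\d{2})/(\d{2})/(\d{4})')           # 17/04/2019
WORDED = re.compile(r'(\d{1,2})\s+([^\W\d_]+)\s+(\d{4})')  # 01 octobre 2018
RANGE = re.compile(r'([^\W\d_]+)-([^\W\d_]+)\s+(\d{4})')   # Sept-Dec 2018


def fmt(d):
    """Format a date as DD/MM/YYYY, matching the existing columns."""
    return d.strftime('%d/%m/%Y')


def one_year_later(d):
    # Assumption: "one more year" means the same day in the following year.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:  # 29 February has no counterpart in a non-leap year
        return d.replace(year=d.year + 1, day=28)


def parse_period(text):
    """Return a Series with Begin and End strings, or None when nothing matches."""
    begin = end = None
    numeric = NUMERIC.findall(text)
    worded = WORDED.search(text)
    month_range = RANGE.search(text)
    if numeric:
        # One or two DD/MM/YYYY dates; a single date gets an End one year later
        dates = [date(int(y), int(m), int(d)) for d, m, y in numeric]
        begin = dates[0]
        end = dates[1] if len(dates) > 1 else one_year_later(begin)
    elif worded and worded.group(2).lower() in MONTHS:
        day, name, year = worded.groups()
        begin = date(int(year), MONTHS[name.lower()], int(day))
        end = one_year_later(begin)
    elif month_range:
        first, last, year = month_range.groups()
        m1, m2 = MONTHS.get(first.lower()), MONTHS.get(last.lower())
        if m1 and m2:
            # First day of the first month, last day of the last month
            begin = date(int(year), m1, 1)
            end = date(int(year), m2, calendar.monthrange(int(year), m2)[1])
    return pd.Series({'Begin': fmt(begin) if begin else None,
                      'End': fmt(end) if end else None})


df = pd.DataFrame({'Subscription Period': [
    'Subscription Hospital : from 01/05/2020 to 30/04/2021',
    'Subscription Hospital Sept-Dec 2018',
    'Yearly Subscription Hospital (effective 17/04/2019)',
    'Yearly Subscription Hospital (effective 01 octobre 2018)',
    'Subscription : Hospital',
]})
df = df.join(df['Subscription Period'].apply(parse_period))
print(df)
```

On a dataframe that already has partially filled Begin and End columns, the parsed columns could be merged in with combine_first, as in the snippet above, rather than joined as new columns. Note that an impossible date such as 31/02/2020 would raise a ValueError here; whether to catch that depends on how clean the data is.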