Search code examples
pythonpandasextract

Extract multiple date/time values from text field into new variable columns


I have dataframe - see below. This is just a snippet of the full dateframe, there are more text and date/times in each respective rows/IDS. As you can see the text before and after each date/time is random.

ID        RESULT
1         Patients Discharged Home :   12/07/2022 11:19 Bob Melciv   Appt 12/07/2022 12:19 Medicaid...
2         Stawword Geraldio -    12/17/2022 11:00 Bob Melciv   Appt 12/10/2022 12:09 Risk Factors...

I would like to pull all date/times where the format is MM/DD/YYYY HH:MM from the RESULT column and make each of those respective date/times into their own column.

ID    DATE_TIME_1              DATE_TIME_2        DATE_TIME_3 .....
1     12/07/2022 11:19         12/07/2022 12:19
2     12/17/2022 11:00         12/10/2022 12:09

Solution

  • From @David542's regex, you can use str.extractall:

    pattern = r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2})'
    out = pd.concat([df['ID'],
                     df['RESULT'].str.extractall(pattern).squeeze()
                                 .unstack().rename(columns=lambda x: f'DATE_TIME_{x+1}')
                                 .rename_axis(columns=None)], axis=1)
    print(out)
    
    # Output
       ID       DATE_TIME_1       DATE_TIME_2
    0   1  12/07/2022 11:19  12/07/2022 12:19
    1   2  12/17/2022 11:00  12/10/2022 12:09
    

    A slightly modified version to convert extracted date/time to pd.DatetimeIndex:

    pattern = r'(\d{2}/\d{2}/\d{4} \d{2}:\d{2})'
    out = pd.concat([df['ID'],
                     df['RESULT'].str.extractall(pattern).squeeze().apply(pd.to_datetime)
                                 .unstack().rename(columns=lambda x: f'DATE_TIME_{x+1}')
                                 .rename_axis(columns=None)], axis=1)
    print(out)
    
    # Output
       ID         DATE_TIME_1         DATE_TIME_2
    0   1 2022-12-07 11:19:00 2022-12-07 12:19:00
    1   2 2022-12-17 11:00:00 2022-12-10 12:09:00
    

    Step by step:

    # 1. Date extraction (and squeeze DataFrame with 1 column to Series)
    >>> out = df['RESULT'].str.extractall(pattern)
       match
    0  0        12/07/2022 11:19
       1        12/07/2022 12:19
    1  0        12/17/2022 11:00
       1        12/10/2022 12:09
    Name: 0, dtype: object
    
    # 2. Move second index level as column (and add the prefix DATE_TIME_N)
    >>> out = out.unstack().rename(columns=lambda x: f'DATE_TIME_{x+1}')
    match       DATE_TIME_1       DATE_TIME_2
    0      12/07/2022 11:19  12/07/2022 12:19
    1      12/17/2022 11:00  12/10/2022 12:09
    
    # 3. Remove the 'match' title on column axis
    >>> out = out.rename_axis(columns=None)
                DATE_TIME_1       DATE_TIME_2
    0      12/07/2022 11:19  12/07/2022 12:19
    1      12/17/2022 11:00  12/10/2022 12:09
    

    Finally concatenate original ID with this new dataframe along column axis.