python pandas string dataframe substring

Copying substring that starts with "C" and ends with all the values after "D" in one dataframe and putting the substring into a new dataframe

The above photo is data stored in df_input.

I would like extract the "C#D#" part from the 'Visit' column and place it into the column of a new dataframe I created (df_output['VISIT']).

Additionally, there could be up to two numeric values that follow after the "D".

I'm not sure if I am supposed to use '.str.extract' and how I would capture all the numeric values that follow right after the "D"

The output I would like to get is:

C1D1

" "

C1D1

Please note df_input[Visit] does not only have "C1D1". It has variations of the C#D# structure so it could be "C1D12" or "C2D9".

Solution

You can use a simple regex to recognize your pattern and then you can apply a function to dataframe to apply the recognizer to the whole column:

import pandas as pd
import re

def extract(year):
    matches = re.findall('C\dD\d{1,2}', year)
    if matches:
        return matches[0]  # Assuming you only want to retrieve the first occurrence


df_input = pd.DataFrame(data=['C1D1-Pre', 'C1D12-2hr Post', 'test'], columns=['VISIT'])

df_output = pd.DataFrame()
df_output['VISIT'] = df_input['VISIT'].apply(lambda x: extract(x))
print(df_output)

The output will be:

   VISIT
0   C1D1
1  C1D12
2   None

If you want empty string instead of None, you have to edit the extract function:

def extract(year):
    matches = re.findall('C\dD\d{1,2}', year)
    if matches:
        return matches[0]
    return ""