Search code examples
pythonpandasstringdataframesubstring

Copying substring that starts with "C" and ends with all the values after "D" in one dataframe and putting the substring into a new dataframe


df_input

The above photo is data stored in df_input.

I would like extract the "C#D#" part from the 'Visit' column and place it into the column of a new dataframe I created (df_output['VISIT']).

Additionally, there could be up to two numeric values that follow after the "D".

I'm not sure if I am supposed to use '.str.extract' and how I would capture all the numeric values that follow right after the "D"

The output I would like to get is:

C1D1

C1D1

" "

C1D1

Please note df_input[Visit] does not only have "C1D1". It has variations of the C#D# structure so it could be "C1D12" or "C2D9".


Solution

  • You can use a simple regex to recognize your pattern and then you can apply a function to dataframe to apply the recognizer to the whole column:

    import pandas as pd
    import re
    
    def extract(year):
        matches = re.findall('C\dD\d{1,2}', year)
        if matches:
            return matches[0]  # Assuming you only want to retrieve the first occurrence
    
    
    df_input = pd.DataFrame(data=['C1D1-Pre', 'C1D12-2hr Post', 'test'], columns=['VISIT'])
    
    df_output = pd.DataFrame()
    df_output['VISIT'] = df_input['VISIT'].apply(lambda x: extract(x))
    print(df_output)
    

    The output will be:

       VISIT
    0   C1D1
    1  C1D12
    2   None
    

    If you want empty string instead of None, you have to edit the extract function:

    def extract(year):
        matches = re.findall('C\dD\d{1,2}', year)
        if matches:
            return matches[0]
        return ""