pandas, string, find, slice, series

In a pandas series of text values, how can I slice out a single character following a specific marker?


I have a pandas dataframe with a series of strings. For each string, I want to extract the number after the letter "H", and put this value, as an integer, into a new column.

import pandas as pd
import numpy as np
inp = [{"H-Type": np.nan, 'SomeValue': "Influenza A(H1N1)pdm09 Virus"},
       {"H-Type": np.nan, 'SomeValue': "Influencer A(H3N2) Virus"},
       {"H-Type": 1, 'SomeValue': "Influenza A Virus"}, 
       {"H-Type": np.nan, 'SomeValue': "Influenza A Virus"}]
df = pd.DataFrame(inp)
Out:
   H-Type                     SomeValue
0     NaN  Influenza A(H1N1)pdm09 Virus
1     NaN      Influencer A(H3N2) Virus
2     1.0             Influenza A Virus
3     NaN             Influenza A Virus

What I need is the H-Type values like so:

   H-Type                     SomeValue
0       1  Influenza A(H1N1)pdm09 Virus
1       3      Influencer A(H3N2) Virus
2       1             Influenza A Virus
3     NaN             Influenza A Virus

I can use slice to get the correct value for the first row, like so:

df["H-Type"].where(~df["H-Type"].isna(), df["SomeValue"].str.slice(start=13, stop=14))

I can define the correct start (and stop) positions like so:

df["SomeValue"].str.find("H")+1

But when I try to define the start and stop values using .str.find, it just returns NaN:

df["H-Type"].where(~df["H-Type"].isna(), df["SomeValue"].str.slice(
    start=(df["SomeValue"].str.find("H")+1), stop=(df["SomeValue"].str.find("H")+2)))

Is there an alternative to .str.find to define the start and stop position? Or am I applying this wrongly?


Solution

  • Use Series.str.extract to pull out the digit after H, and fill the missing values with the resulting Series:

    df["H-Type"] = df["H-Type"].fillna(df["SomeValue"].str.extract(r'H(\d)', expand=False))
    
    print (df)
      H-Type                     SomeValue
    0      1  Influenza A(H1N1)pdm09 Virus
    1      3      Influencer A(H3N2) Virus
    2    1.0             Influenza A Virus
    3    NaN             Influenza A Virus
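
To see what str.extract contributes on its own, here is a minimal sketch on a standalone Series (the name `digits` is only for illustration). With a single capture group and expand=False it returns a Series; rows that match hold the captured text as a string, and rows without a match are missing:

```python
import pandas as pd

s = pd.Series(["Influenza A(H1N1)pdm09 Virus",
               "Influencer A(H3N2) Virus",
               "Influenza A Virus"])

# One capture group + expand=False gives a Series rather than a DataFrame.
digits = s.str.extract(r'H(\d)', expand=False)

# The captured values are strings ('1', '3'), not numbers;
# the last row has no "H<digit>" and so comes back as a missing value.
print(digits.tolist())
```

Because the captured values are strings, the filled column ends up with mixed types unless it is converted afterwards.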
    

    If you need integers alongside missing values, cast the digits to floats and then to Int64:

    df["H-Type"] = (df["H-Type"].fillna(df["SomeValue"].str.extract(r'H(\d)', expand=False)
                                        .astype(float)).astype('Int64')
                    )
    
    print (df)
       H-Type                     SomeValue
    0       1  Influenza A(H1N1)pdm09 Virus
    1       3      Influencer A(H3N2) Virus
    2       1             Influenza A Virus
    3    <NA>             Influenza A Virus
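
The two-step cast can be looked at in isolation. This is a small sketch on made-up input that mirrors what str.extract returns (digit strings plus a missing value); the names `as_float` and `as_int` are illustrative only:

```python
import numpy as np
import pandas as pd

# Stand-in for the extracted values: digit strings and one missing entry.
digits = pd.Series(["1", "3", np.nan], dtype=object)

# Step 1: strings -> floats. NaN is a valid float, so the gap survives.
as_float = digits.astype(float)

# Step 2: floats -> nullable integers. "Int64" (capital I) is the pandas
# extension dtype that can hold <NA> next to whole numbers.
as_int = as_float.astype("Int64")

print(as_int)
```

The plain NumPy int64 dtype has no way to represent a missing value, which is why the nullable Int64 dtype is used here.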
    

    Your approach with where also works once str.slice is replaced by str.extract:

    df["H-Type"] = df["H-Type"].where(df["H-Type"].notna(),
                                      df["SomeValue"].str.extract(r'H(\d)', expand=False))
    

    If you need to extract values by position instead, use this approach (it does not check for digits; it simply takes the character after H):

    s = df["SomeValue"].apply(lambda x: x[x.find("H")+1: x.find("H")+2] 
                                        if x.find('H') != -1 else np.nan)
    df["H-Type"] = df["H-Type"].where(df["H-Type"].notna(), s.astype(float))
    print (df)
       H-Type                     SomeValue
    0     1.0  Influenza A(H1N1)pdm09 Virus
    1     3.0      Influencer A(H3N2) Virus
    2     1.0             Influenza A Virus
    3     NaN             Influenza A Virus
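
As for why the original attempt does not work: Series.str.slice is documented to take integer start and stop values, which are applied identically to every row, so it cannot use a different position per row. A per-row function is one way around that. The sketch below is a slightly more defensive variant of the positional approach; `char_after` is a hypothetical helper, not a pandas function, and the sample strings other than the first two are made up to show the edge cases:

```python
import numpy as np
import pandas as pd

def char_after(text, marker="H"):
    """Return the single character after the first `marker`, or NaN."""
    if not isinstance(text, str):
        return np.nan              # missing or non-text value
    pos = text.find(marker)
    if pos == -1 or pos + 1 >= len(text):
        return np.nan              # marker absent, or nothing follows it
    return text[pos + 1]

s = pd.Series(["Influenza A(H1N1)pdm09 Virus",
               "Influenza A Virus",
               "Hong Kong flu",      # "H" is followed by a letter, not a digit
               np.nan])

chars = s.apply(char_after)

# errors="coerce" turns anything that is not a number (such as "o") into NaN
# instead of raising, which a plain astype(float) would do.
numbers = pd.to_numeric(chars, errors="coerce")
print(numbers)
```

Only the first row yields a number here; the other rows end up missing, including the one where H is followed by a letter.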