I have a pandas DataFrame with a column of strings. For each string, I want to extract the number after the letter "H" and put this value, as an integer, into another column.
import pandas as pd
import numpy as np
inp = [{"H-Type": np.nan, 'SomeValue': "Influenza A(H1N1)pdm09 Virus"},
       {"H-Type": np.nan, 'SomeValue': "Influencer A(H3N2) Virus"},
       {"H-Type": 1, 'SomeValue': "Influenza A Virus"},
       {"H-Type": np.nan, 'SomeValue': "Influenza A Virus"}]
df = pd.DataFrame(inp)
Out:
   H-Type                     SomeValue
0     NaN  Influenza A(H1N1)pdm09 Virus
1     NaN      Influencer A(H3N2) Virus
2     1.0             Influenza A Virus
3     NaN             Influenza A Virus
What I need is the H-Type values like so:
   H-Type                     SomeValue
0       1  Influenza A(H1N1)pdm09 Virus
1       3      Influencer A(H3N2) Virus
2       1             Influenza A Virus
3     NaN             Influenza A Virus
I can use slice to get the correct value for the first row, like so:
df["H-Type"].where(~df["H-Type"].isna(), df["SomeValue"].str.slice(start=13, stop=14))
I can define the correct start (and stop) positions like so:
df["SomeValue"].str.find("H")+1
But when I try to define the start and stop values using .str.find, it just returns NaN:
df["H-Type"].where(~df["H-Type"].isna(), df["SomeValue"].str.slice(
    start=(df["SomeValue"].str.find("H")+1), stop=(df["SomeValue"].str.find("H")+2)))
Is there an alternative to .str.find for defining the start and stop positions, or am I applying it incorrectly?
Use Series.str.extract to build a new Series from the strings, then use it to replace the missing values:
df["H-Type"] = df["H-Type"].fillna(df["SomeValue"].str.extract(r'H(\d)', expand=False))
print (df)
  H-Type                     SomeValue
0      1  Influenza A(H1N1)pdm09 Virus
1      3      Influencer A(H3N2) Virus
2    1.0             Influenza A Virus
3    NaN             Influenza A Virus
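To see what the extraction contributes on its own, here is a minimal sketch that runs str.extract on a standalone Series. The H10N8 row is a hypothetical addition, not part of the question's data, included to show that the pattern H(\d) keeps only the first digit while H(\d+) keeps all consecutive digits.

```python
import pandas as pd

# Sample strings; the H10N8 row is a hypothetical addition for illustration.
s = pd.Series([
    "Influenza A(H1N1)pdm09 Virus",
    "Influencer A(H3N2) Virus",
    "Influenza A(H10N8) Virus",
    "Influenza A Virus",
])

# One capture group with expand=False returns a Series of strings;
# rows without a match are missing.
single = s.str.extract(r"H(\d)", expand=False)

# \d+ captures every consecutive digit after "H", not only the first one.
multi = s.str.extract(r"H(\d+)", expand=False)

print(pd.DataFrame({"single": single, "multi": multi}))
```

Because the extracted values are strings, filling a float column with them produces a column of mixed types, which is why the printed result above shows 1 and 3 next to 1.0.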
If you need integers alongside missing values, cast the extracted digits to floats first and then convert the column to the nullable Int64 dtype:
df["H-Type"] = (df["H-Type"]
                .fillna(df["SomeValue"].str.extract(r'H(\d)', expand=False).astype(float))
                .astype('Int64'))
print (df)
   H-Type                     SomeValue
0       1  Influenza A(H1N1)pdm09 Virus
1       3      Influencer A(H3N2) Virus
2       1             Influenza A Virus
3    <NA>             Influenza A Virus
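The same conversion can be split into named steps to make the dtype changes visible. This is only a sketch that rebuilds the question's data; the intermediate names digits and filled are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "H-Type": [np.nan, np.nan, 1, np.nan],
    "SomeValue": [
        "Influenza A(H1N1)pdm09 Virus",
        "Influencer A(H3N2) Virus",
        "Influenza A Virus",
        "Influenza A Virus",
    ],
})

# Step 1: digit strings such as "1" and "3", missing where nothing matched.
digits = df["SomeValue"].str.extract(r"H(\d)", expand=False)

# Step 2: cast to float so the fill values match the float64 H-Type column.
filled = df["H-Type"].fillna(digits.astype(float))

# Step 3: "Int64" (capital I) is the nullable integer dtype, so the
# unmatched row stays missing. The plain NumPy "int64" cannot hold a
# missing value and would raise an error here.
df["H-Type"] = filled.astype("Int64")

print(df.dtypes)
print(df)
```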
Your where-based approach also works once the slice is replaced by str.extract:
df["H-Type"] = df["H-Type"].where(df["H-Type"].notna(),
                                  df["SomeValue"].str.extract(r'H(\d)', expand=False))
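If you need to extract by position instead, use the following solution. It does not test for digits; it takes whatever single character follows the first H, so the float cast fails if that character is not a digit: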
s = df["SomeValue"].apply(lambda x: x[x.find("H")+1: x.find("H")+2]
                          if x.find('H') != -1 else np.nan)
df["H-Type"] = df["H-Type"].where(df["H-Type"].notna(), s.astype(float))
print (df)
   H-Type                     SomeValue
0     1.0  Influenza A(H1N1)pdm09 Virus
1     3.0      Influencer A(H3N2) Virus
2     1.0             Influenza A Virus
3     NaN             Influenza A Virus
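As for why the original attempt returned NaN: Series.str.slice takes scalar integers for start and stop, so it cannot slice each row at a different position, and a per-row function is one way around that. Below is a hedged sketch of the positional approach with a digit check added. The helper digit_after and the "Human Influenza Virus" row are hypothetical, introduced only for illustration, and the sketch assumes every value in SomeValue is a string.

```python
import numpy as np
import pandas as pd


def digit_after(text, marker="H"):
    """Return the digit right after the first `marker` as a float, else NaN.

    Hypothetical helper; assumes `text` is a string.
    """
    pos = text.find(marker)
    # Empty string when the marker is absent or is the last character.
    ch = text[pos + 1:pos + 2] if pos != -1 else ""
    return float(ch) if ch.isdecimal() else np.nan


df = pd.DataFrame({
    "H-Type": [np.nan, np.nan, 1, np.nan, np.nan],
    "SomeValue": [
        "Influenza A(H1N1)pdm09 Virus",
        "Influencer A(H3N2) Virus",
        "Influenza A Virus",
        "Influenza A Virus",
        "Human Influenza Virus",  # hypothetical row: "H" followed by a letter
    ],
})

# apply calls the helper once per row, so each row gets its own position.
s = df["SomeValue"].apply(digit_after)

# Keep existing H-Type values and fill only the missing ones.
df["H-Type"] = df["H-Type"].where(df["H-Type"].notna(), s)
print(df)
```

The last row stays missing because the character after its first "H" is a letter, whereas a plain float cast on that character would raise an error.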