Extracting parts of a string based on substrings in a dataframe - python

I am trying to extract some NBA data, but my dataframe has a column where all of the statistics are in one string separated by spaces in a Stats column as shown: eg. 29pt 15rb 3as 1bl ...etc. I want to extract the values of each statistic so that I have a column for points, rebounds, assist, blocks, etc. The issue I am running into is some rows might not contain every statistic. For example, if a player did not get any assists, their value in the Stats column could look like this: 14pt 3rb 2bl. I am having trouble extracting the values I need for each statistic, especially if the statistic doesn't exist for that player. I have tried to split the string so that its values are placed into a list, and this was promising, but I still cannot figure out how to get the values for each statistic if that statistic even exists for that player. Any recommendations, ideas, or guidance would be greatly appreciated. Thanks.

Solution

We can step through this using Series.str.findall to extract the values (regex explained here), then convert that to a dict that we make back into a DataFrame that has the statistic as the column label and values.

>>> import pandas as pd
>>> s = pd.Series(["29pt 15rb 3as 1bl", "14pt 3rb 2bl"])
>>> s
# 0    29pt 15rb 3as 1bl
# 1         14pt 3rb 2bl
>>> df = pd.DataFrame(
    s.str.findall(r"(\d+)(\w+)")
    .apply(lambda y: {k: v for v, k in y})
    .to_list()
)
>>> df = df.apply(pd.to_numeric)
>>> df
#    pt  rb   as  bl
# 0  29  15  3.0   1
# 1  14   3  NaN   2

This DataFrame can then be merged onto your original one matching on index.

s.str.findall(r"(\d+)(\w+)")

This line applies the regex r"(\d+)(\w+)" which defines two capture groups (the parentheses ()). The first capture group looks for numbers (\d+), and the second capture group looks for word characters (\w+, equivalent to [a-zA-Z0-9_]^†). Using findall returns a list of tuples e.g

"29pt 15rb 3as 1bl" => [('29', 'pt'), ('15', 'rb'), ('3', 'as'), ('1', 'bl')]

Next, .apply(lambda y: {k: v for v, k in y}) is chained on which applies the lambda function over each row. This function converts the list of tuples into a dictionary where the keys are the second element from the tuples above (pt, rb, ...) and the values are the numbers.

[('29', 'pt'), ('15', 'rb'), ('3', 'as'), ('1', 'bl')]
# becomes
{'pt': '29', 'rb': '15', 'as': '3', 'bl': '1'}

Finally, .to_list() converts our pd.Series into a list of dict, which we can use to create a DataFrame that is mergeable, with the original.

† As \w+ is a greedy token that matches any letter, digit, or underscore it maybe better to use [a-zA-Z]+ instead, which will only match letters.