
How to assign Pandas.Series.str.extractall() result back to original dataset? (TypeError: incompatible index of inserted column with frame index)


Dataset brief overview

dete_resignations['cease_date'].head()

gives

result

dete_resignations['cease_date'].value_counts()

gives

result of the code above


What I tried

I was trying to extract only the year (e.g. 05/2012 -> 2012) from dete_resignations['cease_date'] using Pandas.Series.str.extractall() and assign the result back to the original dataframe. However, since not all rows contain a value in that format (e.g. 05/2012), an error occurred.

Here is the code I wrote.

pattern = r"(?P<month>[0-1][0-9])/?(?P<year>[0-2][0-9]{3})"
years = dete_resignations['cease_date'].str.extractall(pattern)
dete_resignations['cease_date_'] = years['year']

'TypeError: incompatible index of inserted column with frame index'


I thought 'years' shared the same index as dete_resignations['cease_date']. So even though the two indexes are not identical, I expected pandas to match them automatically and assign the values to the right rows. It didn't.

Can anyone help me solve this issue?

I would much appreciate it if someone could enlighten me!


Solution

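  • If you only want the years, then don't capture the month in the pattern, and you can use extract instead of extractall. extract returns one row per input row, so the result keeps the original index:

    # the $ anchors the match at the end of the string
    # \d is equivalent to [0-9]
    # the pattern captures the trailing run of digits
    # (raw string, so that \d is not treated as a string escape)
    pattern = r'(?P<year>\d+)$'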
    years = dete_resignations['cease_date'].str.extract(pattern)
    dete_resignations['cease_date_'] = years['year']
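
As for why the original attempt fails: extractall returns one row per match, so its result is indexed by a MultiIndex (the original row label plus a `match` level) rather than by the frame's own index, and the assignment cannot align the two. If you do want to keep extractall, a minimal sketch is below. The sample frame `df`, the 'Not Stated' value and the column name `cease_year` are made up for illustration; they are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for dete_resignations (values are assumptions)
df = pd.DataFrame({"cease_date": ["05/2012", "Not Stated", "09/2010"]})

pattern = r"(?P<month>[0-1][0-9])/?(?P<year>[0-2][0-9]{3})"
matches = df["cease_date"].str.extractall(pattern)

# extractall adds a second index level named "match"
print(matches.index.nlevels)  # 2

# Keep only the first match of each row; this drops the "match" level,
# leaving the original row labels as a plain single-level index
first = matches.xs(0, level="match")

# Assignment now aligns on the row labels; rows without a match get NaN
df["cease_year"] = first["year"]
print(df)
```

Selecting match 0 with xs also guards against rows that contain more than one match, which would otherwise produce duplicate labels once the `match` level is dropped.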