Tags: python, string, pandas, feature-extraction, kaggle

Best way to extract a substring from a pandas DataFrame column


Q1

I want to extract the title of each person from the Name column of the concat DataFrame. What is the best way of doing this?

concat['Title'][concat['Title'] == 'Mlle'] = 'Miss'
concat['Title'][concat['Title'] == 'Ms'] = 'Miss'
concat['Title'][concat['Title'] == 'Mme'] = 'Mrs'
concat['Title'][concat['Title'] == 'Dona' or 'Lady'or 'Countess'or'Capt' or 'Col'or'Don'or 'Dr'or 'Major'or 'Rev'or 'Sir'or 'Jonkheer' ] = 'Rare'

Q2

When I run the above code, I get this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Why?


References

The full problem, with datasets: Titanic


Solution

  • Use str.split, and then extract the second item from the resultant list.

    In [37]: df['Name'].head()
    Out[37]: 
    0                              Braund, Mr. Owen Harris
    1    Cumings, Mrs. John Bradley (Florence Briggs Th...
    2                               Heikkinen, Miss. Laina
    3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
    4                             Allen, Mr. William Henry
    Name: Name, dtype: object
    

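    Note that the names follow this format: Last Name, Salutation. Given Name. We'll split on spaces and take the second item of each split list using Series.apply. This assumes the last name is a single word; a last name containing a space would shift the salutation to a later position.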

    In [38]: df['Title'] = df['Name'].str.split(' ').apply(lambda x: x[1])
    
    In [39]: df['Title'].head()
    Out[39]: 
    0      Mr.
    1     Mrs.
    2    Miss.
    3     Mrs.
    4      Mr.
    Name: Title, dtype: object
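
As for the ValueError in Q2: Python parses `concat['Title'] == 'Dona' or 'Lady' ...` as `(concat['Title'] == 'Dona') or 'Lady' ...`, and `or` needs a single True/False from the boolean Series, which pandas refuses to give. Testing membership with Series.isin avoids this. The sketch below is one possible way to put the pieces together, not the only one. It assumes the title is whatever sits between the comma and the next period, uses a small made-up sample of names in the same format, and strips the trailing period so the comparisons against 'Mlle', 'Ms' and so on can match.

```python
import pandas as pd

# Hypothetical sample in the same "Last Name, Title. Given Name" format.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Sagesser, Mlle. Emma",
        "Aubart, Mme. Leontine Pauline",
        "Uruchurtu, Don. Manuel E",
        "van Billiard, Mr. Austin Blyler",  # last name contains a space
    ]
})

# Capture the text between the comma and the next period.
# Unlike splitting on spaces, this does not depend on the last name
# being a single word, and the period is left out of the result.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Map spelling variants onto the common titles.
df["Title"] = df["Title"].replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})

# isin builds one boolean mask for the whole list, so no `or` is needed.
rare_titles = ["Dona", "Lady", "Countess", "Capt", "Col", "Don",
               "Dr", "Major", "Rev", "Sir", "Jonkheer"]
df.loc[df["Title"].isin(rare_titles), "Title"] = "Rare"

print(df["Title"].tolist())
# ['Mr', 'Miss', 'Miss', 'Mrs', 'Rare', 'Mr']
```

Assigning through df.loc[mask, 'Title'] instead of df['Title'][mask] also avoids chained indexing, which can trigger SettingWithCopyWarning. On the real data it is worth checking df['Title'].value_counts() afterwards, since the regular expression is only a guess at the format.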