python, python-3.x, string, dataset, kaggle

How to extract first word from DataFrame


Background

I created the data frame below by combining two datasets from Kaggle.

Titanic: Machine Learning from Disaster (input/titanic/train.csv)

titanic-nationalities

DataFrame name: output

   PassengerId  Nationality                 Name
0  1            CelticEnglish               Braund, Mr. Owen Harris
1  2            CelticEnglish               Cumings, Mrs. John Bradley (Florence Briggs Th...
2  3            Nordic,Scandinavian,Sweden  Heikkinen, Miss. Laina
3  4            CelticEnglish               Futrelle, Mrs. Jacques Heath (Lily May Peel)
...

What I hoped to transform it into

   PassengerId  Nationality    Name
0  1            CelticEnglish  Braund
1  2            CelticEnglish  Cumings
2  3            Nordic         Heikkinen
3  4            CelticEnglish  Futrelle
...

Problem

I ran the code below and got the following error, and I don't know how to fix it.

Error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
----> 1 output['Nationality'].split('\n', 1)[0]
      2 output['Name'].split('\n', 1)[0]

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5137             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5138                 return self[name]
-> 5139             return object.__getattribute__(self, name)
   5140 
   5141     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'split'

Code

output['Nationality'].split('\n', 1)[0]
output['Name'].split('\n', 1)[0]

What I tried to do

I tried converting the columns to strings first, but the result did not change.

output['Nationality'] = output['Nationality'].astype(str)
output['Name'] = output['Name'].astype(str)

output['Nationality'] = output['Nationality'].str.split('\n', expand=True)[0]
output['Name'] = output['Name'].str.split('\n', expand=True)[0]
output
   PassengerId  Nationality                 Name
0  1            CelticEnglish               Braund, Mr. Owen Harris
1  2            CelticEnglish               Cumings, Mrs. John Bradley (Florence Briggs Th...
2  3            Nordic,Scandinavian,Sweden  Heikkinen, Miss. Laina
3  4            CelticEnglish               Futrelle, Mrs. Jacques Heath (Lily May Peel)

Environment

Kaggle Notebook


Solution

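  • A Series object doesn't have a split method. split is a string method, so on a Series it has to be called through the .str accessor (output['Nationality'].str.split(...)), and the column needs to hold strings first (or be expanded out into multiple columns). The separator also has to be one that actually occurs in the values: here that is a comma, not '\n', which is why the attempt above returned the columns unchanged.

    Check the data type of each column with output.dtypes (dtypes is an attribute, so no parentheses).

    Convert a column to strings with output['Nationality'] = output['Nationality'].astype(str)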
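
As a minimal sketch of the steps above, the snippet below rebuilds a three-row stand-in for the output frame (rows 0, 2 and 3 copied from the question; the real frame comes from the merged Kaggle files) and keeps only the text before the first comma in each column. It assumes the comma is the separator you want in both Nationality and Name; that choice is inferred from the sample rows rather than stated in the question.

```python
import pandas as pd

# Stand-in for the `output` frame from the question (sample rows only).
output = pd.DataFrame({
    "PassengerId": [1, 3, 4],
    "Nationality": ["CelticEnglish", "Nordic,Scandinavian,Sweden", "CelticEnglish"],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    ],
})

# dtypes is an attribute, not a method: no parentheses.
print(output.dtypes)

for column in ["Nationality", "Name"]:
    # Make sure every value is a string before using the .str accessor.
    output[column] = output[column].astype(str)
    # .str.split(",") splits each element into a list of pieces;
    # .str[0] then picks the first piece of each list.
    output[column] = output[column].str.split(",").str[0]

print(output)
```

Values without a comma, such as CelticEnglish, come back unchanged because the split yields a single-element list. Note that astype(str) turns missing values into the literal string 'nan', so if the real data contains NaN you may prefer to skip that step for columns that already hold strings.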