Extract new feature from sentence column - Python


I have two dataframes:

city_state dataframe

    city        state
0   huntsville  alabama
1   montgomery  alabama
2   birmingham  alabama
3   mobile      alabama
4   dothan      alabama
5   chicago     illinois
6   boise       idaho
7   des moines  iowa

and sentence dataframe

    sentence
0   marthy was born in dothan
1   michelle reads some books at her home
2   hasan is highschool student in chicago
3   hartford of the west is the nickname of des moines

I want to extract a new feature called city from the sentence dataframe. The city column should hold the city name when the sentence contains one of the names in city_state['city']; if the sentence contains none of them, the value should be Null.

The expected new dataframe will be like this:

    sentence                                            city
0   marthy was born in dothan                           dothan
1   michelle reads some books at her home               Null
2   hasan is highschool student in chicago              chicago
3   hartford of the west is the nickname of des moines  des moines

I ran this code:

sentence['city'] ={}

for city in city_state.city:
    for text in sentence.sentence:
        words = text.split()
        for word in words:
            if word == city:
                sentence['city'].append(city)
                break
    else:
        sentence['city'].append(None)

but it fails with this error:

ValueError: Length of values does not match length of index

If you have experience with feature engineering in a similar case, could you suggest how to write code that produces the expected result?

Thank you

Note: This is the full traceback of the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-205-8a9038a015ee> in <module>
----> 1 sentence['city'] ={}
      2 
      3 for city in city_state.city:
      4     for text in sentence.sentence:
      5         words = text.split()

~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3117         else:
   3118             # set column
-> 3119             self._set_item(key, value)
   3120 
   3121     def _setitem_slice(self, key, value):

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
   3192 
   3193         self._ensure_valid_index(value)
-> 3194         value = self._sanitize_column(key, value)
   3195         NDFrame._set_item(self, key, value)
   3196 

~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
   3389 
   3390             # turn me into an ndarray
-> 3391             value = _sanitize_index(value, self.index, copy=False)
   3392             if not isinstance(value, (np.ndarray, Index)):
   3393                 if isinstance(value, list) and len(value) > 0:

~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
   3999 
   4000     if len(data) != len(index):
-> 4001         raise ValueError('Length of values does not match length of ' 'index')
   4002 
   4003     if isinstance(data, ABCIndexClass) and not copy:

ValueError: Length of values does not match length of index

Solution

  • Here is a quick-and-dirty approach using apply. I haven't tested it on large dataframes, so use it with caution. First, define a function to extract city names:

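    def ex_city(col, cities):
        # col is a single sentence string; cities is a list of city names
        output = []
        for w in cities:
            # substring test, so "mobile" would also match inside "automobile"
            if w in col:
                output.append(w)
        # join multiple hits with commas; return None when no city is found
        return ','.join(output) if output else None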

    Then apply it to your sentence dataframe:

    city_list = city_state.city.unique().tolist()
    sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
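
As the traceback shows, the original error is raised on the first line, `sentence['city'] = {}`, before the loop ever runs: pandas rejects the empty dict because its length does not match the index. The sketch below is a possible alternative to the `apply` approach, not the answer's own method. It swaps the `in` substring check for `Series.str.extract` with a regex alternation wrapped in `\b` word boundaries, so "mobile" would not match inside "automobile". The names `cities` and `pattern` are illustrative, and the dataframes are rebuilt from the question so the block runs on its own. Note two differences in behavior: only the first city found in each sentence is kept (the function above joins all hits with commas), and rows without a match hold NaN rather than None.

```python
import re

import pandas as pd

# Sample data copied from the question.
city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama"] * 5 + ["illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({
    "sentence": [
        "marthy was born in dothan",
        "michelle reads some books at her home",
        "hasan is highschool student in chicago",
        "hartford of the west is the nickname of des moines",
    ]
})

# Longest names first, so a longer city name is tried before a shorter
# name that happens to be its prefix.
cities = sorted(city_state["city"].unique(), key=len, reverse=True)

# One capture group holding every city name, escaped and bounded by \b
# so that only whole words (or whole multi-word names) match.
pattern = r"\b(" + "|".join(re.escape(c) for c in cities) + r")\b"

# expand=False returns a Series; sentences with no match become NaN.
sentence["city"] = sentence["sentence"].str.extract(pattern, expand=False)
print(sentence)
```

If every matching city is needed rather than just the first, `sentence["sentence"].str.findall(pattern)` returns a list of matches per row instead.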