I have two dataframes:
The city_state dataframe:
   city         state
0  huntsville   alabama
1  montgomery   alabama
2  birmingham   alabama
3  mobile       alabama
4  dothan       alabama
5  chicago      illinois
6  boise        idaho
7  des moines   iowa
and the sentence dataframe:
   sentence
0  marthy was born in dothan
1  michelle reads some books at her home
2  hasan is highschool student in chicago
3  hartford of the west is the nickname of des moines
I want to add a new feature column called city to the sentence dataframe. If a sentence contains the name of a city listed in city_state['city'], the column should hold that city name; if it contains no such name, the value should be Null.
The expected dataframe looks like this:
   sentence                                             city
0  marthy was born in dothan                            dothan
1  michelle reads some books at her home                Null
2  hasan is highschool student in chicago               chicago
3  hartford of the west is the nickname of des moines   des moines
I have run this code:
sentence['city'] ={}

for city in city_state.city:
    for text in sentence.sentence:
        words = text.split()
        for word in words:
            if word == city:
                sentence['city'].append(city)
                break
        else:
            sentence['city'].append(None)
but it fails with this error:
ValueError: Length of values does not match length of index
If you have experience with feature engineering in a similar case, could you suggest how to write code that produces the expected result?
Thank you.
Note: This is the full log of the error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-205-8a9038a015ee> in <module>
----> 1 sentence['city'] ={}
2
3 for city in city_state.city:
4 for text in sentence.sentence:
5 words = text.split()
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
3117 else:
3118 # set column
-> 3119 self._set_item(key, value)
3120
3121 def _setitem_slice(self, key, value):
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value)
3192
3193 self._ensure_valid_index(value)
-> 3194 value = self._sanitize_column(key, value)
3195 NDFrame._set_item(self, key, value)
3196
~\Anaconda3\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
3389
3390 # turn me into an ndarray
-> 3391 value = _sanitize_index(value, self.index, copy=False)
3392 if not isinstance(value, (np.ndarray, Index)):
3393 if isinstance(value, list) and len(value) > 0:
~\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_index(data, index, copy)
3999
4000 if len(data) != len(index):
-> 4001 raise ValueError('Length of values does not match length of ' 'index')
4002
4003 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
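The traceback points at the very first line rather than at the loop. In the pandas version shown in the log, `sentence['city'] = {}` tries to use an empty dict (length 0) as a column of a 4-row frame, hence the length mismatch. Two further issues would surface afterwards: a DataFrame column cannot be grown in place like a Python list, and with the city loop on the outside the code would produce one value per (city, sentence) pair instead of one per sentence. Comparing single words with `==` would also never match a two-word name such as `des moines`. A minimal sketch of the same loop idea, assuming the two frames from the question (rebuilt here so the snippet runs on its own; `cities`, `found` and `padded` are illustrative names, and the space padding assumes sentences without punctuation):

```python
import pandas as pd

city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama", "alabama", "alabama", "alabama",
              "alabama", "illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({
    "sentence": [
        "marthy was born in dothan",
        "michelle reads some books at her home",
        "hasan is highschool student in chicago",
        "hartford of the west is the nickname of des moines",
    ]
})

cities = city_state["city"].tolist()

found = []  # exactly one entry per sentence, in row order
for text in sentence["sentence"]:
    # Pad with spaces so that only whole words (or whole multi-word
    # names such as "des moines") match, not fragments of longer words.
    padded = f" {text} "
    # First city whose name occurs in the sentence, or None if there is none.
    match = next((c for c in cities if f" {c} " in padded), None)
    found.append(match)

# Assign once, with a list whose length equals the number of rows.
sentence["city"] = found
```

The key point is that the column is assigned a single time, from a list that already has one value per row.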
Here is a quick and dirty approach using apply. I haven't tested it on large dataframes, so use it with caution. First define a function to extract city names:
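def ex_city(col, cities):
    # Collect every city name that occurs in the sentence text `col`.
    output = []
    for w in cities:
        # Substring test, so multi-word names such as "des moines" match too.
        if w in col:
            output.append(w)
    # Join multiple hits with commas; return None when no city was found.
    return ','.join(output) if output else None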
Then apply it to your sentence dataframe:
city_list = city_state.city.unique().tolist()
sentence['city'] = sentence['sentence'].apply(lambda x: ex_city(x, city_list))
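One caveat with `w in col`: it is a substring test, so a city such as `mobile` would also be reported for a sentence containing `automobile`. A possible alternative, sketched here under the assumption that only whole-word matches are wanted and that one city per sentence is enough, builds a single regular expression from the city names and passes it to `Series.str.extract`. The names `cities` and `pattern` are illustrative, and the frames are rebuilt from the question so the snippet runs on its own:

```python
import re

import pandas as pd

city_state = pd.DataFrame({
    "city": ["huntsville", "montgomery", "birmingham", "mobile",
             "dothan", "chicago", "boise", "des moines"],
    "state": ["alabama", "alabama", "alabama", "alabama",
              "alabama", "illinois", "idaho", "iowa"],
})
sentence = pd.DataFrame({
    "sentence": [
        "marthy was born in dothan",
        "michelle reads some books at her home",
        "hasan is highschool student in chicago",
        "hartford of the west is the nickname of des moines",
    ]
})

# Longest names first, so a longer name is tried before a shorter one
# that it might contain.
cities = sorted(city_state["city"].unique(), key=len, reverse=True)

# One capture group with all city names as alternatives; \b restricts
# matches to whole words. re.escape guards against regex metacharacters.
pattern = r"\b(" + "|".join(re.escape(c) for c in cities) + r")\b"

# expand=False returns a Series; rows without a match hold a missing value.
sentence["city"] = sentence["sentence"].str.extract(pattern, expand=False)
```

Unlike the `apply` version, this keeps only the first city found in each sentence. If all matches are needed, `Series.str.findall` with the same pattern would return them as a list per row.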