I have a dataset with a column containing some words. I extracted some of those words using a list, for example ingredients_list=['water','milk', 'yeast', 'banana', 'sugar', 'ananas']. This list is in the correct order, and the extracted words should be sorted according to it. After extracting the words, I created a Series from them, but some rows in this Series contain two words or no words. For example (the actual length of the Series is 25000):
index | ingredients |
---|---|
0 | sugar |
1 | yeast |
2 | |
3 | ananas milk |
4 | sugar water |
5 | milk |
What I want is to order the rows that contain two words, such as index 3 and 4, according to the order of ingredients_list. For example:
index | ingredients |
---|---|
0 | sugar |
1 | yeast |
2 | |
3 | milk ananas |
4 | water sugar |
5 | milk |
The first thing I did was replace the empty rows with 'Unknown'. Then I tried some code:
import re
ingredients_list=['water','milk', 'yeast', 'banana', 'sugar', 'ananas']
pat = '|'.join(r"\b{}\b".format(x) for x in ingredients_list)
ing_l = df['ingredients'].str.findall(pat, flags=re.I).str.join(' ')
ing_l= ing_l.replace("","Unknown")
Then, to sort them according to ingredients_list:
def sort_list(list1, list2):
    zipped_pairs = zip(list2, list1)
    z = [x for _, x in sorted(zipped_pairs)]
    return z
words = sort_list(ing_l, ingredients_list)
OR
d = {v:i for i, v in enumerate(ing_l)}
r = sorted(ingredients_list, key=lambda v: d[v])
But what I got was a list of length 6, the same length as ingredients_list. Then I tried:
ing_l= pd.DataFrame(ing_l)
ing_l['sort'] = [word for x in ingredients_list for word in ing_l if word == x]
But I get an error: ValueError: Length of values (0) does not match length of index (25000). Do you have any solution to this problem? Thank you very much.
You can apply sorted with a custom dictionary as the key on the split string, then join the result again:
ingredients_list=['water','milk', 'yeast', 'banana', 'sugar', 'ananas']
order = {k:v for v,k in enumerate(ingredients_list)}
df['sorted_ingredients'] = (
    df['ingredients']
    .str.split()
    .apply(lambda x: ' '.join(sorted(x, key=order.get)) if isinstance(x, list) else x)
)
Output:
   index  ingredients sorted_ingredients
0      0        sugar              sugar
1      1        yeast              yeast
2      2          NaN                NaN
3      3  ananas milk        milk ananas
4      4  sugar water        water sugar
5      5         milk               milk
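For reference, here is a self-contained sketch of the same idea that can be run without the original data. The sample DataFrame and the helper name sort_by_order are assumptions made for illustration. It also departs from the snippet above in one respect: words missing from ingredients_list are ranked last and matching is case-insensitive, since order.get alone would return None for an unknown word and make sorted raise a TypeError when comparing it with an integer.

```python
import pandas as pd

ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']

# Map each ingredient to its position in the reference list.
order = {word: rank for rank, word in enumerate(ingredients_list)}


def sort_by_order(text):
    """Sort the space-separated words of `text` by their rank in `order`.

    Hypothetical helper: non-string values (None/NaN) are returned unchanged,
    and words not found in `order` are placed after all known words.
    """
    if not isinstance(text, str):
        return text
    words = text.split()
    return ' '.join(sorted(words, key=lambda w: order.get(w.lower(), len(order))))


# Small sample mirroring the table in the question (assumed data).
df = pd.DataFrame(
    {'ingredients': ['sugar', 'yeast', None, 'ananas milk', 'sugar water', 'milk']}
)
df['sorted_ingredients'] = df['ingredients'].apply(sort_by_order)
print(df)
```

Because Python's sorted is stable, unknown words keep their original relative order at the end of each row. If the empty rows were already replaced with 'Unknown', that value passes through unchanged as a single word.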