I have a Python dataframe with multiple rows and columns, a sample of which I have shared below -
DocName | Content |
---|---|
Doc1 | Hi how you are doing ? Hope you are well. I hear the food is great! |
Doc2 | The food is great. James loves his food. You not so much right ? |
Doc3. | Yeah he is alright. |
I also have a list of 100 words as follows -
list = [food, you, ....]
Now, I need to extract the top N rows with most frequent occurences of each word from the list in the "Content" column. For the given sample of data,
"food" occurs twice in Doc2 and once in Doc1.
"you" occurs twice in Doc 1 and once in Doc 2.
Hence, desired output is :
[food:[doc2, doc1], you:[doc1, doc2], .....]
where N = 2 ( top 2 rows having the most frequent occurence of each word )
I have tried something as follows but unsure how to move further -
list = [food, you, ....]
result = []
for word in list:
result.append(df.Content.apply(lambda row: sum([row.count(word)])))
How can I implement an efficient solution to the above requirement in Python ?
Second attempt (initially I misunderstood your requirements): With df
your dataframe you could try something like:
words = ["food", "you"]
n = 2 # Number of top docs
res = (
df
.assign(Content=df["Content"].str.casefold().str.findall(r"\w+"))
.explode("Content")
.loc[lambda df: df["Content"].isin(set(words))]
.groupby("DocName").value_counts().rename("Counts")
.sort_values(ascending=False).reset_index(level=0)
.assign(DocName=lambda df: df["DocName"] + "_" + df["Counts"].astype("str"))
.groupby(level=0).agg({"DocName": list})
.assign(DocName=lambda df: df["DocName"].str[:n])
.to_dict()["DocName"]
)
The first 3 lines in the pipeline extract the relevant words, one per row. For the sample that looks like:
DocName Content
0 Doc1 you
0 Doc1 you
0 Doc1 food
1 Doc2 food
1 Doc2 food
1 Doc2 you
The next 2 lines count the words per doc (.groupby
and .value_counts
), and sort the result by the counts in descending order (.sort_values
), and add the count to the doc-strings. For the sample:
DocName Counts
Content
you Doc1_2 2
food Doc2_2 2
food Doc1_1 1
you Doc2_1 1
Then .groupby
the words (index) and put the respective docs in a list via .agg
, and restrict the list to the n
first items (.str[:n]
). For the sample:
DocName
Content
food [Doc2_2, Doc1_1]
you [Doc1_2, Doc2_1]
Finally dumping the result in a dictionary.
Result for the sample dataframe
DocName Content
0 Doc1 Hi how you are doing ? Hope you are well. I hear the food is great!
1 Doc2 The food is great. James loves his food. You not so much right ?
2 Doc3 Yeah he is alright.
is
{'food': ['Doc2_2', 'Doc1_1'], 'you': ['Doc1_2', 'Doc2_1']}