Search code examples
pythonpandasdataframetextcount

How to extract the top N rows from a dataframe with most frequent occurences of a word in a list?


I have a Python dataframe with multiple rows and columns, a sample of which I have shared below -

DocName Content
Doc1 Hi how you are doing ? Hope you are well. I hear the food is great!
Doc2 The food is great. James loves his food. You not so much right ?
Doc3. Yeah he is alright.

I also have a list of 100 words as follows -

list = [food, you, ....]           

Now, I need to extract the top N rows with most frequent occurences of each word from the list in the "Content" column. For the given sample of data,

"food" occurs twice in Doc2 and once in Doc1.

"you" occurs twice in Doc 1 and once in Doc 2.

Hence, desired output is :

[food:[doc2, doc1], you:[doc1, doc2], .....]

where N = 2 ( top 2 rows having the most frequent occurence of each word )

I have tried something as follows but unsure how to move further -

list = [food, you, ....]
result = []

for word in list:
    result.append(df.Content.apply(lambda row: sum([row.count(word)])))

How can I implement an efficient solution to the above requirement in Python ?


Solution

  • Second attempt (initially I misunderstood your requirements): With df your dataframe you could try something like:

    words = ["food", "you"]
    n = 2  # Number of top docs
    res = (
        df
        .assign(Content=df["Content"].str.casefold().str.findall(r"\w+"))
        .explode("Content")
        .loc[lambda df: df["Content"].isin(set(words))]
        .groupby("DocName").value_counts().rename("Counts")
        .sort_values(ascending=False).reset_index(level=0)
        .assign(DocName=lambda df: df["DocName"] + "_" + df["Counts"].astype("str"))
        .groupby(level=0).agg({"DocName": list})
        .assign(DocName=lambda df: df["DocName"].str[:n])
        .to_dict()["DocName"]
    )
    
    • The first 3 lines in the pipeline extract the relevant words, one per row. For the sample that looks like:

         DocName Content
      0    Doc1     you
      0    Doc1     you
      0    Doc1    food
      1    Doc2    food
      1    Doc2    food
      1    Doc2     you
      
    • The next 2 lines count the words per doc (.groupby and .value_counts), and sort the result by the counts in descending order (.sort_values), and add the count to the doc-strings. For the sample:

               DocName  Counts
      Content                
      you      Doc1_2       2
      food     Doc2_2       2
      food     Doc1_1       1
      you      Doc2_1       1
      
    • Then .groupby the words (index) and put the respective docs in a list via .agg, and restrict the list to the n first items (.str[:n]). For the sample:

                         DocName
      Content                  
      food     [Doc2_2, Doc1_1]
      you      [Doc1_2, Doc2_1]
      
    • Finally dumping the result in a dictionary.

    Result for the sample dataframe

      DocName                                                              Content
    0    Doc1  Hi how you are doing ? Hope you are well. I hear the food is great!
    1    Doc2  The food is great. James loves his food. You not so much right ?
    2    Doc3  Yeah he is alright.
    

    is

    {'food': ['Doc2_2', 'Doc1_1'], 'you': ['Doc1_2', 'Doc2_1']}