Tags: python, pandas, dataframe, numpy, nlp

How to calculate a column using the most common words calculated from another dataframe in Python?


Example of the dataframe:

import pandas as pd

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie',
                       'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}

cake = pd.DataFrame(data=cup)

cake
  1. I calculated the most common words of the dataframe (with stop words removed)

    from collections import Counter
    
    Counter(" ".join(cake['Description']).split()).most_common()
    
  2. I put this into a new dataframe and reset the index

    import numpy as np

    count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())
    
    count.columns = ['Words', 'Values']
    
    count.index= np.arange(1, len(count)+1)
    
    count.head()
    
  3. The Values column is in the 'count' dataframe, and Days_Sold is in the 'cake' dataframe. What I would like to do now: whenever a common word from the 'count' dataframe shows up (like cupcake), work out how long it takes me to sell that product using the 'cake' dataframe, and repeat this for every common word in 'count'. For cupcake the answer should come out to (3 + 4 + 1) = 8.

My actual dataframe is over 3,000 lines (and not actually about cakes), with much longer descriptions. I need over 40 common words, and the number needs to be adjustable.
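As an aside, the number of common words kept can be made adjustable by passing `n` to `most_common`; a minimal sketch using the example dataframe (the variable name `n_top` is just an illustration):

```python
from collections import Counter
import pandas as pd

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie',
                       'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)

n_top = 40  # adjustable; only the n_top most frequent words are kept
top_words = Counter(" ".join(cake['Description']).split()).most_common(n_top)
count = pd.DataFrame(top_words, columns=['Words', 'Values'])
```

With fewer distinct words than `n_top`, `most_common` simply returns all of them.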

This is why I can't type in each word by hand. I believe this requires a nested for loop, but I am stuck on it:

for day in cake:

    for top in count:

       top= count.Words

    day= cake.loc[cake['CleanDescr'] == count, ['Days_Sold']]

The error says: 'int' object is not iterable
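For reference, the per-word totals described in step 3 can be computed without a nested loop at all, by exploding each description into one row per word and summing; a minimal sketch (requires pandas 0.25+ for `explode`):

```python
import pandas as pd

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie',
                       'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)

# One row per word, with each word's Days_Sold still aligned to its original row
words = cake.assign(Word=cake['Description'].str.split()).explode('Word')

# Total days sold per word
totals = words.groupby('Word')['Days_Sold'].sum()
print(totals['cupcake'])  # 3 + 4 + 1 = 8
```

Restricting the result to the top-n common words is then just a matter of filtering `totals` by the words in `count`.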

Thank you!

Update:

Thank you so much to everyone helping me on this large project. I am posting my solution to #3, adjusted from the answer by Mark Moretto.

# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
df

# Merge counts onto the exploded words (the word column in 'count' is named 'Words')
df_freq = pd.merge(df, count, left_on="Description", right_on="Words")
df_freq

# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on="index", right_index=True)
            .loc[:, ["Description_x", "Values", "Days_Sold"]]
            .rename(columns={"Description_x": "Description"})
            )
df_freq

# Group by Description and average the numeric columns
df_metrics = df_freq.groupby("Description").mean().round(4)
df_metrics

df_metrics.sort_values(by='Values', ascending=False).head(5)
#print(df_metrics)

Solution

  • Another way, though note that I didn't filter out any words by frequency or remove stop words.

    # <...your starter code for dataframe creation...>
    
    # Split and explode Description
    df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
    
    # Get count of words
    df_counts = (df.groupby("Description")
                .size()
                .reset_index()
                .rename(columns={0: "word_count"})
                )
    
    # Merge counts to main DataFrame
    df_freq = pd.merge(df, df_counts, on="Description")
    
    # Left join cake DataFrame onto df_freq by index values.
    df_freq = (pd.merge(df_freq, cake, left_on="index", right_index=True)
                .loc[:, ["Description_x", "word_count", "Days_Sold"]]
                .rename(columns={"Description_x": "Description"})
                )
    
    # Group by Description and return max result for value fields
    df_metrics = df_freq.groupby("Description").max()
    
    print(df_metrics)
    

    Output:

                 word_count  Days_Sold
    Description
    blueberry             2          4
    cookie                3          2
    cupcake               3          4
    grape                 1          2
    lemon                 1          1
    organic               3          2
    strawberry            2          3