Tags: python, pandas, dataframe, group-by

How to transform top n values in a dataframe column according to conditions?


I have a pandas DataFrame that contains reviews. Each row holds one token (word) of a review, together with a score for that token:

import pandas as pd

df = pd.DataFrame({
    "review_num": [2,2,2,1,1,1,1,1,3,3],
    "review": ["The second review", "The second review", "The second review",
               "This is the first review", "This is the first review",
               "This is the first review", "This is the first review",
               "This is the first review",'Not Noo', 'Not Noo'],
    "token_num":[1,2,3,1,2,3,4,5,1,2],
    "token":["The", "second", "review", "This", "is", "the", "first", "review", "Not", "Noo"],
    "score":[0.3,-0.6,0.4,0.5,0.6,0.7,-0.6,0.4,0.5,0.6]
})

With the following code, I can modify each review by applying a transformation function to its highest-scoring word, and build a new DataFrame that holds both the original and the modified review.

# Index label of the row with the max score in each review
token_max_score = df.groupby("review_num", sort=False)["score"].idxmax()

# Keep only the max-score row of each review
Modified_df = df.loc[token_max_score, ["review_num", "review"]]


def modify_word(w):
    return w + "E"  # just to simplify the example


# Add the new column
Modified_df = Modified_df.join(
    pd.DataFrame(
        {
            "Modified_review": [
                txt.replace(w, modify_word(w))
                for w, txt in zip(
                    df.loc[token_max_score, "token"], df.loc[token_max_score, "review"]
                )
            ]
        },
        index=token_max_score,
    )
)

I need to apply the transformation function to the n highest-scoring words of each review, not just to the single highest-scoring word as in my code.

The current modified dataframe is:

   review_num                    review           Modified_review
2           2         The second review        The second reviewE
5           1  This is the first review  This is theE first review
9           3                   Not Noo                    Not NooE

The expected modified dataframe for n=2 is:

   review_num                    review              Modified_review
2           2         The second review          TheE second reviewE
5           1  This is the first review   This isE theE first review
9           3                   Not Noo                    NotE NooE

Thank you for your help.


Solution

  • Here is one way to do it with pandas apply, reusing modify_word and the original df from the question:

    # Collect tokens and scores per review, then sort the tokens by descending score
    df = df.groupby(["review_num", "review"]).agg(list)[["token", "score"]]
    df["token_and_score"] = df.apply(
        lambda x: {t: s for t, s in zip(x["token"], x["score"])}, axis=1
    )
    df["token_and_score"] = df["token_and_score"].apply(
        lambda x: sorted(x.items(), key=lambda y: y[1], reverse=True)
    )
    
    # For each of the N top-scoring tokens, apply 'modify_word' to the matching word in "modified_review"
    df = df.reset_index()
    df["modified_review"] = df["review"]
    N = 2
    for i in range(N):
        df["modified_review"] = df.apply(
            lambda x: " ".join(
                [
                    modify_word(word)
                    if (
                        i < len(x["token_and_score"]) and word == x["token_and_score"][i][0]
                    )
                    else word
                    for word in x["modified_review"].split(" ")
                ]
            ),
            axis=1,
        )
    
    # Cleanup
    df = df[["review_num", "review", "modified_review"]]
    

    Then:

    print(df)
    # Output
       review_num                    review             modified_review
    0           1  This is the first review  This isE theE first review
    1           2         The second review         TheE second reviewE
    2           3                   Not Noo                   NotE NooE
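
  • A possible alternative, offered only as a sketch: instead of matching words by equality, flag the n highest-scoring rows of each review by their index and rebuild the review from its tokens. The helper name modify_top_n and the column name new_token are hypothetical, not part of the question or of pandas. The sketch assumes that joining a review's tokens with single spaces, in token_num order, reproduces the review text, which holds for the sample data but not necessarily for real reviews with punctuation.

```python
import pandas as pd

df = pd.DataFrame({
    "review_num": [2, 2, 2, 1, 1, 1, 1, 1, 3, 3],
    "review": ["The second review"] * 3
              + ["This is the first review"] * 5
              + ["Not Noo"] * 2,
    "token_num": [1, 2, 3, 1, 2, 3, 4, 5, 1, 2],
    "token": ["The", "second", "review", "This", "is", "the",
              "first", "review", "Not", "Noo"],
    "score": [0.3, -0.6, 0.4, 0.5, 0.6, 0.7, -0.6, 0.4, 0.5, 0.6],
})


def modify_word(w):
    return w + "E"  # placeholder transformation, as in the question


def modify_top_n(df, n):
    """Hypothetical helper: modify the n highest-scoring tokens of each review."""
    # Index labels of the n highest-scoring rows within each review
    top_idx = (
        df.sort_values("score", ascending=False, kind="mergesort")
        .groupby("review_num", sort=False)
        .head(n)
        .index
    )
    is_top = df.index.isin(top_idx)

    # Work on a copy so the caller's DataFrame is left untouched
    out = df.copy()
    out["new_token"] = out["token"]
    out.loc[is_top, "new_token"] = out.loc[is_top, "token"].map(modify_word)

    # Rebuild each review from its tokens (assumes single-space separation)
    return (
        out.sort_values(["review_num", "token_num"])
        .groupby(["review_num", "review"])["new_token"]
        .agg(" ".join)
        .reset_index(name="modified_review")
    )


result = modify_top_n(df, n=2)
print(result)
```

    Because rows are selected by index rather than by comparing strings, a word that occurs several times in a review is only modified at the positions that actually rank in the top n. If a review has fewer than n tokens, head(n) simply returns all of them.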