I have a pandas DataFrame named result, laid out like this:
target  type  post
1       intj  "hello world shdjd"
2       entp  "hello world fddf"
16      estj  "hello world dsd"
4       esfp  "hello world sfs"
1       intj  "hello world ddfd"
Each post is unique, and target just assigns a number from 1 to 16 to each of the 16 types. There are tens of thousands of rows, but only the 16 types repeat.
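For reference, here's a minimal reconstruction of the sample above, just for reproducibility:

import pandas as pd

# Minimal reconstruction of the sample DataFrame shown above
result = pd.DataFrame({
    'target': [1, 2, 16, 4, 1],
    'type':   ['intj', 'entp', 'estj', 'esfp', 'intj'],
    'post':   ['hello world shdjd', 'hello world fddf',
               'hello world dsd', 'hello world sfs',
               'hello world ddfd'],
})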
I'm following "Pandas groupby category, rating, get top value from each category?", trying to pull the top n most-used words from all posts under each of the 16 categories.
Per the SO post, I need to 1) group all posts of each type and 2) run TfidfVectorizer on each of the 16 groups. What makes this difficult is how large the DataFrame is.
So far, I've tried grouping using:
result = result.reset_index()
print(result.loc[result.groupby('type').post.agg('idxmax')])
But this raises a ValueError, I think because idxmax can only be used with numeric data. And what I really need is to concatenate all the strings for each type into one.
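What I imagine I need instead is to join every post of a type into a single document, something like this untested sketch:

# Concatenate every post of each type into one long string (one document per type)
docs = result.groupby('type')['post'].agg(' '.join)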
On the top words part, I have that working with this example:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')
corpus = [
    'I would like to check this document',
    'How about one more document',
    'Aim is to capture the key words from the corpus',
    'frequency of words in a document is called term frequency'
]
X = tfidf.fit_transform(corpus)
# get_feature_names() on scikit-learn < 1.0; get_feature_names_out() on newer versions
feature_names = np.array(tfidf.get_feature_names_out())

new_doc = ['can key words words words in this new document be identified?',
           'idf is the inverse document frequency frequency frequency calculated for each of the words']
responses = tfidf.transform(new_doc)

def get_top_tf_idf_words(response, top_n=2):
    # Indices of the nonzero tf-idf scores, sorted descending, top n kept
    sorted_nzs = np.argsort(response.data)[:-(top_n + 1):-1]
    return feature_names[response.indices[sorted_nzs]]
And I guess after grouping, I would just run a for loop over the 16 VERY long strings, getting the top words for each?
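Roughly, I picture combining the pieces like this (an untested sketch, reusing the ' '.join grouping idea and my helper above):

# One concatenated document per type, then top words from each row
docs = result.groupby('type')['post'].agg(' '.join)
X = tfidf.fit_transform(docs)
feature_names = np.array(tfidf.get_feature_names_out())

for mbti_type, row in zip(docs.index, X):
    # each row of the sparse matrix is one type's concatenated document
    print(mbti_type, get_top_tf_idf_words(row, top_n=10))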
How can I do this properly?
Not sure if this is what you're looking for, but you can instead try:
result.groupby('type')['post'].agg(pd.Series.mode)
from https://stackoverflow.com/a/54304691/5323399
If you want to look at more than a single top value, you can try a lambda function with value_counts(), as shown in that question, only adding nlargest() to the end of the function.
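For example, something along these lines (a rough sketch with n=3; note that if every post really is unique, each count will just be 1):

# Top 3 most frequent post values per type
result.groupby('type')['post'].agg(lambda x: x.value_counts().nlargest(3).index.tolist())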