New to Python: I am using Spotify's spotipy package to try to create a "music taste diversity score" for my top Spotify artists. I have clustered all the artists' songs based on 7 of Spotify's built-in audio features and displayed the number of songs in each cluster as well as the distribution of artists across clusters. Since this is my first full Python project, I am struggling to get these values into workable numpy arrays or DataFrames so I can move on to generating my diversity score.
# display distribution of clusters
df2 = df.groupby(['cluster group'])['artist'].size()
df2
output: cluster distribution
# display distribution of artists in cluster groups
df2 = df.groupby(['cluster group', 'artist']).size()
df2
output: artist distribution in clusters
I have tried iterating for artist in cluster and other similar methods, but it seems the only iterable in df2 is the count for each cluster's artists. Can someone point me towards a way to extract the values for each cluster group to work with?
> I have tried iterating for artist in cluster and other similar methods, but it seems the only iterable in df2 is the count for each cluster's artists.
The reason is that df2 is, in fact, a pandas Series in both your examples. Therefore df2 just contains the sequence of integer counts, indexed by cluster group in the first example and by (cluster group, artist) in the second.
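To get those counts into a numpy array or a regular DataFrame, you can use the Series' own conversion methods. A small sketch with toy data (the column names 'cluster group' and 'artist' are taken from your question; the values are made up):

```python
import pandas as pd

# Toy stand-in for the original DataFrame.
df = pd.DataFrame({
    'cluster group': [0, 0, 0, 1, 1, 2],
    'artist': ['A', 'A', 'B', 'B', 'C', 'C'],
})

df2 = df.groupby(['cluster group', 'artist']).size()

# The counts as a plain numpy array:
counts = df2.to_numpy()

# Or turn the Series into an ordinary DataFrame with the index
# levels promoted to columns ('count' is just a chosen name):
tidy = df2.reset_index(name='count')

# Per-cluster slices are easy via the MultiIndex:
cluster0 = df2.loc[0]  # counts for the artists in cluster 0
```

From tidy you can filter, pivot, or export as with any other DataFrame, which should be enough to feed whatever diversity formula you settle on.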
You haven't shown the exact DataFrame you are working with, but based on what you wrote it seems that you want to perform a different aggregation for each column. Is that correct? Say there is a column songs; then the following should give you a single DataFrame combining information from your two df2s, namely the number of distinct artists and distinct songs in each cluster.
df.groupby(['cluster group']).agg({'artist': 'nunique', 'songs': 'nunique'})
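For instance, with hypothetical data (the songs column and its values are assumed, since your actual columns weren't shown):

```python
import pandas as pd

# Hypothetical data; 'songs' is an assumed column name.
df = pd.DataFrame({
    'cluster group': [0, 0, 0, 1, 1],
    'artist':        ['A', 'A', 'B', 'B', 'C'],
    'songs':         ['s1', 's2', 's3', 's4', 's4'],
})

# One row per cluster, with a different aggregation per column:
summary = df.groupby(['cluster group']).agg({'artist': 'nunique',
                                             'songs': 'nunique'})
# summary has columns 'artist' and 'songs' holding the distinct counts.
```

Here cluster 0 contains 2 distinct artists and 3 distinct songs, while cluster 1 contains 2 distinct artists but only 1 distinct song.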
As for the first example, applying the aggregate function size() only to the artist column rather than to all the remaining columns makes no difference, since only the number of rows is counted; df.groupby(['cluster group']).size() gives the same result.
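A quick check of that equivalence on toy data (column names assumed as above):

```python
import pandas as pd

# size() counts rows regardless of which column is selected first.
df = pd.DataFrame({
    'cluster group': [0, 0, 1],
    'artist': ['A', 'B', 'C'],
})

a = df.groupby(['cluster group'])['artist'].size()
b = df.groupby(['cluster group']).size()

# Both Series hold the same per-cluster row counts.
assert a.tolist() == b.tolist()
```

If you wanted per-cluster counts of *distinct* artists instead, that is where nunique rather than size comes in.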