New to Python: I am using Spotify's spotipy package to try to create a "music taste diversity score" for my top Spotify artists. I have clustered all the artists' songs based on 7 of Spotify's built-in audio features and displayed the number of songs in each cluster as well as the distribution of artists across clusters. Since this is my first full Python project, I am struggling to get these values into workable numpy arrays or DataFrames so I can move on to generating my diversity score.
# display distribution of clusters
df2 = df.groupby(['cluster group'])['artist'].size()
df2
output: cluster distribution
# display distribution of artists in cluster groups
df2 = df.groupby(['cluster group', 'artist']).size()
df2
output: artist distribution in clusters
I have tried iterating for artist in cluster and other similar methods, but it seems the only iterable in df2 is the count for each cluster's artists. Can someone point me towards a way to extract the values for each cluster group to work with?
> I have tried iterating for artist in cluster and other similar methods, but it seems the only iterable in df2 is the count for each cluster's artists.
The reason is that df2 is, in fact, a pandas Series in both your examples. Therefore df2 just contains the sequence of integer counts, indexed by cluster group in the first example and by (cluster group, artist) in the second.
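To get those counts into a numpy array or a regular DataFrame, you can use the Series' own conversion methods. A small sketch with toy data (the column names 'cluster group' and 'artist' are taken from your question; the values are made up):

```python
import pandas as pd

# Toy stand-in for the original DataFrame.
df = pd.DataFrame({
    'cluster group': [0, 0, 0, 1, 1, 2],
    'artist': ['A', 'A', 'B', 'B', 'C', 'C'],
})

df2 = df.groupby(['cluster group', 'artist']).size()

# The counts as a plain numpy array:
counts = df2.to_numpy()

# Or turn the Series into an ordinary DataFrame with the index
# levels promoted to columns ('count' is just a chosen name):
tidy = df2.reset_index(name='count')

# Per-cluster slices are easy via the MultiIndex:
cluster0 = df2.loc[0]  # counts for the artists in cluster 0
```

From tidy you can filter, pivot, or export as with any other DataFrame, which should be enough to feed whatever diversity formula you settle on.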
You haven't shown the exact DataFrame you are working with, but based on what you wrote it seems that you want to perform a different aggregation for each column. Is that correct? Say there is a column songs; then the following should give you a single DataFrame combining information from your two df2s, namely the number of distinct artists and distinct songs in each cluster.
df.groupby(['cluster group']).agg({'artist': 'nunique', 'songs': 'nunique'})
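For instance, with hypothetical data (the songs column and its values are assumed, since your actual columns weren't shown):

```python
import pandas as pd

# Hypothetical data; 'songs' is an assumed column name.
df = pd.DataFrame({
    'cluster group': [0, 0, 0, 1, 1],
    'artist':        ['A', 'A', 'B', 'B', 'C'],
    'songs':         ['s1', 's2', 's3', 's4', 's4'],
})

# One row per cluster, with a different aggregation per column:
summary = df.groupby(['cluster group']).agg({'artist': 'nunique',
                                             'songs': 'nunique'})
# summary has columns 'artist' and 'songs' holding the distinct counts.
```

Here cluster 0 contains 2 distinct artists and 3 distinct songs, while cluster 1 contains 2 distinct artists but only 1 distinct song.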
As for the first example, applying the aggregate function size() only to the artist column rather than to all the remaining columns makes no difference, since only the number of rows is counted; df.groupby(['cluster group']).size() gives the same result.
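A quick check of that equivalence on toy data (column names assumed as above):

```python
import pandas as pd

# size() counts rows regardless of which column is selected first.
df = pd.DataFrame({
    'cluster group': [0, 0, 1],
    'artist': ['A', 'B', 'C'],
})

a = df.groupby(['cluster group'])['artist'].size()
b = df.groupby(['cluster group']).size()

# Both Series hold the same per-cluster row counts.
assert a.tolist() == b.tolist()
```

If you wanted per-cluster counts of *distinct* artists instead, that is where nunique rather than size comes in.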