I am attempting to read two .dat files and write a program that uses each value of aid2name as a key in a dictionary whose values come from aid2numplays. The hope is that the program will produce results of the form (artist name, artist id, frequency of plays). Worth noting that the first file provides the artist name and artist id, while the second provides a user id, an artist id, and a play frequency for that user. Any ideas how to aggregate those frequencies across users and then display them in the (artist name, artist id, frequency of plays) format? Below is what I have managed so far; after the code is a rough sketch of the output step I have in mind.
import codecs

aid2name = {}
d2 = {}

fp = codecs.open("artists.dat", encoding="utf-8")
fp.readline()  # skip first line of headers
for line in fp:
    line = line.strip()
    fields = line.split('\t')
    aid = int(fields[0])
    name = fields[1]
    aid2name[aid] = name       # map artist id -> artist name
    d2.setdefault(name, [])    # placeholder list for this artist's play data
    # print(aid2name)
    # do other processing
    # print(d2)
aid2numplays = {}
fp = codecs.open("user_artists.dat", encoding="utf-8")
fp.readline()  # skip first line of headers
for line in fp:
    line = line.strip()
    fields = line.split('\t')
    uid = int(fields[0])
    aid = int(fields[1])
    weight = int(fields[2])
    aid2numplays[aid] = weight   # overwrites on every row instead of aggregating
    # print(aid2numplays)
    # print(uid, aid, weight)

# this is where I am stuck: I want to group the (aid, weight) pairs
# under each artist name in d2, but the aggregation is not right
for aid, weight in aid2numplays.items():
    group = d2.setdefault(aid2name[aid], [])  # key might exist already
    group.append((aid, weight))
    print(group)
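To make the target concrete, the final step I have in mind would be something like the sketch below. This is not working code as it stands; it assumes aid2numplays could somehow end up mapping each artist id to its aggregated play count, which is the part I cannot get right:

# rough sketch of the desired output step, assuming aid2numplays
# maps artist id -> total play count aggregated across users
for aid, plays in aid2numplays.items():
    print((aid2name.get(aid), aid, plays))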
Edit: Regarding the use of setdefault, if you wanted to group the user data by artistID then you could:
grouped_data = {}
for u in users:
    k, v = u[1], {'userID': u[0], 'weight': u[2]}
    grouped_data.setdefault(k, []).append(v)
This is essentially the same as writing:
grouped_data = {}
for u in users:
    k, v = u[1], {'userID': u[0], 'weight': u[2]}
    if k in grouped_data:
        grouped_data[k].append(v)
    else:
        grouped_data[k] = [v]
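For comparison, collections.defaultdict from the standard library expresses the same grouping pattern. This is just a sketch that assumes users is the same list of [userID, artistID, weight] rows used in the loops above:

from collections import defaultdict

grouped_data = defaultdict(list)
for u in users:
    k, v = u[1], {'userID': u[0], 'weight': u[2]}
    grouped_data[k].append(v)   # missing keys get an empty list automatically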
As an example of how to count the number of times an artist appears across different users' data, you could read the data into lists of lists:
with codecs.open("artists.dat", encoding="utf-8") as f:
    artists = f.readlines()
with codecs.open("user_artists.dat", encoding="utf-8") as f:
    users = f.readlines()

artists = [x.strip().split('\t') for x in artists][1:]  # [['1', 'MALICE MIZER', ..
users = [x.strip().split('\t') for x in users][1:]      # [['2', '51', '13883'], ..]
Iterate over artists, creating a dictionary with the artistID as the key and a placeholder for the play stats:
data = {}
for a in artists:
    artistID, name = a[0], a[1]
    data[artistID] = {'name': name, 'plays': 0}
Iterate over users, updating the dictionary with each row:
for u in users:
    artistID = u[1]
    data[artistID]['plays'] += 1
Output for data:
{'1': {'name': 'MALICE MIZER', 'plays': 3},
 '2': {'name': 'Diary of Dreams', 'plays': 12},
 '3': {'name': 'Carpathian Forest', 'plays': 3}, ..}
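If the end goal is the (artist name, artist id, frequency of plays) format from the question, one way to print it from data is sketched below, with the caveat that 'plays' here counts how many users listed the artist rather than summing the weight column:

# turn the data dictionary into (artist name, artist id, plays) tuples
for artistID, stats in data.items():
    print((stats['name'], int(artistID), stats['plays']))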
Edit: To iterate over the user data and create a dictionary of all the artists associated with each user, we could:
# assumes artists and users are the raw lines from readlines(), before the split above
artist_list = [x.strip().split('\t') for x in artists][1:]
user_stats_list = [x.strip().split('\t') for x in users][1:]

artists = {}
for a in artist_list:
    artistID, name = a[0], a[1]
    artists[artistID] = name

grouped_user_stats = {}
for u in user_stats_list:
    userID, artistID, weight = u
    if userID not in grouped_user_stats:
        grouped_user_stats[userID] = {artistID: {'name': artists[artistID], 'plays': 1}}
    else:
        if artistID not in grouped_user_stats[userID]:
            grouped_user_stats[userID][artistID] = {'name': artists[artistID], 'plays': 1}
        else:
            grouped_user_stats[userID][artistID]['plays'] += 1
            print('this never happens')
            # it looks like the same artist is never listed twice for the same user
Output:
{'2': {'100': {'name': 'ABC', 'plays': 1},
       '51': {'name': 'Duran Duran', 'plays': 1},
       '52': {'name': 'Morcheeba', 'plays': 1},
       '53': {'name': 'Air', 'plays': 1}, .. },
 ..
}
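Finally, if "frequency of plays" is meant to be the weight column summed across all users rather than a count of users, a sketch along these lines would produce the (artist name, artist id, frequency of plays) tuples, reusing the artists dictionary and user_stats_list from above:

# sum the listening weight for each artist across every user
total_plays = {}
for userID, artistID, weight in user_stats_list:
    total_plays[artistID] = total_plays.get(artistID, 0) + int(weight)

# print (artist name, artist id, total plays); .get() guards against
# artist ids that appear in the user data but not in artists.dat
for artistID, plays in total_plays.items():
    print((artists.get(artistID), int(artistID), plays))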