Search code examples
pythondictionaryscipysparse-matrixkey-value

How to create a sparse binary matrix from a dictionary in python


I have a .tsv file from which I've created a pyhton dictionary where the keys are all the movie_id and the values are the features (every movie has a different number of features).

Here's an example of my dictionary:

enter image description here

Goal to achieve:

From this dictionary I want to create an item-features sparse matrix to use for a recommender system project. At the end I would like to have a binary sparse matrix with 1 when a movie has a certain feature. Something like this:

enter image description here

My code:

To create the dictionary:

def Dictionary():
    d={}
    l=[]
    with open(filepath_mapping) as f:
        for line in f.readlines():
            line = line.split()
            key = int(line[0])
            value = [int(el) for el in line[1:]]
            d[key] = value
    return(d)

movie_features_dict = Dictionary()

To create the item-features matrix from the dictionary:

n = len(movie_features_dict)
value_lengths = [len(v) for v in movie_features_dict.values()]
d = max(value_lengths)
print(f"ITEM*FEATURES matrix shape: {n,d}\n")

item_feature_matrix = sp.dok_matrix((n,d), dtype=np.int8)

for movie_ids, features in movie_features_dict.items():
    item_feature_matrix[movie_ids, features] = 1

item_feature_matrix = item_feature_matrix.tocsr()
print(item_feature_matrix.shape)

Issues:

I have 22069 movies and the movie with the maximum number of features should have 885 features, so theoretically I should have a 22069*885 matrix, but with the code I've written I continue having this error:

raise IndexError('index (%d) out of range' % max_indx)
IndexError: index (614734) out of range

Solution

  • I write this answer for future users that will have a similar problem.

    As I said in comments to other answers above, the creation of a new pandas dataframe was not useful for my needs, so this is the solution I've implemented.

    Based on this answer I've created the sparse matrix in this way:

    from sklearn.feature_extraction import DictVectorizer
        
    restructured = []
    for key in movie_features_dict:
        data_dict = {}
        for feat in movie_features_dict[key]:
            data_dict[feat] = 1
        restructured.append(data_dict)
    
    dictvectorizer = DictVectorizer(sparse=True)
    matrix_item_features = dictvectorizer.fit_transform(restructured)
    print(f"Item-feature matrix shape: {matrix_item_features.shape}")
    

    You can take a view here and here to have a better understanding of how DictVectorizer works.