pandas machine-learning scipy scikit-learn sparse-matrix

How can I convert index vectors into sparse feature vectors that can be used in sklearn?


I am building a news recommendation system and need a table of users and the news items they have read. My raw data looks like this:

001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]

The first column is the userID and the second column is a list of newsIDs. Each newsID is a column index: for example, [12,456,157] in the first row means that this user has read the 12th, 456th and 157th news items, so in the sparse vector the 12th, 456th and 157th columns are 1 and all other columns are 0. I want to convert this data into a sparse vector format that can be used as input to the KMeans or DBSCAN algorithms in sklearn. How can I do that?


Solution

  • One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO format and then convert it to CSR format.

    from scipy.sparse import coo_matrix
    
    input_data = [
        ("001436800277225", [12,456,157]),
        ("009092130698762", [248]),
        ("010003000431538", [361,521,83]),
        ("010156461231357", [173,67,244])    
    ]
    
    NUMBER_MOVIES = 1000 # number of columns; must be greater than the largest movie (news) index in the data
    NUMBER_USERS = len(input_data) # number of users in the model
    
    # you'll probably want a way to look up the row index for a given user id.
    user_row_map = {}
    user_row_index = 0
    
    # structures for coo format
    I,J,data = [],[],[]
    for user, movies in input_data:
    
        if user not in user_row_map:
            user_row_map[user] = user_row_index
            user_row_index+=1
    
        for movie in movies:
            I.append(user_row_map[user])
            J.append(movie)
            data.append(1)  # one view of this movie by this user; duplicate (row, column) entries are summed on conversion to CSR
    
    # create the matrix in COO format, then convert it to CSR, which supports efficient row slicing and arithmetic
    feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
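
To connect this back to the question, here is a minimal sketch of passing such a CSR matrix to KMeans and DBSCAN. It rebuilds a small matrix from the same toy data so that it runs on its own. The names `n_cols` and `X` and the parameter values (`n_clusters=2`, `eps=2.5`, `min_samples=2`) are assumptions chosen for illustration only; they would need tuning on real data.

```python
from scipy.sparse import coo_matrix
from sklearn.cluster import DBSCAN, KMeans

# Toy data in the same (user_id, [news indices]) shape as above.
input_data = [
    ("001436800277225", [12, 456, 157]),
    ("009092130698762", [248]),
    ("010003000431538", [361, 521, 83]),
    ("010156461231357", [173, 67, 244]),
]

# The column count must exceed the largest index, so derive it from the data.
n_cols = max(idx for _, items in input_data for idx in items) + 1

# Row and column coordinates of every non-zero entry (COO layout).
rows, cols = [], []
for row, (_, items) in enumerate(input_data):
    rows.extend([row] * len(items))
    cols.extend(items)

X = coo_matrix(
    ([1.0] * len(rows), (rows, cols)),
    shape=(len(input_data), n_cols),
).tocsr()

# Both estimators accept the sparse matrix directly; no dense conversion needed.
# Parameter values here are illustrative assumptions, not recommendations.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=2.5, min_samples=2).fit_predict(X)

print(X.shape, X.nnz)
print(kmeans_labels)
print(dbscan_labels)
```

Each label array has one entry per row of `X`, in the same order as `input_data`, so a dictionary like `user_row_map` above can map a cluster label back to a user id. Deriving the column count from the data avoids an out-of-range index if the hard-coded constant turns out to be too small.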