python performance optimization scipy sparse-matrix

Efficient way to fill a 2D array in Python

I have 3 arrays: "words", an array of 5,000,000 pairs, each with an "id" and a "word"; "ids", an array of 13,000 unique ids; and "dict", an array of 500,000 unique words (the dictionary). This is my code:

matrix = sparse.lil_matrix((len(ids), len(dict)))
for i in words:
    matrix[ids.index(i['id']), dict.index(i['word'])] += 1.0

But it runs too slowly (I still don't have a matrix after 15 hours of running). Any ideas on how to optimize my code?

Solution

First of all, don't name your array dict: it is confusing, and it shadows the built-in type dict.

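The problem here is that you're doing everything in effectively quadratic time: list.index is a linear scan, so every iteration of the loop searches up to 13,000 ids and up to 500,000 words, and over 5,000,000 iterations that adds up to something on the order of 10^12 comparisons. Convert your arrays dict and ids to dictionaries first, where each word or id maps to its index, so that each lookup takes constant time on average.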

from scipy import sparse

# Build each lookup table once; a dictionary lookup is O(1) on average.
dict_from_dict = {word: ind for ind, word in enumerate(dict)}
dict_from_id = {id_: ind for ind, id_ in enumerate(ids)}

matrix = sparse.lil_matrix((len(ids), len(dict)))
for i in words:
    matrix[dict_from_id[i['id']], dict_from_dict[i['word']]] += 1.0
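
If the per-element updates on the lil_matrix are still a bottleneck, one possible variation is to collect all row and column indices first and build the matrix in a single step. The sketch below is only an illustration, not part of the original answer: the function name build_count_matrix, the names id_to_row, word_to_col and vocab, and the tiny sample data are all made up for the example. It relies on the documented SciPy behavior that duplicate (row, column) entries in a coo_matrix are summed when it is converted to CSR.

```python
import numpy as np
from scipy import sparse


def build_count_matrix(words, ids, vocab):
    """Return a CSR matrix whose (r, c) entry counts pairs (ids[r], vocab[c])."""
    # Hypothetical helper names; each lookup table is built once.
    id_to_row = {id_: row for row, id_ in enumerate(ids)}
    word_to_col = {word: col for col, word in enumerate(vocab)}

    n = len(words)
    rows = np.fromiter((id_to_row[w["id"]] for w in words), dtype=np.int64, count=n)
    cols = np.fromiter((word_to_col[w["word"]] for w in words), dtype=np.int64, count=n)
    data = np.ones(n, dtype=np.float64)

    # One entry per pair; repeated (row, col) pairs are kept as duplicates here.
    coo = sparse.coo_matrix((data, (rows, cols)), shape=(len(ids), len(vocab)))
    # Converting to CSR sums the duplicates, which yields the counts.
    return coo.tocsr()


# Tiny made-up data, standing in for the real arrays.
ids = [10, 20, 30]
vocab = ["apple", "pear", "plum"]
words = [
    {"id": 10, "word": "apple"},
    {"id": 10, "word": "apple"},
    {"id": 20, "word": "plum"},
    {"id": 30, "word": "pear"},
]

matrix = build_count_matrix(words, ids, vocab)
print(matrix.toarray())
```

The result is a CSR matrix, which suits arithmetic and row slicing; if you need to keep modifying individual entries afterwards, matrix.tolil() converts it back.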