python arrays list size memory-efficient

Python: How to efficiently implement large symmetric array with many empty entries

I am using a very large 2-dimensional square array whose entries are lists of integers. This is not written in stone, but say it's a w by w array, with w=15000, and each entry is a list of integers, with random size from 0 to 400. This has two characteristics:

It is symmetric
Many of its entries are empty lists, including the ones on the diagonal

Right now, I am only taking advantage of the first part and implement it like this pseudocode:

F=[ [ [] for i in range(w)] for j in range(w) ] # Initiate a blank w X w array
for i in range(w):
    for j in range(i):
    If (condition):
        F[i][j] = [A list of integers]

Then in the end, I assign the rest of the values as follows:

for i in range(w):
    for j in range(w):
        F[i][j]=F[i][j] # Reference, thus size of F does not increase much

I am not an experienced programmer, and I'm feeling this approach is probably not very efficient. In particular, I feel that I am not taking advantage of the fact that many entries are empty.

Can I have a more efficient list F such that it doesn't occupy any space for the entries that end up being empty?

Mind that it is still important that I have the "coordinates" x, y assigned correctly.

Solution

There are a variety of ways to store sparse matrices, and Wikipedia is a good place to start. Without knowing what you're planning to do with these, nobody can tell you which one to use.

For example, the Dictionary of Keys design is very simple to build in Python:

d = collections.defaultdict(list)
for i in range(w):
    for j in range(w):
        if condition:
            d[i, j] = [a list of integers]

That's it. Now, to retrieve or change a value later:

value = d[i, j]
d[i, j].append(3)

But if you ever want to, say, iterate all of the non-empty rows in order, then all of the non-empty columns of each row in order, this is going to be terribly inefficient. The naive solution would be to just loop over every i and j and look up d[i, j], but that's going to allocate a new empty list to fill in every single position. You could instead sort the keys, but that can be slow as well. You could keep around coordinate lists sorted in both orders, alongside the dictionary—but at that point, the dictionary is just adding extra overhead over just sticking a reference to the value in each coordinate list.

So, there are other alternatives that are better for different uses.

If you're willing to use NumPy/SciPy arrays, scipy.sparse handles 2D sparse arrays. Of course they're normally arrays of numbers, but you can use dtype=object with most of the types (although that does mean that "empty" slots will have None instead of [], and there are also some odd limitations on what works in those cases). (There's also some documentation somewhere (sorry, no link…) that explains how the sparse matrices are implemented and how to do the equivalent in pure Python/NumPy, which you could use to create a WxWx400 sparse array of ints.)