Tags: python, python-3.x, numpy, matrix, large-data

How to split a huge matrix into chunks (submatrices), or otherwise handle a matrix that gives a memory error in numpy?


I have a really large matrix that simply won't fit into memory. The matrix I have to work with has 483798149136 elements, which means 483 billion floating-point numbers.

The approach I was thinking about was to somehow split this huge matrix into submatrices that fit into memory, perform pooling operations on those submatrices, and later join them all back to rebuild the original matrix, which will hopefully fit into memory after all the pooling operations.

Please correct me if I'm wrong; this approach is just an idea I came up with, and I don't know how good or bad it is. If you have any better alternative ideas, I'm open to any suggestions.
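
To make the idea concrete, here is only a minimal sketch of the kind of block-wise pooling I have in mind, run on a matrix that still fits in memory (pool_in_blocks is just an illustrative helper name; the same tiling idea would have to be applied to pieces loaded from disk one at a time):

import numpy as np

def pool_in_blocks(matrix, block_size, pool=np.mean):
    # Split `matrix` into block_size x block_size tiles, apply `pool`
    # to each tile, and return the much smaller pooled matrix.
    rows, cols = matrix.shape
    out = np.empty((rows // block_size, cols // block_size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            tile = matrix[i * block_size:(i + 1) * block_size,
                          j * block_size:(j + 1) * block_size]
            out[i, j] = pool(tile)
    return out

small = pool_in_blocks(np.arange(695556, dtype=float).reshape(834, 834), 6)
print(small.shape)  # (139, 139)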

The code to reproduce this matrix would be:

import numpy as np

a = np.arange(695556).reshape(834, 834)
# meshgrid over two 695556-element inputs produces two 695556 x 695556
# arrays, which is what triggers the memory error
np.meshgrid(a, a)

I have been reading this post and this post, among others on this same site, but none of them provides a true solution to this kind of problem; they just give vague suggestions.

My questions now are:

  1. Is my splitting and pooling approach feasible, or are there better ways of doing this?

  2. How (in code terms) could I split this matrix into pieces (like windows or multidimensional kernels) and rebuild it again later?

  3. Is there some way to process the matrix in chunks in numpy, so that I can later perform operations with the matrix such as multiplication, addition, etc.? (a rough np.memmap sketch follows this list)

  4. Is there a specific package in Python that helps with dealing with this kind of matrix problem?
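
Regarding question 3, something like np.memmap might be one way to keep the array on disk and only pull row blocks into RAM. The following is only a rough sketch with illustrative sizes, since at the real 695556 x 695556 size even a float32 memmap would need roughly 1.9 TB of disk:

import numpy as np

n = 10_000                      # illustrative size, not the real 695556
block = 1_000

# Disk-backed array; only the slices that are touched get loaded into RAM.
big = np.memmap("big_matrix.dat", dtype="float32", mode="w+", shape=(n, n))

# Fill it block by block.
for start in range(0, n, block):
    stop = min(start + block, n)
    big[start:stop] = np.random.rand(stop - start, n).astype("float32")

# Reductions can also be accumulated block by block.
col_sums = np.zeros(n)
for start in range(0, n, block):
    stop = min(start + block, n)
    col_sums += big[start:stop].sum(axis=0)

big.flush()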

EDIT

Since some users are asking about the goal of this whole operation, I'll provide some info:

I'm working on a 3D printing project. In the process, a laser beam melts metal powder to create complex metal pieces. These pieces are built in layers, and the laser melts the metal layer by layer.

I have 3 CSV files, each one containing an 834 x 834 matrix. The first matrix contains the coordinate values of the X axis as the laser beam moves through the powder bed and melts the metal, the second matrix is the same for the Y axis, and the third matrix represents the time the laser spends melting at the same pixel point. The values are expressed in seconds.

So I have the coordinates of the laser along the X and Y axes, and the time it takes to melt each point.

These matrices come from images of the sections of each manufactured piece.
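
For reference, this is roughly how the three CSV files can be loaded and stacked into one row per pixel (the file names here are just placeholders):

import numpy as np

# Placeholder file names; each CSV holds one 834 x 834 matrix.
x = np.loadtxt("laser_x.csv", delimiter=",")      # X coordinates
y = np.loadtxt("laser_y.csv", delimiter=",")      # Y coordinates
t = np.loadtxt("laser_time.csv", delimiter=",")   # dwell time in seconds

# One row per pixel: (x, y, t) -> shape (695556, 3)
features = np.column_stack([x.ravel(), y.ravel(), t.ravel()])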

The issue is that the temperature at a certain pixel, and the time the laser stays at that pixel, can have an influence on the n-th pixel when the laser gets there. So I want to create a distance matrix that tells me how different or similar each pixel of the image is to every other one, in terms of Euclidean distance.

This is why, if I have for instance two 834 x 834 matrices, I need to create a 695556 x 695556 matrix with the distances between every single point in the matrix and every other. And this is why it is so huge and will not fit into memory.

I don't know if I gave too much information, or if my explanations are messy. You can ask whatever you need and I'll try to clarify it, but the main point is that I need to build this huge distance matrix in order to know the mathematical distances between pixels, and then understand the relation between what's happening at a certain point of the piece while printing it and what needs to happen at other points to avoid manufacturing defects.
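
In case it helps clarify the goal, the following is only a sketch of the kind of block-wise distance computation I am after; scipy's cdist is one way to get the Euclidean distances for a slice of rows at a time without ever holding the full 695556 x 695556 matrix (distances_in_blocks is an illustrative helper, and the random data below is a small stand-in for the real features array):

import numpy as np
from scipy.spatial.distance import cdist

def distances_in_blocks(features, block=1_000):
    # Yield (row_start, block_of_distances) pairs instead of building
    # the full n x n distance matrix in memory at once.
    n = features.shape[0]
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Euclidean distances between this slice of rows and all rows:
        # shape (stop - start, n).
        yield start, cdist(features[start:stop], features)

# Example: per-pixel mean distance, accumulated without storing everything.
features = np.random.rand(10_000, 3)   # stand-in for the real (695556, 3) data
mean_dist = np.zeros(features.shape[0])
for start, d_block in distances_in_blocks(features):
    mean_dist[start:start + d_block.shape[0]] = d_block.mean(axis=1)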

Thank you very much in advance


Solution

  • After all, I figured out a way to solve my problem: these huge matrices can be handled easily using dask. Dask is a Python library for distributed computing that splits data into chunks to optimize memory usage. It's pretty handy since it lets you work with genuinely massive data at a really low computing and memory cost; obviously it is not as fast as in-memory computing, but I think many people will be glad to know about it.

    This package is well optimized and frequently updated. Best of all, it has numpy/pandas syntax: it works with dataframes in the same way as with arrays, and if you know pandas/numpy you will feel right at home with dask.

    You can create a dask distributed array like this:

    import numpy as np
    import dask.array as da

    # a lazy 695556 x 695556 array, split into 1000 x 1000 chunks
    Y = da.random.normal(size=(695556, 695556),
                         chunks=(1000, 1000))

    and then, you can perform some operations on it like this:

    y = Y.mean(axis=0)[0:100].compute()
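
    As a rough sketch (not exactly the code I ended up running), the pairwise Euclidean distance matrix itself can also be expressed lazily, assuming the three 834 x 834 matrices have been stacked into one (x, y, time) row per pixel; only the slices you actually .compute() are materialized:

    import numpy as np
    import dask.array as da

    # Small stand-in; the real per-pixel array would be (695556, 3).
    features = da.from_array(np.random.rand(50_000, 3), chunks=(2_000, 3))

    # Lazy pairwise Euclidean distances; nothing is computed yet.
    diff = features[:, None, :] - features[None, :, :]   # (n, n, 3), lazy
    dist = da.sqrt((diff ** 2).sum(axis=-1))              # (n, n), lazy

    # Only the chunks needed for this slice are actually materialized.
    corner = dist[:100, :100].compute()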
    

    Also, if you use the memory_profiler package you can monitor your memory usage and see how much memory these huge-data computations actually consume.
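
    For example, a quick way to check the peak memory of a dask computation with memory_profiler (the sizes here are just illustrative):

    from memory_profiler import memory_usage
    import dask.array as da

    def column_means():
        Y = da.random.normal(size=(100_000, 100_000), chunks=(1_000, 1_000))
        return Y.mean(axis=0)[0:100].compute()

    # Sample the process memory (in MiB) every 0.1 s while the function runs.
    samples = memory_usage((column_means, (), {}), interval=0.1)
    print(f"peak memory: {max(samples):.1f} MiB")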

    There are some practical examples I found very illustrative here.

    Also, the documentation explaining the library's array interface can be found here.

    And lastly, a guide about high-performance computing in Python 3.x can be found here.

    Hope this helps someone with this same issue.