I am trying to read a large .npy file in C#. To do that I am using the NumSharp NuGet package.
The file holds a 7 GB jagged float array (float[][]): ~1 million vectors, each with 960 dimensions.
Note: to be more specific, the data I use is the GIST dataset from the following link: Approximate Nearest Neighbors Large datasets.
The following is the method I use to load the data, but it fails with an exception:
private static void ReadNpyVectorsFromFile(string pathPrefix, out List<float[]> candidates)
{
    var npyFilename = @$"{pathPrefix}.npy";
    var v = np.load(npyFilename); // NDArray
    candidates = v
        .astype(np.float32)
        .ToJaggedArray<float>()
        .OfType<float[]>()
        .Select(a => a.OfType<float>().ToArray())
        .ToList();
}
The exception is:
Exception thrown: 'System.OverflowException' in NumSharp.dll
An unhandled exception of type 'System.OverflowException' occurred in NumSharp.dll
Arithmetic operation resulted in an overflow.
How can I work around this?
The NumSharp package has a limitation when the file is too big. See the comments/answers below for more explanation. I added one answer with a suggested workaround.
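To make the "too big" part concrete, here is a rough, hypothetical pre-check sketch (not NumSharp API, just plain file-size arithmetic against the .NET per-object limit; the path is the example one used in the split step further below):

// Hypothetical pre-check (not part of NumSharp): .NET will not allocate a single object
// larger than 2,147,483,591 bytes, so a ~7 GB .npy cannot be materialized as one buffer.
// Requires: using System; using System.IO;
const long MaxObjectBytes = 2_147_483_591;
var npyFilename = @"C:\temp\input\GIST.1m.npy";   // example path, adjust to your own file
long fileBytes = new FileInfo(npyFilename).Length;
if (fileBytes > MaxObjectBytes)
    Console.WriteLine($"{fileBytes:N0} bytes exceeds the {MaxObjectBytes:N0}-byte per-object limit; split the file first.");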
However, a good alternative is to save the data as .npz (refer to numpy.savez()); then the following package can do the job:
https://github.com/matajoh/libnpy
Code sample:
// npzFilename points to the archive produced by numpy.savez()
NPZInputStream npz = new NPZInputStream(npzFilename);
var keys = npz.Keys();                 // names of the arrays stored in the archive
//var header = npz.Peek(keys[0]);      // optionally peek at the header first
var t = npz.ReadFloat32(keys[0]);      // read the first array as float32
Debug.Assert(t.DataType == DataType.FLOAT32);
The issue is that the NumSharp data structure is a heavy RAM consumer, and it seems the C# GC is not aware of what NumSharp is allocating, so it hits the RAM limit very fast.
So, to overcome this, I split the input .npy file so that each part does not exceed the maximum object allocation allowed in .NET (2,147,483,591 bytes). In my case I split it into 5 files (~200k vectors each).
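As a quick sanity check on the split size, a back-of-the-envelope sketch (assuming 200,000 vectors of 960 float32 values per part, as described above):

// Rough per-part size after splitting into 5 files (assumes 200,000 vectors x 960 float32 each)
const long VectorsPerPart = 200_000;
const long Dimensions = 960;
const long MaxObjectBytes = 2_147_483_591;
long partBytes = VectorsPerPart * Dimensions * sizeof(float);   // 768,000,000 bytes (~0.77 GB)
Console.WriteLine($"{partBytes:N0} bytes per part vs. {MaxObjectBytes:N0} allowed");
// 768,000,000 < 2,147,483,591, so each part can be loaded on its own,
// whereas the full 1,000,000 x 960 float32 buffer (~3.84 GB) cannot.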
The Python part that splits the large .npy file:
import numpy as np

infile = r'C:\temp\input\GIST.1m.npy'
data = np.load(infile)
size = data.shape[0]          # number of vectors (~1 million)

# create 5 files
incr = int(size / 5)

# the +1 is to handle any leftovers
r = range(0, int(size / incr) + 1)
for i in r:
    print(i)
    start = i * incr
    stop = min(start + incr, size)
    if start >= size:
        break
    np.save(infile.replace('.npy', f'.{i}.npy'), data[start:stop])
Now, in C#, the code looks as follows:
private static void ReadNpyVectorsFromFile(string pathPrefix, out List<float[]> candidates)
{
    candidates = new List<float[]>();

    // TODO:
    // For now I am assuming there are 10 files maximum...
    // This can be improved by scanning the input folder and
    // collecting all the relevant files.
    foreach (var i in Enumerable.Range(0, 10))
    {
        var npyFilename = @$"{pathPrefix}.{i}.npy";
        Console.WriteLine(npyFilename);
        if (!File.Exists(npyFilename))
            continue;

        // Each part is small enough for NumSharp to load on its own.
        var v = np.load(npyFilename); // NDArray
        var tempList = v
            .astype(np.float32)
            .ToJaggedArray<float>()
            .OfType<float[]>()
            .Select(a => a.OfType<float>().ToArray())
            .ToList();
        candidates.AddRange(tempList);
    }
}
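For completeness, a minimal usage sketch (the prefix matches the file names produced by the Python split above, i.e. the original name without the .npy extension; adjust the path to your own setup):

// The split step produced C:\temp\input\GIST.1m.0.npy ... GIST.1m.4.npy,
// so the prefix passed here is the original file name without ".npy".
ReadNpyVectorsFromFile(@"C:\temp\input\GIST.1m", out var candidates);
Console.WriteLine($"Loaded {candidates.Count} vectors, dimension = {candidates[0].Length}");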