I have a text file containing an upper 'triangular' matrix, with the lower values omitted (here's an example):
3 5 3 5 1 8 1 6 5 8
5 8 1 1 6 2 9 6 4
2 0 5 2 1 0 0 3
2 2 5 1 0 1 0
1 3 6 3 6 1
4 2 4 3 7
4 0 0 1
0 1 8
2 1
1
Since the file in question is ~10000 lines long, I was wondering if there is a 'smart' way to generate a numpy
matrix from it, e.g. using the genfromtxt
function. However, using it directly throws an error along the lines of
Line #12431 (got 6 columns instead of 12437)
and using filling_values
won't work, as there is no way to designate the missing-value placeholders.
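For reference, the direct call I mean is roughly this (a representative example, not my exact invocation):
arr = np.genfromtxt('data.txt')  # fails: later rows have fewer columns than the first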
Right now I have to resort to manually opening and closing the file:
import numpy as np

def load_updiag(filename, size):
    output = np.zeros((size, size))
    line_count = 0
    with open(filename) as f:
        for line in f:
            data = line.split()
            output[line_count, line_count:size] = data
            line_count += 1
    return output
I feel this is probably not very scalable for large file sizes.
Is there a way to properly use genfromtxt
(or any other optimized function from numpy's library) on such matrices?
You can read the raw data from the file into a string, and then use np.fromstring
to get a 1-d array of the upper triangular part of the matrix:
import numpy as np

with open('data.txt') as data_file:
    data = data_file.read()

arr = np.fromstring(data, sep=' ')
Alternatively, you can define a generator to read one line of your file at a time, then use np.fromiter
to read a 1-d array from this generator:
def iter_data(path):
    with open(path) as data_file:
        for line in data_file:
            yield from line.split()

arr = np.fromiter(iter_data('data.txt'), int)
If you know the size of the matrix (which you can determine from the first line of the file), you can specify the count
keyword argument of np.fromiter
so that the function will pre-allocate exactly the right amount of memory, which will be faster. That's what these functions do:
def iter_data(fileobj):
    for line in fileobj:
        yield from line.split()

def read_triangular_array(path):
    with open(path) as fileobj:
        n = len(fileobj.readline().split())
    count = int(n*(n+1)/2)
    with open(path) as fileobj:
        return np.fromiter(iter_data(fileobj), int, count=count)
This "wastes" a little work, since it opens the file twice to read the first line and get the count of entries. An "improvement" would be to save the first line and chain it with the iterator over the rest of the file, as in this code:
from itertools import chain

def iter_data(fileobj):
    for line in fileobj:
        yield from line.split()

def read_triangular_array(path):
    with open(path) as fileobj:
        first = fileobj.readline().split()
        n = len(first)
        count = int(n*(n+1)/2)
        data = chain(first, iter_data(fileobj))
        return np.fromiter(data, int, count=count)
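Either version is used the same way; for example, with the sample data saved as data.txt (the same filename assumed in the earlier snippets):
>>> arr = read_triangular_array('data.txt')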
All of these approaches yield
>>> arr
array([ 3., 5., 3., 5., 1., 8., 1., 6., 5., 8., 5., 8., 1.,
1., 6., 2., 9., 6., 4., 2., 0., 5., 2., 1., 0., 0.,
3., 2., 2., 5., 1., 0., 1., 0., 1., 3., 6., 3., 6.,
1., 4., 2., 4., 3., 7., 4., 0., 0., 1., 0., 1., 8.,
2., 1., 1.])
This compact representation might be all you need, but if you want the full square matrix you can allocate a zeros matrix of the right size and copy arr into it using np.triu_indices_from (a sketch of that is shown after the output below), or you can use scipy.spatial.distance.squareform:
>>> from scipy.spatial.distance import squareform
>>> squareform(arr)
array([[ 0., 3., 5., 3., 5., 1., 8., 1., 6., 5., 8.],
[ 3., 0., 5., 8., 1., 1., 6., 2., 9., 6., 4.],
[ 5., 5., 0., 2., 0., 5., 2., 1., 0., 0., 3.],
[ 3., 8., 2., 0., 2., 2., 5., 1., 0., 1., 0.],
[ 5., 1., 0., 2., 0., 1., 3., 6., 3., 6., 1.],
[ 1., 1., 5., 2., 1., 0., 4., 2., 4., 3., 7.],
[ 8., 6., 2., 5., 3., 4., 0., 4., 0., 0., 1.],
[ 1., 2., 1., 1., 6., 2., 4., 0., 0., 1., 8.],
[ 6., 9., 0., 0., 3., 4., 0., 0., 0., 2., 1.],
[ 5., 6., 0., 1., 6., 3., 0., 1., 2., 0., 1.],
[ 8., 4., 3., 0., 1., 7., 1., 8., 1., 1., 0.]])
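For the np.triu_indices_from route mentioned above, a minimal sketch could look like the following. It assumes arr holds the upper triangle of an n-by-n matrix with the diagonal included, i.e. arr.size == n*(n+1)/2 (n == 10 for the sample data); adjust if your file stores the strict upper triangle instead.
n = int((np.sqrt(8 * arr.size + 1) - 1) / 2)    # recover n from n*(n+1)/2 == arr.size
full = np.zeros((n, n))
full[np.triu_indices_from(full)] = arr          # copy arr into the upper triangle, row by row
full = full + full.T - np.diag(full.diagonal()) # mirror it if you want a symmetric matrix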