Here is a snippet of a large data set I am working with:
# p* T* P* U* P*_cs U*_cs Steps dt*
0.1 6.0 0.6499(6) -0.478(2) 0.6525 -0.452 30000 0.002
0.2 6.0 1.442(1) -0.942(2) 1.452 -0.890 30000 0.002
0.3 6.0 2.465(3) -1.376(1) 2.489 -1.298 30000 0.002
0.4 6.0 3.838(5) -1.785(3) 3.880 -1.681 20000 0.002
0.5 6.0 5.77(1) -2.131(3) 5.84 -2.000 20000 0.002
0.6 6.0 8.51(2) -2.382(5) 8.60 -2.225 20000 0.002
0.7 6.0 12.43(2) -2.501(4) 12.56 -2.318 20000 0.002
0.8 6.0 18.05(2) -2.416(4) 18.22 -2.207 20000 0.002
0.9 6.0 26.00(2) -2.058(4) 26.21 -1.823 20000 0.004
1.0 6.0 37.06(3) -1.361(6) 37.32 -1.100 20000 0.002
1.1 6.0 52.25(2) -0.216(4) 52.57 0.072 20000 0.002
1.2 6.0 72.90(5) 1.502(9) 73.28 1.816 20000 0.002
1.25 6.0 85.71(5) 2.612(8) 86.12 2.939 20000 0.002
Loading this data set with np.loadtxt fails because of the uncertainties attached to the P* and U* values. Is there a built-in tool for handling this, so I can avoid editing the data files by hand?
I am looking at the uncertainties package as a possible solution, but I wonder whether numpy already has something for this.
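For reference, a minimal reproduction of the failure (a sketch assuming the snippet above is saved as data.txt; the exact error text varies with the numpy version):

import numpy as np

# loadtxt chokes on the first value(uncertainty) token, e.g.
# ValueError: could not convert string to float: '0.6499(6)'
data = np.loadtxt('data.txt')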
In [1]: txt=b"""# p* T* P* U* P*_cs U*_cs Steps dt*
...: 0.1 6.0 0.6499(6) -0.478(2) 0.6525 -0.452 30000 0.002
...: 0.2 6.0 1.442(1) -0.942(2) 1.452 -0.890 30000 0.002
...: 0.3 6.0 2.465(3) -1.376(1) 2.489 -1.298 30000 0.002"""
In [2]: txt=txt.splitlines()
txt is now a file substitute (a list of bytestrings in Python 3).
In [3]: data=np.genfromtxt(txt, dtype=None, names=True)
In [4]: data
Out[4]:
array([(0.1, 6.0, b'0.6499(6)', b'-0.478(2)', 0.6525, -0.452, 30000, 0.002),
(0.2, 6.0, b'1.442(1)', b'-0.942(2)', 1.452, -0.89, 30000, 0.002),
(0.3, 6.0, b'2.465(3)', b'-1.376(1)', 2.489, -1.298, 30000, 0.002)],
dtype=[('p', '<f8'), ('T', '<f8'), ('P', 'S9'), ('U', 'S9'), ('P_cs', '<f8'), ('U_cs', '<f8'), ('Steps', '<i4'), ('dt', '<f8')])
'P' and 'U' are loaded as strings because they can't be parsed as numbers.
Now define a converter that strips off the (...) part (again working with bytestrings):
def rmvpar(astr):
    # keep only the value before '(' and drop the uncertainty digits
    return float(astr.split(b'(')[0])
In [9]: data=np.genfromtxt(txt, dtype=None, names=True,
converters={2:rmvpar, 3:rmvpar})
In [10]: data
Out[10]:
array([(0.1, 6.0, 0.6499, -0.478, 0.6525, -0.452, 30000, 0.002),
(0.2, 6.0, 1.442, -0.942, 1.452, -0.89, 30000, 0.002),
(0.3, 6.0, 2.465, -1.376, 2.489, -1.298, 30000, 0.002)],
dtype=[('p', '<f8'), ('T', '<f8'), ('P', '<f8'), ('U', '<f8'), ('P_cs', '<f8'), ('U_cs', '<f8'), ('Steps', '<i4'), ('dt', '<f8')])
Now those two fields are floats.
But converters can't return two numbers, so I can't keep the uncertainty this way.
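If keeping the uncertainties as first-class values is the goal, the uncertainties package mentioned in the question can parse this notation directly. A sketch (assuming that package is installed), applied to the byte-string fields from the first genfromtxt call:

from uncertainties import ufloat_fromstr

# ufloat_fromstr understands the shorthand: '0.6499(6)' -> 0.6499+/-0.0006
raw = np.genfromtxt(txt, dtype=None, names=True)   # 'P' and 'U' come back as bytes
P = [ufloat_fromstr(tok.decode()) for tok in raw['P']]

The result is a plain list of ufloat objects; wrapping it in np.array gives an object-dtype array, not a fast float array.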
Another approach is to pass the lines through a filter function:
def splt(astr):
    # split the line into tokens and rejoin them comma-separated,
    # expanding 'value(unc)' tokens into 'value,unc'
    strs = astr.split()
    def foo(astr):
        if b'(' in astr:
            astr = astr.strip(b')').split(b'(')
            return b','.join(astr)
        return astr
    return b','.join([foo(a) for a in strs])
In [26]: [splt(line) for line in txt]
Out[26]:
[b'#,p*,T*,P*,U*,P*_cs,U*_cs,Steps,dt*',
b'0.1,6.0,0.6499,6,-0.478,2,0.6525,-0.452,30000,0.002',
b'0.2,6.0,1.442,1,-0.942,2,1.452,-0.890,30000,0.002',
b'0.3,6.0,2.465,3,-1.376,1,2.489,-1.298,30000,0.002']
To use this I have to skip the header, because the filtered lines now have two extra columns and the names from the header no longer line up:
In [28]: data=np.genfromtxt([splt(line) for line in txt], delimiter=',',dtype=None, skip_header=1)
In [29]: data
Out[29]:
array([(0.1, 6.0, 0.6499, 6, -0.478, 2, 0.6525, -0.452, 30000, 0.002),
(0.2, 6.0, 1.442, 1, -0.942, 2, 1.452, -0.89, 30000, 0.002),
(0.3, 6.0, 2.465, 3, -1.376, 1, 2.489, -1.298, 30000, 0.002)],
dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<i4'),
('f4', '<f8'), ('f5', '<i4'), ('f6', '<f8'), ('f7', '<f8'),
('f8', '<i4'), ('f9', '<f8')])
But I could modify the original dtype to make those two fields (sub)arrays:
In [30]: dt=np.dtype([('p', '<f8'), ('T', '<f8'), ('P', '<f8',(2,)),
('U', '<f8',(2,)), ('P_cs', '<f8'), ('U_cs', '<f8'),
('Steps', '<i4'), ('dt', '<f8')])
In [31]: data = np.genfromtxt((splt(line) for line in txt), delimiter=',',dtype=dt, skip_header=1)
In [32]: data
Out[32]:
array([(0.1, 6.0, [0.6499, 6.0], [-0.478, 2.0], 0.6525, -0.452, 30000, 0.002),
(0.2, 6.0, [1.442, 1.0], [-0.942, 2.0], 1.452, -0.89, 30000, 0.002),
(0.3, 6.0, [2.465, 3.0], [-1.376, 1.0], 2.489, -1.298, 30000, 0.002)],
dtype=[('p', '<f8'), ('T', '<f8'), ('P', '<f8', (2,)), ('U', '<f8', (2,)),
('P_cs', '<f8'), ('U_cs', '<f8'), ('Steps', '<i4'), ('dt', '<f8')])
Such a field would look like:
In [33]: data['P']
Out[33]:
array([[ 0.6499, 6. ],
[ 1.442 , 1. ],
[ 2.465 , 3. ]])
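One caveat (not addressed by the filter above): the parenthesized digits are an uncertainty in the value's last decimal places, so the raw 6 stored for 0.6499(6) really means 0.0006. A variant of foo, sketched here, applies that scaling while filtering:

def foo_scaled(astr):
    # sketch: turn b'0.6499(6)' into b'0.6499,0.0006' by scaling the
    # parenthesized digits by the value's number of decimal places
    if b'(' in astr:
        val, unc = astr.strip(b')').split(b'(')
        ndec = len(val.split(b'.')[1]) if b'.' in val else 0
        return val + b',' + str(float(unc) / 10**ndec).encode()
    return astr

Swapping foo_scaled in for foo inside splt makes the second column of each subarray the absolute uncertainty instead of the raw digit count.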
I could define other dtypes, just as long as the total number of columns matches.
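For example (an untested sketch), a flat dtype with separate error fields instead of subarrays also has the required ten columns:

dt2 = np.dtype([('p', '<f8'), ('T', '<f8'),
                ('P', '<f8'), ('P_err', '<f8'),
                ('U', '<f8'), ('U_err', '<f8'),
                ('P_cs', '<f8'), ('U_cs', '<f8'),
                ('Steps', '<i4'), ('dt', '<f8')])
data = np.genfromtxt((splt(line) for line in txt), delimiter=',',
                     dtype=dt2, skip_header=1)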
With a file, rather than these text lines, I would use something like (not tested; note the file must be opened for reading in binary mode, 'rb', so the lines come through as bytestrings):

with open(filename, 'rb') as f:
    data = np.genfromtxt((splt(line) for line in f), delimiter=',',
                         dtype=dt, skip_header=1)
Here, and above, I'm using the generator expression (splt(line) for line in x), though a list comprehension would work just as well. Any code that opens the file and yields/returns the modified lines will do, as in the sketch below.
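For instance (a sketch; data.txt is a placeholder filename), a small generator function that does the filtering:

def filtered_lines(filename):
    # open the file in binary mode and yield each line after filtering,
    # so genfromtxt sees the comma-separated bytestrings
    with open(filename, 'rb') as f:
        for line in f:
            yield splt(line)

data = np.genfromtxt(filtered_lines('data.txt'), delimiter=',',
                     dtype=dt, skip_header=1)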