NumPy: reading from a file with no delimiter but a fixed pattern

I tried searching for this question but I could not find answers that did not seem too complicated.

I am reading from a file whose only delimiters are spaces, and the columns are not fixed width. There are 15 columns: the first two are strings and the rest are floating-point numbers. The first two columns are the ones causing trouble.

I tried numpy's "genfromtxt" with an explicit dtype. However, some of the string entries are empty or contain numbers, so some lines are misread as having 15 or 17 entries.

Here is an example of a few lines.

NGC 104    47 Tuc       00 24 05.67  -72 04 52.6   305.89  -44.89    4.5   7.4   1.9  -2.6  -3.1
NGC 288                 00 52 45.24  -26 34 57.4   152.30  -89.38    8.9  12.0  -0.1   0.0  -8.9
NGC 362                 01 03 14.26  -70 50 55.6   301.53  -46.25    8.6   9.4   3.1  -5.1  -6.2
Whiting 1               02 02 57     -03 15 10     161.22  -60.76   30.1  34.5 -13.9   4.7 -26.3

How should I approach this? Should I reformat the text by rereading it and writing it back out as a CSV? Should I parse it with a regex? Or can I fix this command:

data = np.genfromtxt('PositionalData.txt', skiprows=0, missing_values=(' '), dtype=['S6','S6', 'f4', 'f4', 'f4', 'f4', 'f4', 'f4', 'f5','f4','f4', 'f4', 'f4', 'f4', 'f4'])

Thanks, help would be much appreciated.

edit:

Here is some output after using some fixed-width setting:

(' NG', 'C 1', 0.0, 4.0, nan, nan, nan, nan, 4.0, 7.0, nan, nan, nan, nan, nan)
(' NG', 'C 2', 8.0, 8.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan)
(' NG', 'C 3', 6.0, 2.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan)
(' Wh', 'iti', nan, nan, nan, 1.0, nan, nan, nan, nan, nan, nan, nan, nan, nan)
(' NG', 'C 1', 2.0, 6.0, 1.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan)
(' Pa', 'l 1', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan)

The command is:

data = np.genfromtxt('PositionalDataTest.txt', skiprows=0,delimiter=(3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), missing_values=(' '), dtype=['S7','S7', 'f4', 'f4', 'f4', 'f4', 'f4', 'f4', 'f5','f4','f4', 'f4', 'f4', 'f4', 'f4'])

The lines are:

NGC 104    47 Tuc       00 24 05.67  -72 04 52.6   305.89  -44.89    4.5   7.4   1.9  -2.6  -3.1
NGC 288                 00 52 45.24  -26 34 57.4   152.30  -89.38    8.9  12.0  -0.1   0.0  -8.9
NGC 362                 01 03 14.26  -70 50 55.6   301.53  -46.25    8.6   9.4   3.1  -5.1  -6.2
Whiting 1               02 02 57     -03 15 10     161.22  -60.76   30.1  34.5 -13.9   4.7 -26.3
NGC 1261                03 12 16.21  -55 12 58.4   270.54  -52.12   16.3  18.1   0.1 -10.0 -12.9
Pal 1                   03 33 20.04  79 34 51.8   130.06   19.03   11.1  17.2  -6.8   8.1   3.6

Solution

Consider this portion of the data file:

-72 04 52.6 
-26 34 57.4 
-70 50 55.6 
-03 15 10   
-55 12 58.4 
79 34 51.8

It can be parsed like this:

In [75]: np.genfromtxt('data2', delimiter=[3,3,5], dtype=None).tolist()
Out[75]: 
[(-72, 4, 52.6),
 (-26, 34, 57.4),
 (-70, 50, 55.6),
 (-3, 15, 10.0),
 (-55, 12, 58.4),
 (79, 34, 51.8)]

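The rest of the file could be parsed similarly; the difficulty is in finding the right column widths to use in delimiter.

Working out all of those widths is laborious, and I'd rather not, because the result is fragile. It is quite possible your data truly is not parseable using fixed-width columns: in the lines as posted, for instance, the Pal 1 row has no sign in front of 79, so every field after it sits one character to the left of the same field in the rows above.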

So instead let's shoot for a robust solution. np.genfromtxt can accept any iterable of strings as its first argument, so we can bring the full power of Python string manipulation to bear on the problem by defining a generator function that pre-processes the lines from the file. The price for this flexibility is extra Python-level work on every line (slicing, splitting, and joining), so on large files it will be slower than handing np.genfromtxt a simple delimiter or fixed-width specification directly.
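To see the first point in isolation, here is a minimal sketch, assuming a small in-memory list of lines (the values are made up for illustration) rather than a file:

```python
import numpy as np

# Hypothetical in-memory "file": any iterable of strings will do.
lines = ["1 2 3", "4 5 6"]

# With no delimiter given, genfromtxt splits each line on whitespace.
arr = np.genfromtxt(lines)
print(arr.shape)
```

A generator is consumed the same way, which is what makes per-line pre-processing possible.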

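import numpy as np

def process(iterable):
    """Yield each line re-joined with '@' between its 15 fields."""
    for line in iterable:
        # The two name columns are sliced by position (11 and 13 characters);
        # the remaining fields are whitespace-separated.
        parts = [line[:11], line[11:24]] + line[24:].split()
        yield '@'.join(parts)

# Open in text mode so that '@'.join receives str, not bytes, on Python 3.
with open('data', 'r') as f:
    data = np.genfromtxt(process(f), dtype=None, delimiter='@')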

print(repr(data))

yields

array([ ('NGC 104    ', '47 Tuc       ', 0, 24, 5.67, -72, 4, 52.6, 305.89, -44.89, 4.5, 7.4, 1.9, -2.6, -3.1),
       ('NGC 288    ', '             ', 0, 52, 45.24, -26, 34, 57.4, 152.3, -89.38, 8.9, 12.0, -0.1, 0.0, -8.9),
       ('NGC 362    ', '             ', 1, 3, 14.26, -70, 50, 55.6, 301.53, -46.25, 8.6, 9.4, 3.1, -5.1, -6.2),
       ('Whiting 1  ', '             ', 2, 2, 57.0, -3, 15, 10.0, 161.22, -60.76, 30.1, 34.5, -13.9, 4.7, -26.3),
       ('NGC 1261   ', '             ', 3, 12, 16.21, -55, 12, 58.4, 270.54, -52.12, 16.3, 18.1, 0.1, -10.0, -12.9),
       ('Pal 1      ', '             ', 3, 33, 20.04, 79, 34, 51.8, 130.06, 19.03, 11.1, 17.2, -6.8, 8.1, 3.6)], 
      dtype=[('f0', 'S11'), ('f1', 'S13'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<f8'), ('f5', '<i8'), ('f6', '<i8'), ('f7', '<f8'), ('f8', '<f8'), ('f9', '<f8'), ('f10', '<f8'), ('f11', '<f8'), ('f12', '<f8'), ('f13', '<f8'), ('f14', '<f8')])

Note that the process function uses '@' as the delimiter between columns. If the data contains '@' you will have to choose some other character for the delimiter.
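As a variation on the same idea, the sketch below joins the fields with a tab instead of '@' and passes autostrip=True so that the padding around the two name columns is dropped. It assumes the 11- and 13-character name columns seen in the sample, and it inlines three of the posted lines so that it runs without the data file. The SAMPLE list and the sep parameter are names introduced here purely for illustration.

```python
import numpy as np

# Three of the posted lines, copied as-is (illustration only).
SAMPLE = [
    "NGC 104    47 Tuc       00 24 05.67  -72 04 52.6   305.89  -44.89    4.5   7.4   1.9  -2.6  -3.1",
    "NGC 288                 00 52 45.24  -26 34 57.4   152.30  -89.38    8.9  12.0  -0.1   0.0  -8.9",
    "Pal 1                   03 33 20.04  79 34 51.8   130.06   19.03   11.1  17.2  -6.8   8.1   3.6",
]

def process(lines, sep="\t"):
    """Re-join each line with `sep` between its 15 fields."""
    for line in lines:
        # Slice the two name columns by position, split the rest on whitespace.
        parts = [line[:11], line[11:24]] + line[24:].split()
        yield sep.join(parts)

data = np.genfromtxt(
    process(SAMPLE),
    dtype=None,        # let genfromtxt infer a type per column
    delimiter="\t",    # must match the separator used in process()
    autostrip=True,    # strip surrounding whitespace from every field
    encoding="utf-8",  # ask for str rather than bytes in the name columns
)

print(data["f0"])  # first name column
print(data["f1"])  # second name column (empty where the file had no entry)
```

A tab is only safe if the raw lines contain no tabs themselves, so the same caveat applies as for '@'.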