Search code examples
pythonnumpyfile-readfromfile

How to skip lines when reading a file with numpy.fromfile?


I am reading a .pksc file that contains coordinates and velocities of a large number of astronomical objects. I am doing the reading with

import numpy as np
f=open('halos_10x10.pksc') 
data = np.fromfile(f,count=N*10,dtype=np.float32)

The file can be found here. It is very large and I want to skip the first m objects (the first m lines corresponding to these objects, if there are lines in the file). How can I do this, I see nowhere an option for skipping? Optionally, it would be also nice to be able to skip the last k objects from the file. Tnx!


Solution

  • The first thing to appreciate is that your PKSC file is binary and is one continuous string of bytes with no discernible breaks in the data.

    Text files on the other hand have lines that are clearly delimited by some line-break character, so it's pretty easy to read a line at a time and ignore M lines at the front, then read the remaining number of lines you care about: REMAINING_LINES = ALL_LINES - M_LINES - K_LINES.

    np.fromfile() reads the binary file an item at a time.

    To do this, it needs the dtype= param to tell the reader how big an item is. In the case of a PKSC file, we represent an item as a 32-bit integer, np.int32.

    I searched and searched but could not find a spec for the file. Fortunately the link you provided has a sample Python script for reading the file; and I also found a well-documented Python lib for dealing with these kinds of files (websk.py, linked below).

    I've learned that the PKSC file has the following properties:

    • The first 3 items are header items:
      • the first header item is a count of the records of relevant data that follow the header items
    • each relevant data record contains 10 items

    np.fromfile() also takes the count= param as an instruction of exactly how many items to read.

    Here's how to read the 3 header items, get the total number of Halo records that follow, and read the first two records (10 items per record):

    Nitems_per_record = 10
    
    f = open('halos_10x10.pksc')
    
    headers = np.fromfile(f, dtype=np.int32, count=3)
    print(f'Headers: {headers}')
    print(f'This file contains {headers[0]} records with Halo data')
    
    record1 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
    print(f'First record:\n{record1}')
    
    record2 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
    print(f'Second record:\n{record2}')
    
    Headers: [2079516 2079516 2079516]
    This file contains 2079516 records with Halo data
    First record:
    [ 1170060708 -1011158654 -1006515961 -1022926100  1121164875  1110446585  1086444250  1170064687 -1011110709 -1006510502]
    Second record:
    [ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197  1084104229  1170090354 -1013985376 -1006510502]
    

    According to websky.py, the 2nd and 3rd header items also have relevant values, maybe you care about these too? I've synthesized the following from that code:

    RTHMAXin    = headers[1]
    redshiftbox = headers[2]
    

    Reading more than one record at a time requires re-shaping the data. To read 3 records:

    f = open('halos_10x10.pksc')
    
    np.fromfile(f, dtype=np.int32, count=3)  # reading, but ignoring header items
    
    three_records = np.fromfile(f, dtype=np.int32, count=3*Nitems_per_record)
    print(f'Initial:\n{three_records}')
    
    reshaped_records = np.reshape(three_records, (3, Nitems_per_record))
    print(f'Re-shaped:\n{reshaped}')
    
    Initial:
    [ 1170060708 -1011158654 -1006515961 -1022926100  1121164875  1110446585
      1086444250  1170064687 -1011110709 -1006510502  1170083367 -1013908122
     -1006498824 -1014626384 -1020456945 -1033004197  1084104229  1170090354
     -1013985376 -1006510502  1169622353 -1009409432 -1006678295 -1045415727
     -1017794908 -1051267742  1084874393  1169623221 -1009509109 -1006675510]
    Re-shaped:
    [[ 1170060708 -1011158654 -1006515961 -1022926100  1121164875  1110446585  1086444250  1170064687 -1011110709 -1006510502]
     [ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197  1084104229  1170090354 -1013985376 -1006510502]
     [ 1169622353 -1009409432 -1006678295 -1045415727 -1017794908 -1051267742  1084874393  1169623221 -1009509109 -1006675510]]
    

    So, what about skipping?

    Just trim the reshaped data

    The easiest thing to do would be read all the data and then trim from the front and back what you don't want:

    m = 1
    k = 1 * -1
    trimmed_records = reshaped_records[m:k]
    print(f'Trimmed:\n{trimmed_records}')
    
    Trimmed:
    [[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197  1084104229  1170090354 -1013985376 -1006510502]]
    

    I'm not sure why you want to skip, but that's the easiest to understand and implement.

    If you're conern is memory, read on.

    Discard M records, read less K+M records

    The next option, as I see it, is to:

    1. take the record count (A records) from the first header
    2. read and ignore M records
    3. figure out how many remaining records you have to read, given that you've already read M records and want to stop at record K: R = A - M - K

    You'll only save a little memory by ignoring M records; the data is still read and interpreted. You'll definitely save on memory by not reading records K to the end:

    f = open('halos_10x10.pksc')
    headers = np.fromfile(f, dtype=np.int32, count=3)
    
    Arecords = headers[0]
    Mrecords = 1_000_000
    Krecords = 1_000_000
    
    Nitems = Mrecords * Nitems_per_record
    np.fromfile(f, dtype=np.int32, count=Nitems)
    
    Rrecords = Arecords - Mrecords - Krecords  # Remaining records to read
    Nitems = Rrecords * Nitems_per_record
    data = np.fromfile(f, dtype=np.int32, count=Nitems)
    data = np.reshape(data, (Rrecords, Nitems_per_record))
    
    print(f'From {Arecords} to {Rrecords} records:\n{data.shape}')
    
    From 2079516 to 79516 records:
    (79516, 10)