I am reading a .pksc
file that contains coordinates and velocities of a large number of astronomical objects. I am doing the reading with
import numpy as np
f=open('halos_10x10.pksc')
data = np.fromfile(f,count=N*10,dtype=np.float32)
The file can be found here. It is very large and I want to skip the first m
objects (the first m
lines corresponding to these objects, if there are lines in the file). How can I do this, I see nowhere an option for skipping? Optionally, it would be also nice to be able to skip the last k
objects from the file. Tnx!
The first thing to appreciate is that your PKSC file is binary and is one continuous string of bytes with no discernible breaks in the data.
Text files on the other hand have lines that are clearly delimited by some line-break character, so it's pretty easy to read a line at a time and ignore M lines at the front, then read the remaining number of lines you care about: REMAINING_LINES = ALL_LINES - M_LINES - K_LINES
.
np.fromfile()
reads the binary file an item at a time.
To do this, it needs the dtype=
param to tell the reader how big an item is. In the case of a PKSC file, we represent an item as a 32-bit integer, np.int32
.
I searched and searched but could not find a spec for the file. Fortunately the link you provided has a sample Python script for reading the file; and I also found a well-documented Python lib for dealing with these kinds of files (websk.py, linked below).
I've learned that the PKSC file has the following properties:
np.fromfile()
also takes the count=
param as an instruction of exactly how many items to read.
Here's how to read the 3 header items, get the total number of Halo records that follow, and read the first two records (10 items per record):
Nitems_per_record = 10
f = open('halos_10x10.pksc')
headers = np.fromfile(f, dtype=np.int32, count=3)
print(f'Headers: {headers}')
print(f'This file contains {headers[0]} records with Halo data')
record1 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
print(f'First record:\n{record1}')
record2 = np.fromfile(f, dtype=np.int32, count=Nitems_per_record)
print(f'Second record:\n{record2}')
Headers: [2079516 2079516 2079516]
This file contains 2079516 records with Halo data
First record:
[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585 1086444250 1170064687 -1011110709 -1006510502]
Second record:
[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]
According to websky.py, the 2nd and 3rd header items also have relevant values, maybe you care about these too? I've synthesized the following from that code:
RTHMAXin = headers[1]
redshiftbox = headers[2]
Reading more than one record at a time requires re-shaping the data. To read 3 records:
f = open('halos_10x10.pksc')
np.fromfile(f, dtype=np.int32, count=3) # reading, but ignoring header items
three_records = np.fromfile(f, dtype=np.int32, count=3*Nitems_per_record)
print(f'Initial:\n{three_records}')
reshaped_records = np.reshape(three_records, (3, Nitems_per_record))
print(f'Re-shaped:\n{reshaped}')
Initial:
[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585
1086444250 1170064687 -1011110709 -1006510502 1170083367 -1013908122
-1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354
-1013985376 -1006510502 1169622353 -1009409432 -1006678295 -1045415727
-1017794908 -1051267742 1084874393 1169623221 -1009509109 -1006675510]
Re-shaped:
[[ 1170060708 -1011158654 -1006515961 -1022926100 1121164875 1110446585 1086444250 1170064687 -1011110709 -1006510502]
[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]
[ 1169622353 -1009409432 -1006678295 -1045415727 -1017794908 -1051267742 1084874393 1169623221 -1009509109 -1006675510]]
The easiest thing to do would be read all the data and then trim from the front and back what you don't want:
m = 1
k = 1 * -1
trimmed_records = reshaped_records[m:k]
print(f'Trimmed:\n{trimmed_records}')
Trimmed:
[[ 1170083367 -1013908122 -1006498824 -1014626384 -1020456945 -1033004197 1084104229 1170090354 -1013985376 -1006510502]]
I'm not sure why you want to skip, but that's the easiest to understand and implement.
If you're conern is memory, read on.
K+M
recordsThe next option, as I see it, is to:
A
records) from the first headerM
recordsM
records and want to stop at record K
: R = A - M - K
You'll only save a little memory by ignoring M
records; the data is still read and interpreted. You'll definitely save on memory by not reading records K
to the end:
f = open('halos_10x10.pksc')
headers = np.fromfile(f, dtype=np.int32, count=3)
Arecords = headers[0]
Mrecords = 1_000_000
Krecords = 1_000_000
Nitems = Mrecords * Nitems_per_record
np.fromfile(f, dtype=np.int32, count=Nitems)
Rrecords = Arecords - Mrecords - Krecords # Remaining records to read
Nitems = Rrecords * Nitems_per_record
data = np.fromfile(f, dtype=np.int32, count=Nitems)
data = np.reshape(data, (Rrecords, Nitems_per_record))
print(f'From {Arecords} to {Rrecords} records:\n{data.shape}')
From 2079516 to 79516 records:
(79516, 10)