I am trying to write a program that converts a CSV file to a very specific binary output. The file must be written in big-endian byte order with a mix of data types, both unsigned integers and floats. I have successfully imported the CSV into a pandas dataframe.
Here is the sample data:
val1,val2,val3,val4
1234567890,10000,1,0.839792631
And here is the code I am using:
import numpy as np
import pandas as pd
inputfilename = r"test_csv.csv"
df = pd.read_csv(inputfilename)
datatype = np.dtype([
    ('val1', '>u4'),
    ('val2', '>u2'),
    ('val3', 'u1'),
    ('val4', '>f4')])
data = df.to_numpy(dtype=datatype)
outputfilename = r"output_py_1.dat"
with open(outputfilename, mode='wb') as fileobj:
    data.tofile(fileobj)
I've written code to do this same thing in Matlab and verified it in a hex editor. The correct output is:
49 96 02 D2 27 10 01 3F 56 FC A6 00
However, Python outputs many extraneous bytes and repeats some bytes, and I don't understand why.
49 96 02 D2 02 D2 D2 4E 93 2C 06 00 00 27 10 27 10 10 46 1C 40 00 00 00 00 01 00 01 01 3F 80 00 00 00 00 00 00 00 00 00 3F 56 FC A6 F2
Is there some way I can get the output correct?
I also suspect the issue has something to do with the conversion to NumPy, since printing data shows a bunch of extra numbers whose origin I can't explain:
array([[(1234567890, 722, 210, 1.234568e+09),
( 10000, 10000, 16, 1.000000e+04),
( 1, 1, 1, 1.000000e+00),
( 0, 0, 0, 8.397926e-01)]],
dtype=[('val1', '>u4'), ('val2', '>u2'), ('val3', 'u1'), ('val4', '>f4')])
It turns out a NumPy array can only have one dtype, so .to_numpy(datatype) cast each individual scalar into a complete four-field record -- hence the (1, 4) array of records above: four records per row instead of one record per row. Writing that array to disk then produced four records' worth of bytes, which is where all the extras came from.
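The broadcast can be reproduced without pandas at all: casting a plain 2-D array to a structured dtype turns every scalar into a full record. A minimal sketch using the same dtype and sample row as above:

```python
import numpy as np

dt = np.dtype([('val1', '>u4'), ('val2', '>u2'), ('val3', 'u1'), ('val4', '>f4')])

# One CSV row as a plain (1, 4) float array, like the dataframe's values
values = np.array([[1234567890, 10000, 1, 0.839792631]])

# Each scalar is cast into a complete four-field record
arr = values.astype(dt)
print(arr.shape)   # (1, 4) -- four records, not one
print(arr.nbytes)  # 44 bytes (4 records x 11 bytes), not 11
```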
Since pandas dataframes are backed by NumPy arrays anyway, the fix is to specify the dtype when reading the CSV, then pull a structured record array out of the dataframe with to_records and write that to binary.
import numpy as np
import pandas as pd
inputfilename = r"test_csv.csv"
datatype = np.dtype([
    ('val1', '>u4'),
    ('val2', '>u2'),
    ('val3', 'u1'),
    ('val4', '>f4')])
# Apply the per-column dtypes at read time instead of casting afterwards
df = pd.read_csv(inputfilename, dtype=datatype)
# to_records() returns a structured record array, one record per row
dataonly = df.to_records(index=False)
outputfilename = r"output_py_1.dat"
with open(outputfilename, mode='wb') as fileobj:
    dataonly.tofile(fileobj)
Edit: One more note -- if the fields come back in native (little-endian) byte order despite the big-endian dtype spec, swap the bytes before writing:
import sys
if sys.byteorder == 'little':
    dataonly = dataonly.byteswap()
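As a sanity check that is independent of pandas, a single record built directly with the same dtype produces the expected byte sequence (the 11 data bytes of the Matlab-verified hex dump above):

```python
import numpy as np

dt = np.dtype([('val1', '>u4'), ('val2', '>u2'), ('val3', 'u1'), ('val4', '>f4')])

# One row from the CSV as a single structured record (note the tuple)
rec = np.array([(1234567890, 10000, 1, 0.839792631)], dtype=dt)
print(rec.tobytes().hex(' ').upper())
# 49 96 02 D2 27 10 01 3F 56 FC A6
```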