Tags: python, numpy, filesize, file-type, mat

File size increases after converting from .mat files to .txt files


I have a lot of .mat files that contain the radial parts of several wavefunctions, along with some other information about an atom. I successfully extracted the wavefunction part and used numpy.savetxt() to save it to a .txt file, but the file size increased considerably. Running du on a file before and after conversion gives:

    du -ch wfkt_X_rb87_n=40_L=11_J=0_step=0.001.mat
    440K    wfkt_X_rb87_n=40_L=11_J=0_step=0.001.mat
    du -ch wfkt_X_rb87_n=40_L=12_J=0_step=0.001.txt
    2,9M    wfkt_X_rb87_n=40_L=12_J=0_step=0.001.txt

Ignore the difference between L=11 and L=12: the wavefunctions are almost the same size, yet the file size increased by more than 6 times. I would like to know why, and ideally how to reduce the size of the .txt files. Here is the code I use to convert the files:

    import glob
    import os

    import numpy as np
    import scipy.io as sio

    for filet in glob.glob('wfkt_X_rb*.mat'):
        print(filet)
        mat = sio.loadmat(filet)
        wave = mat['wavefunction'][0]
        J = mat['J']
        L = mat['L']
        n = mat['n']
        # loadmat returns scalars as 2-D arrays, hence the [0][0]
        xmax = mat['xmax'][0][0]
        xmin = mat['xmin'][0][0]
        xstep = mat['xstep'][0][0]
        energy = mat['energy'][0][0]
        name = filet.replace('.mat', '.txt')
        name = name.replace('rb', 'Rb')
        # rebuild the x grid and write it next to the wavefunction values
        x = np.linspace(xmin, xmax, num=len(wave), endpoint=False)
        data = np.transpose([x, wave])
        np.savetxt(name, data)
        os.remove(filet)
        # append the scalar metadata as two trailing lines
        with open(name, "a") as f:
            f.write(str(energy) + " " + str(xstep) + "\n")
            f.write(str(xmin) + " " + str(xmax))

The required format of the data file is:

    2.700000000000000000e+01 6.226655250941872093e-04
    2.700099997457605738e+01 6.232789496263042460e-04
    2.700199994915211121e+01 6.238928333406641843e-04
    2.700299992372816860e+01 6.245071764542571872e-04
    2.700399989830422243e+01 6.251219791839867897e-04
    2.700499987288027981e+01 6.257372417466700075e-04
    2.700599984745633364e+01 6.263529643590372287e-04

If you need more information, feel free to ask! Thanks in advance.


Solution

  • .mat is a binary format, whereas numpy.savetxt() writes a plain text file. The binary representation of a double-precision number (IEEE 754) takes 8 bytes. By default, numpy writes each number as text in the format 0.000000000000000000e+00, which takes 24 characters, plus one more byte for the separator or newline.

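    A number of additional effects influence the resulting file size, e.g. the structural overhead of the file format, compression, and the format you use for writing the plain text (number of decimal digits). In your case, however, I suspect that the main effect is simply the difference between a binary and a plain text representation of the numbers. Note also that your script writes an x column generated with np.linspace() next to the wavefunction, so the text file holds twice as many numbers as the wavefunction array itself.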
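
To make the size difference concrete, here is a minimal sketch that measures the bytes per value for the raw binary representation and for numpy.savetxt() output. The array is synthetic random data (an assumption, not the wavefunction from the question), and the '%.8e' variant is only there to illustrate the effect of writing fewer digits.

```python
import io

import numpy as np

# Synthetic stand-in for the real data: two columns, like (x, wave) in the question.
rng = np.random.default_rng(0)
data = rng.random((1000, 2))

# Raw binary: 8 bytes per float64 value.
binary_bytes_per_value = data.nbytes / data.size


def text_bytes_per_value(arr, fmt):
    """Write arr with numpy.savetxt into memory and return bytes per value."""
    buf = io.BytesIO()
    np.savetxt(buf, arr, fmt=fmt)
    return len(buf.getvalue()) / arr.size


default_text = text_bytes_per_value(data, "%.18e")  # savetxt's default format
short_text = text_bytes_per_value(data, "%.8e")     # fewer digits, loses precision

print(binary_bytes_per_value, default_text, short_text)
```

With the default format, each value costs 25 bytes as text (24 characters plus a separator) against 8 bytes in binary. A two-column row is therefore 50 bytes, which is in line with the roughly sixfold increase reported in the question if the .mat file stores only the wavefunction values.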

    If you want to decrease the file size, you should use a different output format. Possible options are:

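    • Write a gzip-compressed text file:

      import gzip
      import numpy

      # myarray is the array to save; gzip.open (not open) does the compression
      with gzip.open('data.txt.gz', 'wb') as f:
          numpy.savetxt(f, myarray)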
      
    • Save as .mat again. See scipy.io.savemat()

    • Write the NumPy-specific binary format (.npy). See numpy.save()
    • Write the NumPy-specific compressed binary format (.npz). See numpy.savez_compressed()
    • If you have very large amounts of structured data, consider using the HDF5 file format.
    • If you need to write your own binary format, use struct.pack() and write the resulting bytes to a file.
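
As a rough way to compare some of these options, the sketch below writes the same array as plain text, gzip-compressed text, .npy and .npz, then records the file sizes. The data and file names are made up for illustration, and the actual compression ratio will depend on your data.

```python
import os
import tempfile

import numpy as np

# Synthetic two-column data standing in for (x, wave); not the real wavefunction.
x = np.linspace(27.0, 87.0, num=5000, endpoint=False)
data = np.column_stack([x, np.sin(x) * 1e-3])

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    # Plain text with savetxt's default format.
    txt_path = os.path.join(tmp, "data.txt")
    np.savetxt(txt_path, data)
    sizes["txt"] = os.path.getsize(txt_path)

    # Gzip-compressed text: savetxt compresses when the name ends in .gz.
    gz_path = os.path.join(tmp, "data.txt.gz")
    np.savetxt(gz_path, data)
    sizes["txt.gz"] = os.path.getsize(gz_path)

    # Uncompressed binary .npy file.
    npy_path = os.path.join(tmp, "data.npy")
    np.save(npy_path, data)
    sizes["npy"] = os.path.getsize(npy_path)

    # Compressed binary .npz archive; "data" is the key chosen for the array.
    npz_path = os.path.join(tmp, "data.npz")
    np.savez_compressed(npz_path, data=data)
    sizes["npz"] = os.path.getsize(npz_path)

    # Read everything back to check that the round trip works.
    from_gz = np.loadtxt(gz_path)
    from_npy = np.load(npy_path)
    with np.load(npz_path) as archive:
        from_npz = archive["data"]

for kind, size in sizes.items():
    print(kind, size)
```

The text variants stay readable by any tool that can handle (gzipped) text, while .npy and .npz need NumPy on the reading side, which ties into the question of who has to read the data afterwards.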

    Which option to choose depends on your situation: Who will have to read the data afterwards? How important is the compression factor? Is your data just one single array or is the structure more complex?