Tags: python, numpy, h5py

Applying names and formats to numpy array


I'm trying to combine two arrays: one of shape (5000, 2) containing integers and one of shape (5000, 7) containing floats. I need to assign names that act as column headers when the data is written to HDF5. However, when I try to assign names and data types, every column in the array gets repeated 9 times.

My code is as follows:

namesList = ['EID', 'Domain', 'LAM ID1', 'LAM ID2','LAM ID3', 'LAM ID4', 'LAM ID5', 'LAM ID6', 'LAM ID7']
formatsList = ['int', 'int', 'float', 'float', 'float', 'float', 'float', 'float', 'float']

ds_dt = np.dtype({'names':namesList, 'formats':formatsList})

Final_Lam_Strength = np.concatenate((LAM_Strength_RFs_Data, LAM_Strength_RFs), axis=1).astype(ds_dt)

Thanks


Solution
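
As background on the symptom in the question, here is a minimal sketch of what astype() does with a structured dtype. The 3-row arrays and the variable names ints, floats, combined and cast are hypothetical stand-ins, not the original data. np.concatenate returns a plain 2-D float array, and astype() converts every scalar element into a complete 9-field record, so the result keeps its (rows, 9) shape and each value is copied into all 9 fields.

```python
import numpy as np

# Hypothetical 3-row stand-ins for the (5000, 2) and (5000, 7) arrays in the question.
ints = np.arange(6).reshape(3, 2)
floats = np.arange(21, dtype=float).reshape(3, 7)

names = ['EID', 'Domain'] + [f'LAM ID{i}' for i in range(1, 8)]
formats = ['int', 'int'] + ['float'] * 7
ds_dt = np.dtype({'names': names, 'formats': formats})

combined = np.concatenate((ints, floats), axis=1)  # plain float array, shape (3, 9)
cast = combined.astype(ds_dt)                      # every element becomes a 9-field record

print(combined.shape, cast.shape)  # both shapes are still 2-D
print(cast[0, 1])                  # a single element now holds 9 copies of one value
```

The cast does not map column k onto field k, which is why the approaches below fill each named field explicitly.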

  • Load data directly to HDF5
    If your goal is to write the data to HDF5 with h5py, there is no need to duplicate it in another array first. Create the dataset with the compound dtype, then write each column into its named field. The procedure is shown below with some simple data I created:

    import h5py
    import numpy as np
    
    namesList = ['EID', 'Domain', 'LAM ID1', 'LAM ID2', 'LAM ID3', 'LAM ID4', 'LAM ID5', 'LAM ID6', 'LAM ID7']
    formatsList = ['int', 'int', 'float', 'float', 'float', 'float', 'float', 'float', 'float']
    
    ds_dt = np.dtype({'names': namesList, 'formats': formatsList})
    
    # the simple data: an integer block and a float block with the same number of rows
    nrows, nints, nfloats = 5, 2, 7
    LAM_Strength_RFs_Data = np.arange(nrows*nints).reshape(nrows, nints)
    LAM_Strength_RFs = np.arange(nrows*nfloats, dtype=float).reshape(nrows, nfloats)
    
    with h5py.File('SO_77346149.h5', 'w') as h5f:
        # one record per row; the compound dtype supplies the column names
        ds = h5f.create_dataset('Final_Lam_Strength', shape=(nrows,), dtype=ds_dt)
        for i in range(nints):
            ds[namesList[i]] = LAM_Strength_RFs_Data[:, i]
        for i in range(nfloats):
            # float fields follow the integer fields in namesList
            ds[namesList[i+nints]] = LAM_Strength_RFs[:, i]
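
    To confirm that the field names were stored as column headers, the dataset can be read back. The following is a minimal sketch, not part of the original answer: the two-field dtype, the dataset name 'demo' and the temporary file name are hypothetical and chosen only to keep the check short.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical two-field dtype, used only for this round-trip check.
dt = np.dtype({'names': ['EID', 'LAM ID1'], 'formats': ['int', 'float']})

# Write to a throwaway file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'readback_demo.h5')

with h5py.File(path, 'w') as h5f:
    ds = h5f.create_dataset('demo', shape=(3,), dtype=dt)
    ds['EID'] = np.array([10, 20, 30])
    ds['LAM ID1'] = np.array([0.5, 1.5, 2.5])

with h5py.File(path, 'r') as h5f:
    ds = h5f['demo']
    print(ds.dtype.names)  # field names stored with the dataset
    eid = ds['EID']        # read a single field as a plain NumPy array
    table = ds[:]          # read the whole dataset as a structured array

print(eid)
print(table['LAM ID1'])
```

    Reading with ds[:] returns a NumPy structured array, so the same field names can be used on the in-memory copy.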
    

    Create NumPy structured array
    Now, if you really need a NumPy array, create it with np.empty(), setting the shape to the number of rows and the dtype to ds_dt. Then load the data using the named fields and column references.

    This continues with the data from the example above:

    Final_Lam_Strength = np.empty(shape=(nrows,),dtype=ds_dt)
    print(Final_Lam_Strength.dtype, Final_Lam_Strength.shape)
    
    for i in range(nints):
        Final_Lam_Strength[namesList[i]] = LAM_Strength_RFs_Data[:,i]
    
    for i in range(nfloats):   
        Final_Lam_Strength[namesList[i+2]] = LAM_Strength_RFs[:,i]
    
    print(Final_Lam_Strength[0]) # first row
    print(Final_Lam_Strength[-1]) # last row
    print(Final_Lam_Strength['Domain']) # 'Domain' column
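
    If you would rather keep the concatenate step from the question, NumPy's numpy.lib.recfunctions.unstructured_to_structured can map the columns of a plain 2-D array onto the fields of a structured dtype. This is a different technique from the loops above, shown as a sketch with hypothetical stand-in data (ints, floats, combined and structured are illustration names).

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical stand-ins for the integer and float blocks.
ints = np.arange(10).reshape(5, 2)
floats = np.arange(35, dtype=float).reshape(5, 7) / 10

names = ['EID', 'Domain'] + [f'LAM ID{i}' for i in range(1, 8)]
formats = ['int', 'int'] + ['float'] * 7
ds_dt = np.dtype({'names': names, 'formats': formats})

# Concatenating promotes the integer columns to float64.
combined = np.concatenate((ints, floats), axis=1)  # shape (5, 9)

# Column k of the last axis becomes field k; the first two fields are cast back to int.
structured = rfn.unstructured_to_structured(combined, dtype=ds_dt)

print(structured.shape)
print(structured[0])
print(structured['Domain'])
```

    One caveat with this route: the integers pass through float64 on the way, so IDs larger than 2**53 would not survive the round trip exactly. Filling the fields one at a time avoids that conversion.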
    

    Create NumPy record array
    The function np.core.records.fromarrays (also available under the public name np.rec.fromarrays) is mentioned in the comments above. For completeness, here is an alternative method that uses it. You do NOT need to create an empty array before calling this function.

    arrayList = [LAM_Strength_RFs_Data[:,0], LAM_Strength_RFs_Data[:,1]] + \
                [LAM_Strength_RFs[:,i] for i in range(nfloats)]
                 
    # np.rec.fromarrays is the public alias of np.core.records.fromarrays (np.core is deprecated in NumPy 2.x)
    Final_Lam_Strength = np.rec.fromarrays(arrayList, dtype=ds_dt)
    print(Final_Lam_Strength.dtype, Final_Lam_Strength.shape)
    print(Final_Lam_Strength[0]) # first row
    print(Final_Lam_Strength[-1]) # last row
    print(Final_Lam_Strength.Domain) # access 'Domain' column by attribute name
    

    Notes on structured vs record arrays
    The first NumPy method creates a structured array, and the second creates a record array. They are similar but not identical: record arrays also allow access to a field (column) using dot notation. The example print statements above show the difference. Also, the fromarrays approach requires an intermediate data structure (the list of arrays). This won't be a problem with your data, but could be an issue if you have a lot of data (say 10E6 rows and 200 columns).
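
    The difference between the two array types can also be seen by viewing a structured array as a record array. The sketch below uses a hypothetical three-field dtype and made-up values. It assumes only the standard ndarray.view(np.recarray) conversion.

```python
import numpy as np

# Hypothetical three-field dtype with made-up values.
dt = np.dtype({'names': ['EID', 'Domain', 'LAM ID1'],
               'formats': ['int', 'int', 'float']})
structured = np.zeros(3, dtype=dt)
structured['Domain'] = [7, 8, 9]

# Viewing as np.recarray reinterprets the same memory; no data is copied.
records = structured.view(np.recarray)

print(type(structured).__name__, type(records).__name__)
print(records.Domain)                 # dot notation works on the record array
print(records['LAM ID1'])             # names containing spaces still need brackets
print(hasattr(structured, 'Domain'))  # the structured array has no such attribute
```

    Because several of the field names here contain spaces (for example 'LAM ID1'), dot notation is only usable for 'EID' and 'Domain'. Bracket indexing works for every field on both array types.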