
How to efficiently zero pad datasets with different lengths


I want to zero-pad my data so that all of the per-server subsets have the same length. My data looks like this:

|server|      users     |      power     |   Throughput range   |  time |
|:----:|:--------------:|:--------------:|:--------------------:|:-----:|
| 0    |   [5, 3,4,1]   |   -4.2974843   |  [5.23243, 5.2974843]|   0   |                                                        
| 1    |   [8, 6,2,7]   |   -6.4528433   |  [6.2343, 7.0974845] |   1   |                                                                                                                              
| 2    |   [9,12,10,11] |   -3.5322451   |  [4.31240, 4.9073840]|   2   |                                         
| 3    |   [14,13,16,17]|   -5.9752843   |  [5.2243, 5.2974843] |   3   |                                            
| 0    |   [22,18,19,21]|   -1.2974652   |  [3.12843, 4.2474643]|   4   |                                           
| 1    |   [22,23,24,25]|   -9.884843    |  [8.00843, 8.0974843]|   5   |                                                                             
| 2    |   [27,26,28,29]|   -2.3984843   |  [7.23843, 8.2094845]|   6   |
| 3    |   [30,32,31,33]|   -4.5654566   |  [3.1233, 4.2474643] |   7   |
| 1    |   [36,34,37,35]|   -1.2974652   |  [3.12843, 4.2474643]|   8   |
| 2    |   [40,41,38,39]|   -3.5322451   |  [4.31240, 4.9073840]|   9   |
| 1    |   [42,43,45,44]|   -5.9752843   |  [6.31240, 6.9073840]|   10  |

I want to analyze each server separately using only its own rows, so I split the data with the code below:

import pandas as pd

# Select the rows belonging to each server with a boolean mask.
# Indexing a DataFrame with a mask already returns a DataFrame.
server0 = grp[grp['server'] == 0]
server1 = grp[grp['server'] == 1]
server2 = grp[grp['server'] == 2]
server3 = grp[grp['server'] == 3]
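
The same split can also be written without one variable per server. The following is a minimal sketch, assuming a small stand-in for `grp` that keeps only the `server`, `power` and `time` columns from the table above; the `servers` dictionary is a hypothetical name, not part of the original code.

```python
import pandas as pd

# Hypothetical stand-in for the question's `grp` frame (subset of columns).
grp = pd.DataFrame({
    "server": [0, 1, 2, 3, 0, 1, 2, 3, 1, 2, 1],
    "power": [-4.2974843, -6.4528433, -3.5322451, -5.9752843, -1.2974652,
              -9.884843, -2.3984843, -4.5654566, -1.2974652, -3.5322451,
              -5.9752843],
    "time": list(range(11)),
})

# groupby yields (key, sub-frame) pairs; collect them into a dict keyed by server id.
servers = {int(server_id): frame for server_id, frame in grp.groupby("server")}

# Row counts differ per server, which is why padding is needed later.
lengths = {server_id: len(frame) for server_id, frame in servers.items()}
print(lengths)
```

Here `servers[0]` plays the role of `server0`, and the approach keeps working if more servers appear in the data.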
     

This gives one DataFrame per server, each holding that server's rows and features. For example, server0 looks like this:

| server |      users     |      power     |   Throughput range   |  time |
|:------:|:--------------:|:--------------:|:--------------------:|:-----:|
| 0      |   [5, 3,4,1]   |   -4.2974843   |  [5.23243, 5.2974843]|   0   |                                                        
| 0      |   [22,18,19,21]|   -1.2974652   |  [3.12843, 4.2474643]|   4   |

The per-server DataFrames have different lengths, so I tried padding them using the code below:

from keras.preprocessing.sequence import pad_sequences

man = [server0, server1, server2, server3]
new = pad_sequences(man)
                  

The padding works and all the servers end up with equal length, but the output no longer contains the column names. I want the final data to keep its columns. Any suggestions?


Solution

  • My aim is to apply machine learning to the data, so I wanted the per-server datasets concatenated. This is what I later did, and it worked for my application.

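    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    
    man = [server0, server1, server2, server3]
    
    # Index each frame by time and drop the list-valued 'users' column.
    for cel in man:
        cel.set_index('time', inplace=True)
        cel.drop(['users'], axis=1, inplace=True)
    
    # Scaler is instantiated here but not applied in this snippet.
    scl = MinMaxScaler()
    # Reshape each frame's values to (n_rows, 1); this assumes a single remaining column per frame.
    vals = [cel.values.reshape(cel.shape[0], 1) for cel in man]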
    

    I then applied pad_sequences as follows, and it worked:

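    from keras.preprocessing.sequence import pad_sequences
    # pad_sequences defaults to dtype='int32'; float32 keeps fractional values such as power.
    new = pad_sequences(vals, dtype='float32')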
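
Because `pad_sequences` returns a plain NumPy array, the column names are lost in the result above. If the labels need to survive, one possible alternative is to pad in pandas itself. The following is a sketch under assumptions, not the method used above: the two small frames are hypothetical stand-ins with numeric columns only, `zero_pad` is an illustrative helper name, and the zero rows are appended at the end, whereas `pad_sequences` prepends them by default.

```python
import pandas as pd

# Hypothetical per-server frames of unequal length (numeric columns only).
server0 = pd.DataFrame({"power": [-4.2974843, -1.2974652], "time": [0, 4]})
server1 = pd.DataFrame({
    "power": [-6.4528433, -9.884843, -1.2974652, -5.9752843],
    "time": [1, 5, 8, 10],
})
frames = [server0, server1]

# Target length is the longest frame.
max_len = max(len(frame) for frame in frames)


def zero_pad(frame: pd.DataFrame, length: int) -> pd.DataFrame:
    """Append all-zero rows until the frame has `length` rows; columns are kept."""
    frame = frame.reset_index(drop=True)  # positional index 0..n-1
    return frame.reindex(index=range(length), fill_value=0)


padded = [zero_pad(frame, max_len) for frame in frames]

# Concatenate with an outer index level identifying the source frame.
combined = pd.concat(padded, keys=list(range(len(padded))), names=["server", "row"])
print(combined)
```

Each entry in `padded` is still a DataFrame with its original column names, and `combined.to_numpy()` can be used at the point where a bare array is required.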