python pandas numpy vaex

ValueError: setting an array element with a sequence in Vaex dataframe


I've been given a CSV file from a previous project and I'm supposed to prepare some Python scripts to plot the values it contains. The dataset in this CSV file holds data from electric and vibration signals. The data I'm interested in is stored in a column, "DecompressedValue", where each row holds a 16,000-element array of float values representing a vibration or electric signal.

I want to use Vaex to exploit its higher performance features, but I found what I think is a bug when processing the signals. I started adapting code which works in Pandas.

import pandas as pd
import json 
signal_df = pd.read_csv('csv_test.csv', sep=';')
# The DecompressedValue column, despite holding an array, is read as one long string, so json.loads() has to be applied to each value of the column to turn it into a list
signal_df.DecompressedValue = signal_df.DecompressedValue.apply(lambda r: json.loads(r))

However, when I replicate the same functionality in Vaex, the code itself runs without complaint, but accessing the dataframe afterwards produces an error (vaex_test.csv is the file used to test this code).

import json
import vaex

test = vaex.from_csv('vaex_test.csv', sep=';')
test['DecompressedValue'] = test['DecompressedValue'].apply(lambda r: json.loads(r))
test.head()

This produces a ValueError:

[12/19/24 12:50:48] ERROR    error evaluating: DecompressedValue at rows 0-5    dataframe.py:4101
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "c:\Users\user\AppData\Local\anaconda3\envs\py310env\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\Users\user\AppData\Local\anaconda3\envs\py310env\lib\site-packages\vaex\expression.py", line 1629, in _apply
    result = np.array(result)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.
"""

I've reviewed other questions about the same error, but I don't think they apply: they usually concern plain NumPy arrays, and my problem seems more related to how Vaex itself behaves.


Solution

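  • Pandas and Vaex store columns differently. The traceback shows Vaex's apply passing its results to np.array, and NumPy raises this error when the parsed lists do not all have the same length. A Pandas column simply holds each list as a Python object.

    To get the lists from your CSV file into a Vaex DataFrame as lists rather than strings, one option is to let Pandas do the parsing and then convert with vaex.from_pandas: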

    import json
    import pandas as pd
    import vaex

    # Pass sep=';' here if the file is semicolon-separated, as in the question.
    test_pd = pd.read_csv('vaex_test.csv')
    # Parse each JSON string into a Python list.
    test_pd['DecompressedValue'] = test_pd['DecompressedValue'].apply(json.loads)

    test = vaex.from_pandas(test_pd)

    print(test.head())
    print(type(test['DecompressedValue']))
    print(test[3])        # 4th row
    print(test[3][0])     # 4th list from the CSV
    print(test[3][0][0])  # first value of that list

    Output:

      #  DecompressedValue
      0  '[-0.004518906585872173, -0.004478906746953726, ...
      1  '[-0.0005845219711773098, -0.0002945219748653471...
      2  '[-0.006645397283136845, -0.006435397081077099, ...
      3  '[0.003976251929998398, 0.0019852519035339355, 0...
      4  '[0.003452450269833207, 0.0017284504137933254, 0...

    <class 'vaex.expression.Expression'>

    [[0.003976251929998398, 0.0019852519035339355, ...

    [0.003976251929998398, 0.0019852519035339355, ...

    0.003976251929998398
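
To see where the ValueError in the question comes from, the failing call in the traceback (np.array on the results of apply) can be reproduced with plain NumPy and no Vaex at all. The sketch below is only an illustration: raw_rows is a made-up stand-in for the CSV column, and to_object_array is a hypothetical helper, not part of NumPy or Vaex. It assumes the parsed lists differ in length, which is what "inhomogeneous shape" in the error message indicates.

```python
import json

import numpy as np

# Hypothetical stand-in for the CSV column: JSON strings of unequal length.
raw_rows = ["[0.1, 0.2, 0.3]", "[0.4, 0.5]", "[0.6]"]
parsed = [json.loads(r) for r in raw_rows]

# Differing lengths are what make the requested shape "inhomogeneous".
lengths = [len(p) for p in parsed]

try:
    # Same call the traceback shows inside Vaex's _apply.
    np.array(parsed)
    ragged_error = None
except ValueError as exc:
    # Recent NumPy releases raise here; older ones only emitted a warning.
    ragged_error = str(exc)


def to_object_array(rows):
    """Pack lists of any length into a 1-D object array, one list per element."""
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = row
    return out


column = to_object_array(parsed)
print(lengths, column.shape, column.dtype)
print(ragged_error)
```

Checking lengths on the real data is a quick way to confirm whether the rows really are all the same size. The object array only shows how ragged lists can be held in NumPy; whether Vaex handles such a column well is not something this sketch demonstrates, so the from_pandas route above remains the tested path.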