Tags: python, pandas, hdf5, vaex

Columns not showing in HDF5 file


I have a large dataset (1.3 billion rows) that I want to visualize with Vaex. Since the dataset was very big as CSV (around 130 GB across 520 separate files), I merged the files into a single HDF5 file with the pandas DataFrame.to_hdf function (format: table, appending each CSV file). If I use the pandas.read_hdf function to load a slice of the data, there is no problem.
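
For reference, the merge and the slice read looked roughly like this (the paths, the HDF5 key, and the column names are placeholders):

    import glob
    import pandas as pd

    # append each CSV to a single PyTables-format (format="table") HDF5 file
    for filename in sorted(glob.glob("csv/*.csv")):
        chunk = pd.read_csv(filename)  # columns x, y, z
        chunk.to_hdf("merged.hdf5", key="table", format="table", append=True)

    # reading back a slice works fine in pandas
    pd.read_hdf("merged.hdf5", key="table", start=0, stop=1000)  # produces the frame shown below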

                    x          y          z
    0    -8274.591528  36.053843  24.766887
    1    -8273.229203  34.853409  21.883050
    2    -8289.577896  15.326737  26.041516
    3    -8279.589741  27.798428  26.222326
    4    -8272.836821  37.035071  24.795912
    ..            ...        ...        ...
    995  -8258.567634   3.581020  23.955874
    996  -8270.526953   4.373765  24.381293
    997  -8287.429578   1.674278  25.838418
    998  -8250.624879   4.884777  21.815401
    999  -8287.115655   1.100695  25.931318

    1000 rows × 3 columns

This is how it looks: I can access any column I want, and the shape is (1000, 3), as it should be. However, when I try to load the HDF5 file with the vaex.open function, things look wrong.
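
The call is roughly this (the filename is again a placeholder for my merged file):

    import vaex

    galacto = vaex.open("merged.hdf5")
    galacto  # shows the repr below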

    #              table
    0              '(0, [-8274.59152784, 36.05384262, 24.7668...
    1              '(1, [-8273.22920299, 34.85340869, 21.8830...
    2              '(2, [-8289.5778959 , 15.32673748, 26.0415...
    3              '(3, [-8279.58974054, 27.79842822, 26.2223...
    4              '(4, [-8272.83682085, 37.0350707 , 24.7959...
    ...            ...
    1,322,286,736  '(2792371, [-6781.56835851, 2229.30828904, -6...
    1,322,286,737  '(2792372, [-6781.71119626, 2228.78749838, -6...
    1,322,286,738  '(2792373, [-6779.3251589 , 2227.46826613, -6...
    1,322,286,739  '(2792374, [-6777.26078082, 2229.49535808, -6...
    1,322,286,740  '(2792375, [-6782.81758335, 2228.87820639, -6...

This is what I get. The shape is (1322286741, 1), and the only column is 'table'. When I index the Vaex-imported data as galacto[0]:

    [(0, [-8274.59152784,    36.05384262,    24.76688728])]

In the pandas-imported data these are the x, y, z values of the first row. When I tried to inspect the data in another program, it also gave an error saying that no data was found. So I think the problem is that pandas appends to the HDF5 file row by row, and that layout doesn't work in other programs. Is there a way I can fix this issue?


Solution

  • HDF5 is as flexible as, say, JSON or XML, in that you can store data in any way you want. Vaex has its own way of storing the data (you can check the structure with the h5ls utility; it's very simple) that does not align with how pandas/PyTables stores it.

    Vaex stores each column as a single contiguous array, which is optimal if you don't work with all columns, and which makes it easy to memory-map the data to a (real) numpy array. PyTables stores the fields of each row (at least those of the same type) next to each other, meaning that if you compute the mean of the x column, you effectively read over all the data (a small h5py sketch of the two layouts follows the code below).

    Since the PyTables HDF5 file is probably already much faster to read than the CSVs, I suggest you do the following (not tested, but it should get the point across):

    import vaex
    import pandas as pd
    import glob

    # make sure the directory "vaex" exists
    for filename in glob.glob("pandas/*.hdf5"):  # assuming your files live there
        pdf = pd.read_hdf(filename)
        df = vaex.from_pandas(pdf)  # now df is a vaex dataframe
        df.export(filename.replace("pandas", "vaex"), progress=True)  # same data in Vaex's own format

    df = vaex.open("vaex/*.hdf5")  # the files will be opened and concatenated
    # Don't access df.x.values: it's not a 'real' numpy array but a lazily
    # concatenated column, so accessing it would require a memory copy.
    # If you need that (and for extra performance), you can optionally do:
    # df.export("big.hdf5", progress=True)
    # df_single = vaex.open("big.hdf5")
    # df_single.x.values  # this should reference the original data on disk (no memory copy)
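
    To make the layout difference concrete, here is a minimal sketch of the two storage schemes written directly with h5py; the file and dataset names are illustrative and do not match the exact internal paths Vaex or PyTables use:

    import h5py
    import numpy as np

    x = np.random.rand(1_000)
    y = np.random.rand(1_000)

    # column-wise layout (Vaex-style): one contiguous array per column
    with h5py.File("columnar.hdf5", "w") as f:
        f.create_dataset("columns/x", data=x)
        f.create_dataset("columns/y", data=y)

    # row-wise layout (PyTables-style): one structured record per row
    records = np.rec.fromarrays([x, y], names="x,y")
    with h5py.File("rowwise.hdf5", "w") as f:
        f.create_dataset("table", data=records)

    # computing mean(x) from the columnar file touches only x's bytes
    with h5py.File("columnar.hdf5", "r") as f:
        mean_col = f["columns/x"][:].mean()
    with h5py.File("rowwise.hdf5", "r") as f:
        mean_row = f["table"]["x"].mean()  # same value; x is interleaved with y on disk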
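
    Once everything is in a single Vaex file, computations run out-of-core over the memory-mapped columns, e.g. (assuming the optional big.hdf5 export above):

    df_single = vaex.open("big.hdf5")
    print(df_single.x.mean(progress=True))  # computed in chunks, never loading the full dataset into RAM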