I have a large data set (1.3 billion rows) that I want to visualize with Vaex. Since the data set was very big as CSV (around 130 GB spread over 520 separate files), I merged the files into one HDF5 file with the pandas dataframe.to_hdf function (format: table, appended for each CSV file). If I use the pandas.read_hdf function to load a slice of the data, there is no problem.
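Roughly what I did to build the file and read a slice back (the paths, the key name and the read_csv options here are just placeholders for what I actually used):

import glob
import pandas as pd

# append each of the 520 csv files to a single hdf5 file
for filename in glob.glob("csv/*.csv"):
    chunk = pd.read_csv(filename)  # columns x, y, z
    chunk.to_hdf("galacto.hdf5", key="table", format="table", append=True)

# loading a slice back with pandas works fine
df = pd.read_hdf("galacto.hdf5", key="table", start=0, stop=1000)
df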
x y z
0 -8274.591528 36.053843 24.766887
1 -8273.229203 34.853409 21.883050
2 -8289.577896 15.326737 26.041516
3 -8279.589741 27.798428 26.222326
4 -8272.836821 37.035071 24.795912
... ... ... ...
995 -8258.567634 3.581020 23.955874
996 -8270.526953 4.373765 24.381293
997 -8287.429578 1.674278 25.838418
998 -8250.624879 4.884777 21.815401
999 -8287.115655 1.100695 25.931318
1000 rows × 3 columns
This is what it looks like: I can access any column I want, and the shape is (1000, 3), as it should be. However, when I try to load the HDF5 file using the vaex.open function:
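(Again using the placeholder file name from the snippet above:)

import vaex

galacto = vaex.open("galacto.hdf5")
galacto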
# table
0 '(0, [-8274.59152784, 36.05384262, 24.7668...
1 '(1, [-8273.22920299, 34.85340869, 21.8830...
2 '(2, [-8289.5778959 , 15.32673748, 26.0415...
3 '(3, [-8279.58974054, 27.79842822, 26.2223...
4 '(4, [-8272.83682085, 37.0350707 , 24.7959...
... ...
1,322,286,736 '(2792371, [-6781.56835851, 2229.30828904, -6...
1,322,286,737 '(2792372, [-6781.71119626, 2228.78749838, -6...
1,322,286,738 '(2792373, [-6779.3251589 , 2227.46826613, -6...
1,322,286,739 '(2792374, [-6777.26078082, 2229.49535808, -6...
1,322,286,740 '(2792375, [-6782.81758335, 2228.87820639, -6...
This is what I get. The shape is (1322286741, 1) and the only column is 'table'. When I try to access the Vaex-imported HDF5 file as galacto[0]:
[(0, [-8274.59152784, 36.05384262, 24.76688728])]
In the pandas-imported data these are the x, y, z columns of the first row. When I tried to inspect the data in another program, it also gave an error saying no data was found. So I think the problem is that pandas appends to the HDF5 file row by row, and that layout doesn't work in other programs. Is there a way I can fix this issue?
HDF5 is as flexible as, say, JSON or XML, in that you can store data in any way you want. Vaex has its own way of storing the data (you can inspect the structure with the h5ls utility; it's very simple) that does not align with how Pandas/PyTables stores it.
Vaex stores each column as a single contiguous array, which is optimal if you don't work with all columns, and makes it easy to memory-map to a (real) numpy array. PyTables stores each row (at least rows of the same type) next to each other, meaning that if you compute the mean of the x column, you effectively read all the data.
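If you want to see the difference yourself, you can walk the tree of both kinds of files, e.g. with h5py (the file names below are placeholders and the exact group names depend on the pandas and Vaex versions; this just prints whatever structure is there):

import h5py

def dump(path):
    # print every group/dataset name, plus the shape for datasets
    with h5py.File(path, "r") as f:
        f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))

dump("pandas/file1.hdf5")  # one record-style table (all columns packed per row)
dump("vaex/file1.hdf5")    # one 1d dataset per column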
Since PyTables hdf5 is probably already much faster to read than CSV, I suggest you do the following (not tested, but it should get the point across):
import vaex
import pandas as pd
import glob
# make sure dir vaex exists
for filename in glob.glob("pandas/*.hdf5"): # assuming your files live there
    pdf = pd.read_hdf(filename)
    df = vaex.from_pandas(pdf)  # now df is a vaex dataframe
    df.export(filename.replace("pandas", "vaex"), progress=True)  # same data in vaex' own format
df = vaex.open("vaex/*.hdf5")  # the pieces will be (lazily) concatenated
# Don't access df.x.values: it's not a 'real' numpy array but a lazily
# concatenated column, so it would require a memory copy.
# If you need that (and for extra performance), you can optionally do:
# df.export("big.hdf5", progress=True)
# df_single = vaex.open("big.hdf5")
# df_single.x.values  # this references the original data on disk (no memory copy)
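After that last step, statistics stream over the memory-mapped columns instead of everything being loaded into RAM; for example (assuming your columns are still called x, y and z):

df_single = vaex.open("big.hdf5")
print(len(df_single))                   # ~1.3 billion rows
print(df_single.x.mean(progress=True))  # computed out of core
df_single["r"] = (df_single.x**2 + df_single.y**2 + df_single.z**2)**0.5  # virtual column, evaluated lazily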