python, hdf5, large-data, vaex

Plot large data with vaex


I've been struggling to plot a CSV file with millions of lines. I am trying to use the vaex module, but I'm stuck.

import vaex

# convert the large CSV to HDF5 format and open it
df = vaex.open("mydir/cov2.csv", convert='hdf5')
df.head()

Output

#   chr     pos cov index    
<i style='opacity: 0.6'>0</i>   NC_024468.2 1.34986e+08 6   0     
<i style='opacity: 0.6'>1</i>   NC_024468.2 1.34986e+08 6   1       
<i style='opacity: 0.6'>2</i>   NC_024468.2 1.34986e+08 6   2

The CSV is converted to HDF5 and loaded, but there are now 2 indexes, one of them with weird HTML formatting. When I try to plot it as shown in the documentation and in the solution benchmarked in this thread:

df.plot_widget(df.pos, df.cov)    

I get a ValueError:

ValueError: <bound method DataFrame.cov of      
#          chr          pos        cov    index      
0          NC_024468.2  134986302  6      0       
1          NC_024468.2  134986303  6      1      
...        ...          ...        ...    ...      
2,704,117  NC_024468.2  137690419  0      2704117        
2,704,118  NC_024468.2  137690420  0      2704118 > is not of string or Expression type, but <class 'method'>

The fix for that was to use df.col.cov or df["cov"] instead, because df.cov resolves to the DataFrame.cov method rather than to the column. However, plot_widget now produces an empty output:

  PlotTemplate(components={'main-widget': VBox(children=(VBox(children=(Figure(axes=[Axis(color='#666', grid_col…

  Plot2dDefault(w=None, what='count(*)', x='pos', y='cov', z=None)
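
For anyone hitting the same ValueError, the name clash can be reproduced without vaex. The sketch below is only a toy illustration: MiniFrame is a hypothetical stand-in class, not vaex's implementation, and it just assumes that column attributes are resolved through a fallback lookup. It shows why attribute access returns a method when a column shares its name, while item access still returns the column.

```python
class MiniFrame:
    """Hypothetical toy dataframe; NOT how vaex is implemented."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def cov(self):
        """A regular method that happens to share its name with a column."""
        return "covariance method"

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal attribute lookup fails,
        # so a method named like a column always takes precedence.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def __getitem__(self, name):
        # Item access goes straight to the column store, with no clash.
        return self._columns[name]


df = MiniFrame({"pos": [1, 2, 3], "cov": [0, 0, 7]})

pos_by_attribute = df.pos   # no method called "pos", so this is the column
cov_by_attribute = df.cov   # the bound method, not the column
cov_by_item = df["cov"]     # the column

print(type(cov_by_attribute).__name__)
print(cov_by_item)
```

This is why df["cov"] is the safer spelling whenever a column name might coincide with a method name.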

Can anyone help me?

Kind regards, Ricardo

EDIT

Here is a sample of the CSV data. The pos column increases by 1 on every row (137 million rows), and cov is almost always 0 but rises to between 1 and 400 in some regions:

chr,pos,cov
NC_024468.2,1,0
NC_024468.2,2,0
NC_024468.2,3,0
.....
NC_024468.2,137690418,7
NC_024468.2,137690419,6
NC_024468.2,137690420,6

Solution

  • There are many issues here: