I've been struggling to create a plot of a CSV with millions of lines. I am trying to use the vaex module, but I'm stuck.
import vaex
# converts and reads large csv into hdf5 format
df = vaex.open("mydir/cov2.csv", convert='hdf5')
df.head()
Output
# chr pos cov index
<i style='opacity: 0.6'>0</i> NC_024468.2 1.34986e+08 6 0
<i style='opacity: 0.6'>1</i> NC_024468.2 1.34986e+08 6 1
<i style='opacity: 0.6'>2</i> NC_024468.2 1.34986e+08 6 2
The CSV is converted to HDF5 and loaded, but there are now 2 indexes, one of them with weird HTML formatting. When I try to plot it as in the documentation and in the solution benchmarked in this thread:
df.plot_widget(df.pos, df.cov)
I get a value error.
ValueError: <bound method DataFrame.cov of
# chr pos cov index
0 NC_024468.2 134986302 6 0
1 NC_024468.2 134986303 6 1
... ... ... ... ...
2,704,117 NC_024468.2 137690419 0 2704117
2,704,118 NC_024468.2 137690420 0 2704118 > is not of string or Expression type, but <class 'method'>
The solution to that was to use df.col.cov or df["cov"] instead. However, the plot_widget method now gives me an empty output:
PlotTemplate(components={'main-widget': VBox(children=(VBox(children=(Figure(axes=[Axis(color='#666', grid_col…
Plot2dDefault(w=None, what='count(*)', x='pos', y='cov', z=None)
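On the earlier ValueError: the message says "bound method DataFrame.cov", so df.cov resolved to a method of the dataframe rather than to the column named cov. Below is a toy sketch of that kind of name clash in plain Python. ToyFrame is a hypothetical stand-in written only for illustration; it is not how vaex is implemented.

```python
class ToyFrame:
    """Hypothetical stand-in for a dataframe; NOT vaex's implementation."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def cov(self):
        # A method that happens to share its name with a column.
        return "covariance method"

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails,
        # so it never runs for "cov": the method above is found first.
        columns = self.__dict__.get("_columns", {})
        if name in columns:
            return columns[name]
        raise AttributeError(name)

    def __getitem__(self, name):
        # Item access looks only at the columns, so there is no clash.
        return self._columns[name]


frame = ToyFrame({"pos": [1, 2, 3], "cov": [0, 0, 7]})

print(callable(frame.cov))  # attribute access finds the method, not the column
print(frame.pos)            # no method named "pos", so the column is returned
print(frame["cov"])         # item access returns the column data
```

This is why item-style access such as df["cov"] is the safer habit for columns whose names match an existing method.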
Can anyone help me?
Kind regards, Ricardo
A sample of the CSV data. Column pos increases by 1 on every row (137 million rows), and cov is almost always 0 but rises to 1-400 in some areas:
chr,pos,cov
NC_024468.2,1,0
NC_024468.2,2,0
NC_024468.2,3,0
.....
NC_024468.2,137690418,7
NC_024468.2,137690419,6
NC_024468.2,137690420,6
There are many issues here:
The extra index column should go away if you pass copy_index=False when opening the file:
vaex.open('...', convert=True, copy_index=False)
I opened an issue for that, https://github.com/vaexio/vaex/issues/754, to change the default.