I have a rather large (1.3 GB, unzipped) csv file, with 2 dense columns and 1.4 K sparse columns, about 1 M rows.
I need to make a pandas.DataFrame from it.
For small files I can simply do:
df = pd.read_csv('file.csv')
For the large file I have now, I get a memory error, clearly due to the DataFrame size (tested with sys.getsizeof(df)).
Based on this document:
https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating
it looks like I can make a DataFrame with mixed dense and sparse columns.
However, I can only see instructions to add individual sparse columns, not a whole chunk of them together from the csv file.
Reading the csv sparse columns one by one and adding them to df using:
for colname_i in names_of_sparse_columns:
    data = pd.read_csv('file.csv', usecols = [colname_i])
    df[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])
works, and df stays very small, as desired, but the execution time is absurdly long.
I tried of course:
pd.read_csv(path_to_input_csv, usecols = names_of_sparse_columns, dtype = "Sparse[float]")
but that generates this error:
NotImplementedError: Extension Array: <class 'pandas.core.arrays.sparse.array.SparseArray'> must implement _from_sequence_of_strings in order to be used in parser methods
Any idea how I can do this more efficiently?
I checked several posts, but they all seem to be after something slightly different from this.
EDIT: adding a small example to clarify
import numpy as np
import pandas as pd
import sys
# Create an unpivoted sparse dataset
lengths = list(np.random.randint(low = 1, high = 5, size = 10000))
cols = []
for l in lengths:
    cols.extend(list(np.random.choice(100, size = l, replace = False)))
rows = np.repeat(np.arange(10000), lengths)
vals = np.repeat(1, sum(lengths))
df_unpivoted = pd.DataFrame({"row" : rows, "col" : cols, "val" : vals})
# Pivot and save to a csv file
df = df_unpivoted.pivot(index = "row", columns = "col", values = "val")
df.to_csv("sparse.csv", index = False)
This file occupies 1 MB on my PC.
Instead:
sys.getsizeof(df)
# 8080016
This looks like 8 MB to me.
So there is clearly a large increase in size when making a pd.DataFrame from a sparse csv file (in this case I made the file from the data frame, but it's the same as reading the csv file in with pd.read_csv()).
And this is my point: I cannot use pd.read_csv()
to load the whole csv file into memory.
Here it's only 8 MB, so that's no problem at all; with the actual 1.3 GB csv I referred to, the DataFrame grows so large that it crashes our machine's memory.
I guess it's easy to try that, by replacing 10000 with 1000000 and 100 with 1500 in the above simulation.
If I do instead:
names_of_sparse_columns = df.columns.values
df_sparse = pd.DataFrame()
for colname_i in names_of_sparse_columns:
    data = pd.read_csv('sparse.csv', usecols = [colname_i])
    df_sparse[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])
The resulting object is much smaller:
sys.getsizeof(df_sparse)
# 416700
In fact even smaller than the file.
And this is my second point: doing this column-by-column addition of sparse columns is very slow.
I was looking for advice on how to make df_sparse from a file like "sparse.csv" faster / more efficiently.
In fact, while I was writing this example, I noticed that:
sys.getsizeof(df_unpivoted)
# 399504
So maybe the solution could be to read the csv file line by line and unpivot it. The rest of the handling I need to do, however, would still require that I write out a pivoted csv, so back to square one.
EDIT 2: more information
I might as well describe the rest of the handling I need to do, too.
When I can use a non-sparse data frame, there is an ID
column in the file:
df["ID"] = list(np.random.choice(20, df.shape[0]))
I need to make a summary of how many values exist per ID, per data column:
df.groupby("ID").count()
The unfortunate bit is that the sparse data frame does not support this.
I found a workaround, but it's very inefficient and slow.
If anyone can advise on that aspect, too, it would be useful.
I would have guessed there would be a way to load the sparse part of the csv into some form of sparse array, and make a summary by ID.
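To make that a bit more concrete, this is roughly the kind of thing I was imagining (just a sketch, not something I have run on the real file; it uses scipy.sparse, assumes the ID column is present in "file.csv", and reuses names_of_sparse_columns from above):
import numpy as np
import pandas as pd
from scipy import sparse
# Build a 0/1 "value present" matrix in sparse form, chunk by chunk
blocks, ids = [], []
for chunk in pd.read_csv("file.csv", iterator=True, chunksize=20000):
    ids.append(chunk["ID"].to_numpy())
    present = chunk[names_of_sparse_columns].notna().to_numpy()
    blocks.append(sparse.csr_matrix(present, dtype=np.int64))
presence = sparse.vstack(blocks)  # rows x sparse columns, mostly zeros
ids = np.concatenate(ids)
# One-hot matrix mapping each row to its ID, so the per-ID count is a single matrix product
codes, unique_ids = pd.factorize(ids)
one_hot = sparse.csr_matrix((np.ones(len(codes), dtype=np.int64), (codes, np.arange(len(codes)))))
counts_per_id = pd.DataFrame((one_hot @ presence).toarray(), index=unique_ids, columns=names_of_sparse_columns)
But I don't know whether that is a sensible route, which is why I'm asking.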
Maybe I'm approaching this completely the wrong way, and that's why I am asking this large competent audience for advice.
I don't have the faintest idea why someone would have made a CSV in that format. I would just read it in as chunks and fix the chunks.
# Read in chunks of data, melt it into a dataframe that makes sense
data = [c.melt(id_vars=dense_columns, var_name="Column_label", value_name="Thing").dropna()
        for c in pd.read_csv('file.csv', iterator=True, chunksize=100000)]
# Concat the data together
data = pd.concat(data, axis=0)
Change the chunksize and the name of the value column as needed. You could also read in chunks and turn the chunks into a sparse dataframe if needed, but it seems that you'd be better off with a melted dataframe for what you want to do, IMO.
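For concreteness, here is roughly what both of those look like (a sketch only; I'm assuming the ID column from your second edit is one of the dense_columns, so it survives the melt, and that the chunksize keeps each dense intermediate chunk small enough for memory):
# Counts per ID, per original data column, straight from the melted frame --
# this stands in for df.groupby("ID").count() on the wide sparse frame
counts = (data.groupby(["ID", "Column_label"])["Thing"]
              .count()
              .unstack("Column_label", fill_value=0))
# If you do want a wide frame with sparse columns, convert each chunk after
# parsing instead of asking read_csv for a sparse dtype
sparse_chunks = [c.astype({col: pd.SparseDtype("float") for col in c.columns.drop(dense_columns)})
                 for c in pd.read_csv('file.csv', iterator=True, chunksize=100000)]
df_sparse = pd.concat(sparse_chunks, axis=0)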
You can always chunk it again going the other way as well. Change the number of chunks as needed for your data.
with open('out_file.csv', mode='w') as out:
    for i, chunk in enumerate(np.array_split(df, 100)):
        chunk.iloc[:, 2:] = chunk.iloc[:, 2:].sparse.to_dense()
        chunk.to_csv(out, header=i==0)