I'm trying to load a very large dataset (~560 MB) into a dict in order to display it as a 3D graph. I ran into memory issues that ended in "Killed", so I added some logic to read the dataset in chunks and dump the dict to a JSON file periodically, hoping this would keep my RAM from filling up. However, the process still gets killed after reaching about 4.00M/558.0M of progress.
I want to understand how this roughly 560 MB file can cost gigabytes of RAM just to cut away unwanted columns and transform the rest into a dict, and whether there are more efficient ways to get what I need: a data object from which I can efficiently extract sets of coordinates with their values.
Please find my code and some example data below:
import json
import logging
import os

import pandas as pd
from tqdm import tqdm


def create_grid_dict(file_path, chunk_size=500000):
    """
    :param file_path: Path to a grid file.
    :param chunk_size: Number of lines to process before dumping into json
    :return: Dictionary containing the gist grid data, keyed by voxel number,
             with the x, y and z coordinates and the value as values
    """
    # Read the two header lines from the file
    with open(file_path, 'r') as file:
        header = file.readline().strip()
        header2 = file.readline().strip()
        # Log the first header line
        logging.info(header)
        columns = header2.split(' ')

    # Get the file size
    file_size = os.path.getsize(file_path)
    output_file = 'datasets/cache.json'

    # Check if the output file already exists
    if os.path.exists(output_file):
        with open(output_file, 'r') as f:
            grid_dict = json.load(f)
            return grid_dict
    else:
        # Create an empty dictionary to store the grid data
        grid_dict = {}
        logging.info(f"Reading file size {file_size} in chunks of {chunk_size} lines.")
        # Read the file in chunks
        with tqdm(total=file_size, unit='B', unit_scale=True, desc="Processing") as pbar:
            for chunk in pd.read_csv(file_path, delim_whitespace=True, skiprows=2,
                                     names=columns, chunksize=chunk_size):
                # Filter out the columns you need
                chunk = chunk[['voxel', 'xcoord', 'ycoord', 'zcoord', 'val1', 'val2']]
                # Iterate through each row in the chunk
                for index, row in chunk.iterrows():
                    voxel = row['voxel']
                    # Store the values in the dictionary
                    grid_dict[voxel] = {
                        'xcoord': row['xcoord'],
                        'ycoord': row['ycoord'],
                        'zcoord': row['zcoord'],
                        'val': row['val1'] + 2 * row['val2']
                    }
                pbar.update(chunk_size)
                # Write the grid dictionary to the output file after processing each chunk
                with open(output_file, 'w') as f:
                    json.dump(grid_dict, f)
    return grid_dict
# Example space-delimited dataset
voxel xcoord ycoord zcoord val1 val2
1 0.1 0.2 0.3 10 5
2 0.2 0.3 0.4 8 4
3 0.3 0.4 0.5 12 6
4 0.4 0.5 0.6 15 7
5 0.5 0.6 0.7 9 3
6 0.6 0.7 0.8 11 5
7 0.7 0.8 0.9 13 6
8 0.8 0.9 1.0 14 7
9 0.9 1.0 1.1 16 8
10 1.0 1.1 1.2 18 9
You're not just making a dict, you're making a dict of dicts, each of which incurs a non-trivial amount of overhead (on my machine, each of your small dicts adds 184 bytes of overhead all by itself). Your chunking is nigh unto useless, because the dict of dicts keeps growing; the (likely fairly efficiently stored) DataFrame is probably not all that important, memory-wise.
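If you want to see that per-dict overhead for yourself, a rough check with the standard library's sys.getsizeof will do; it measures only the dict container (not the key strings or numbers it references), and the exact figure depends on your Python version and build:

import sys

# One row, stored the way the question's loop stores it
row_as_dict = {'xcoord': 0.1, 'ycoord': 0.2, 'zcoord': 0.3, 'val': 20}

# Size of the dict structure itself, excluding the objects it points to;
# a recent 64-bit CPython reports 184 here, older versions somewhat more.
print(sys.getsizeof(row_as_dict))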
One thing you could try to reduce that overhead is to have your dict map to slotted classes instead of dicts, and use dataclasses.asdict to make them encode to the dict form:
import dataclasses
import json


@dataclasses.dataclass(slots=True)  # slots=True stores only the declared attributes, with less memory
class Voxel:
    xcoord: float
    ycoord: float
    zcoord: float
    val: int
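(One compatibility note: the slots=True argument to dataclass is only available on Python 3.10 and newer. On older versions you can get a roughly equivalent class by declaring __slots__ by hand, as in this sketch, which works here because none of the fields have default values:)

import dataclasses


@dataclasses.dataclass
class Voxel:
    __slots__ = ('xcoord', 'ycoord', 'zcoord', 'val')
    xcoord: float
    ycoord: float
    zcoord: float
    val: int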
then change the relevant part of your loop to:
for index, row in chunk.iterrows():
    voxel = row['voxel']
    # Store the values in the dict
    grid_dict[voxel] = Voxel(row['xcoord'],
                             row['ycoord'],
                             row['zcoord'],
                             row['val1'] + 2 * row['val2'])
pbar.update(chunk_size)
# Write the grid dictionary to the output file after processing each chunk
with open(output_file, 'w') as f:
    json.dump(grid_dict, f, default=dataclasses.asdict)  # converts each Voxel to a dict lazily, just in time to serialize it, rather than all at once
The dataclasses module both makes the custom class quicker to declare and provides the dataclasses.asdict helper to make serializing easy (collections.namedtuple/typing.NamedTuple might seem like they'd work, but unfortunately they're interpreted as plain tuples and serialized as JSON Arrays).
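To see that difference concretely, here's a small sketch; the VoxelNT named tuple is hypothetical, and Voxel is the slotted dataclass defined above:

import dataclasses
import json
from typing import NamedTuple


class VoxelNT(NamedTuple):
    xcoord: float
    ycoord: float
    zcoord: float
    val: int


print(json.dumps(VoxelNT(0.1, 0.2, 0.3, 20)))
# -> [0.1, 0.2, 0.3, 20]   (a JSON Array; the field names are lost)
print(json.dumps(dataclasses.asdict(Voxel(0.1, 0.2, 0.3, 20))))
# -> {"xcoord": 0.1, "ycoord": 0.2, "zcoord": 0.3, "val": 20}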
The net savings here, on my local Python install, is 120 bytes per Voxel (the dict overhead is 184 bytes per dict, vs. 64 bytes per Voxel instance). If you're close to being able to fit in memory, this might be enough to get you there.
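If you want to check the per-instance side of that comparison on your own install (again, exact numbers vary between Python versions), the same sys.getsizeof measurement works on the slotted instance:

import sys

v = Voxel(0.1, 0.2, 0.3, 20)  # the slotted dataclass defined above

# Measures only the instance itself, not the numbers it references;
# a recent 64-bit CPython reports 64 here, i.e. 120 bytes less than the 184-byte dict above.
print(sys.getsizeof(v))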