Tags: python, dictionary, memory, memory-management

Python data to dict memory concerns: how to efficiently load data into a dict?


I'm trying to load a very large dataset (~560 MB) into a dict in order to display it as a 3D graph. I ran into memory issues that ended with the process being "Killed", so I added some logic to read my dataset in chunks and dump the dict to a JSON file periodically, hoping this would keep my RAM from filling up. However, the process still gets killed after reaching about 4.00M/558.0M of progress.

I want to understand how this roughly 560 MB file can cost me gigabytes of RAM just to cut away unwanted columns and transform the rest into a dict, and whether there are more efficient ways to get what I need: a data object from which I can efficiently extract sets of coordinates with their values.

Please find my code and some example data below:

import json
import logging
import os

import pandas as pd
from tqdm import tqdm


def create_grid_dict(file_path, chunk_size=500000):
    """
    :param file_path: Path to a grid file.
    :param chunk_size: Number of lines to process before dumping into json
    :return: Dictionary containing the gist grid data, keyed by voxel number, with the
             x, y and z coordinates and the value as values
    """
    # Read the data from the file
    with open(file_path, 'r') as file:
        # Read the first line
        header = file.readline().strip()
        header2 = file.readline().strip()
        # Log the header
        logging.info(header)
    columns = header2.split(' ')

    # Get the file size
    file_size = os.path.getsize(file_path)

    output_file = 'datasets/cache.json'
    # Check if the output file already exists
    if os.path.exists(output_file):
        with open(output_file, 'r') as f:
            grid_dict = json.load(f)
            return grid_dict
    else:
        # Create an empty dictionary to store the grid data
        grid_dict = {}

    logging.info(f"Reading file size {file_size} in chunks of {chunk_size} lines.")
    # Read the file in chunks
    with tqdm(total=file_size, unit='B', unit_scale=True, desc="Processing") as pbar:
        for chunk in pd.read_csv(file_path, delim_whitespace=True, skiprows=2, names=columns, chunksize=chunk_size):
            # Filter out the columns you need
            chunk = chunk[['voxel', 'xcoord', 'ycoord', 'zcoord', 'val1', 'val2']]

            # Iterate through each row in the chunk
            for index, row in chunk.iterrows():
                voxel = row['voxel']
                # Store the values in the dictionary
                grid_dict[voxel] = {
                    'xcoord': row['xcoord'],
                    'ycoord': row['ycoord'],
                    'zcoord': row['zcoord'],
                    'val': row['val1'] + 2 * row['val2']
                }
            pbar.update(chunk_size)

            # Write the grid dictionary to the output file after processing each chunk
            with open(output_file, 'w') as f:
                json.dump(grid_dict, f)
    return grid_dict

Example space-delimited dataset:

voxel xcoord ycoord zcoord val1 val2
1 0.1 0.2 0.3 10 5
2 0.2 0.3 0.4 8 4
3 0.3 0.4 0.5 12 6
4 0.4 0.5 0.6 15 7
5 0.5 0.6 0.7 9 3
6 0.6 0.7 0.8 11 5
7 0.7 0.8 0.9 13 6
8 0.8 0.9 1.0 14 7
9 0.9 1.0 1.1 16 8
10 1.0 1.1 1.2 18 9

Solution

  • You're not just making a dict, you're making a dict of dicts, and each of those inner dicts incurs a non-trivial amount of overhead (on my machine, each of your small dicts adds 184 bytes of overhead all by itself). Your chunking is nigh unto useless because the dict of dicts keeps growing; the (likely fairly efficiently stored) DataFrame is probably not all that important, memory-wise.

    One thing you could try to reduce that overhead is to have your dict map to slotted classes instead of dicts, and use dataclasses.asdict to make them encode to the dict form:

    import dataclasses
    import json
    
    @dataclasses.dataclass(slots=True)  # slots=True (Python 3.10+) stores only the declared attributes, cutting per-instance memory
    class Voxel:
        xcoord: float
        ycoord: float
        zcoord: float
        val: int
    

    then change the relevant part of your loop to:

                for index, row in chunk.iterrows():
                    voxel = row['voxel']
                    # Store the values in the dict
                    grid_dict[voxel] = Voxel(row['xcoord'],
                                             row['ycoord'],
                                             row['zcoord'],
                                             row['val1'] + 2 * row['val2']
                                            )
                pbar.update(chunk_size)
    
                # Write the grid dictionary to the output file after processing each chunk
                with open(output_file, 'w') as f:
                    json.dump(grid_dict, f, default=dataclasses.asdict)  # each Voxel is converted to a dict only as it is serialized, rather than all at once
    

    dataclasses both make the custom class quicker to declare and provide the dataclasses.asdict helper to make serializing it easy (collections.namedtuple/typing.NamedTuple might seem like they'd work, but unfortunately they're treated as plain tuples and serialized as JSON arrays, as shown below).
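
    To make that difference concrete, here is a minimal sketch (the VoxelTuple name is just for illustration) comparing the JSON each approach produces:

    import dataclasses
    import json
    import typing

    @dataclasses.dataclass(slots=True)
    class Voxel:
        xcoord: float
        ycoord: float
        zcoord: float
        val: int

    class VoxelTuple(typing.NamedTuple):  # hypothetical tuple-based alternative
        xcoord: float
        ycoord: float
        zcoord: float
        val: int

    # NamedTuple subclasses tuple, so json.dumps emits a JSON array:
    print(json.dumps({"1": VoxelTuple(0.1, 0.2, 0.3, 20)}))
    # {"1": [0.1, 0.2, 0.3, 20]}

    # The dataclass keeps the per-voxel object shape via asdict:
    print(json.dumps({"1": Voxel(0.1, 0.2, 0.3, 20)}, default=dataclasses.asdict))
    # {"1": {"xcoord": 0.1, "ycoord": 0.2, "zcoord": 0.3, "val": 20}}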

    The net savings here, on my local Python install, is 120 bytes per Voxel (the dict overhead is 184 bytes per dict vs. 64 bytes per Voxel instance). If you're close to being able to fit in memory, this might be enough to get you there.
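
    If you want to check those numbers on your own install (they vary somewhat between Python versions and builds), a quick sys.getsizeof comparison looks roughly like this; note that it measures only each container itself, not the floats it references:

    import dataclasses
    import sys

    @dataclasses.dataclass(slots=True)
    class Voxel:
        xcoord: float
        ycoord: float
        zcoord: float
        val: int

    as_dict = {'xcoord': 0.1, 'ycoord': 0.2, 'zcoord': 0.3, 'val': 20}
    as_voxel = Voxel(0.1, 0.2, 0.3, 20)

    # On the install used for the numbers above, this printed 184 and 64.
    print(sys.getsizeof(as_dict), sys.getsizeof(as_voxel))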