python, pandas, python-itertools

Calculate averages over subgroups of data in extremely large (100GB+) CSV file


I have a large semicolon-delimited text file that weighs in at a little over 100GB. It comprises ~18,000,000 rows of data and 772 columns.

The columns are: 'sc16' (int), 'cpid' (int), 'type' (str), 'pubyr' (int) and then 768 columns labeled 'dim_0', 'dim_1', 'dim_2' ... 'dim_767', which are all ints.

The file is already arranged/sorted by sc16 and pubyr, so that each combination of sc16+pubyr is grouped together in ascending order.

What I'm trying to do is get the average of each 'dim_' column for each unique combination of sc16 & pubyr, write one row of means per combination to a new dataframe, and save the final result to a new text file.
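
For a small in-memory sample, the result I'm after looks like this (toy numbers with 'cpid' and 'type' omitted and only two 'dim_' columns, purely to illustrate the shape of the output):

import pandas as pd

# Toy illustration only -- made-up values, not the real data
toy = pd.DataFrame({
    'sc16':  [1, 1, 1, 2],
    'pubyr': [2020, 2020, 2021, 2021],
    'dim_0': [1, 3, 5, 7],
    'dim_1': [2, 4, 6, 8],
})
print(toy.groupby(['sc16', 'pubyr'], as_index=False).mean())
#    sc16  pubyr  dim_0  dim_1
# 0     1   2020    2.0    3.0
# 1     1   2021    5.0    6.0
# 2     2   2021    7.0    8.0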

The problem is that in my script below, the processing gradually gets slower and slower until it's just creeping along by row 5,000,000. I'm working on a machine with 96GB of RAM, and I'm not used to working with a file so large that I can't simply load it into memory. This is my first attempt at working with something like itertools, so no doubt I'm being really inefficient. Any help you can provide would be much appreciated!

import itertools
import pandas as pd

# Step 1: create an empty dataframe to store the mean values
mean_df = pd.DataFrame(columns=['sc16', 'pubyr'] + [f"dim_{i}" for i in range(768)]) 

# Step 2: open the file and iterate through the rows
with open(r'C:\Python_scratch\scibert_embeddings_sorted.txt') as f:
    counter = 0
    total_lines = sum(1 for line in f)
    f.seek(0)
    for key, group in itertools.groupby(f, key=lambda x: (x.split(';')[0], x.split(';')[3])): # group by the first (sc16) and fourth (pubyr) column 
        sc16, pubyr = key
        rows = [row.strip().split(';') for row in group]
        columns = rows[0]
        rows = rows[1:]
        # Step 3: convert the group of rows to a dataframe
        group_df = pd.DataFrame(rows, columns=columns)

        # Step 4: calculate the mean for the group
        mean_row = {'sc16': sc16, 'pubyr': pubyr}
        for col in group_df.columns:
            if col.startswith('dim_'):
                mean_row[col] = group_df[col].astype(float).mean()

        # Step 5: append the mean row to the mean dataframe
        mean_df = pd.concat([mean_df, pd.DataFrame([mean_row])], ignore_index=True)
        counter += len(rows)
        print(f"{counter} of {total_lines}")

# Step 6: save the mean dataframe to a new file
mean_df.to_csv(r'C:\Python_scratch\scibert_embeddings_mean.txt', sep=';', index=False)

Solution

  • You might not want to use Pandas at all, since your data is already neatly pre-sorted.

    Try something like this; it's using numpy to make dim-wise averaging fast, but is plain Python otherwise. It processes a 43,000 line example file I generated in about 7.6 seconds on my machine, and I don't see a reason why this should slow down over time. (If you know your file won't have a header line or empty lines, you could get rid of those checks.)

    Your original code also spends extra time parsing each line more than once (the groupby key function splits every line twice, and the loop body splits it again); this version uses a generator that parses each line only once. The gradual slowdown you're seeing is most likely the pd.concat call inside the loop: every iteration copies the ever-growing mean_df, so the total work grows roughly quadratically with the number of groups. Counting total_lines up front also costs a full extra pass over the 100GB file.

    import itertools
    import operator
    
    import numpy as np
    
    
    def read_embeddings_file(filename):
        # Read the (pre-sorted) embeddings file,
        # yielding ((sc16, pubyr), [dim_0, dim_1, ...]) tuples.
        with open(filename) as in_file:
            for line in in_file:
                if not line.strip() or line.startswith("sc16"):  # Header or empty line
                    continue
                line = line.split(";")
                sc16, cpid, type_, pubyr, *dims = line  # type_ avoids shadowing the built-in
                # list(map(... is faster than the equivalent listcomp
                yield (sc16, pubyr), list(map(int, dims))
    
    
    def main():
        output_name = "scibert_embeddings_mean.txt"
        input_name = "scibert_embeddings_sorted.txt"
        with open(output_name, "w") as out_f:
            print("sc16", "pubyr", *[f"dim_{i}" for i in range(768)], sep=";", file=out_f)
            counter = 0
            for group, group_contents in itertools.groupby(
                read_embeddings_file(input_name),
                key=operator.itemgetter(0),  # Group by (sc16, pubyr)
            ):
                dims = [d[1] for d in group_contents]
                # Calculate the mean of each dimension
                mean_dims = np.mean(np.array(dims).astype(float), axis=0)
                # Write group to output
                print(*group, *mean_dims, sep=";", file=out_f)
                # Print progress
                counter += len(dims)
                print(f"Processed: {counter}; group: {group}, entries in group: {len(dims)}")
    
    
    if __name__ == "__main__":
        main()
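
    If you'd rather stay in pandas, here's a rough sketch of the same idea using pd.read_csv with chunksize, accumulating per-group sums and counts and dividing once at the end. It's untested against your real file; the chunk size is a guess to tune against available memory, and the column names are taken from your description of the header:

    import pandas as pd

    # Sketch only -- assumes a semicolon-separated file with a header row
    # containing sc16;cpid;type;pubyr;dim_0;...;dim_767 as described above.
    dim_cols = [f"dim_{i}" for i in range(768)]
    sums = None    # running per-(sc16, pubyr) sums of the dim_ columns
    counts = None  # running per-(sc16, pubyr) row counts

    reader = pd.read_csv(
        "scibert_embeddings_sorted.txt",
        sep=";",
        usecols=["sc16", "pubyr"] + dim_cols,
        chunksize=500_000,  # assumed chunk size; adjust to taste
    )
    for chunk in reader:
        grouped = chunk.groupby(["sc16", "pubyr"])
        chunk_sums = grouped[dim_cols].sum()
        chunk_counts = grouped.size()
        sums = chunk_sums if sums is None else sums.add(chunk_sums, fill_value=0)
        counts = chunk_counts if counts is None else counts.add(chunk_counts, fill_value=0)

    # Divide the accumulated sums by the row counts to get per-group means.
    means = sums.div(counts, axis=0)
    means.to_csv("scibert_embeddings_mean.txt", sep=";")

    Because it only keeps the running per-group sums and counts in memory, this variant doesn't rely on the file being pre-sorted, and it avoids both the repeated pd.concat and the extra line-counting pass.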