I have a large semicolon-delimited text file that weighs in at a little over 100GB. It comprises ~18,000,000 rows of data and 772 columns.
The columns are: 'sc16' (int), 'cpid' (int), 'type' (str), 'pubyr' (int), and then 768 columns labeled 'dim_0', 'dim_1', 'dim_2' ... 'dim_767', which are all ints.
The file is already sorted by sc16 and pubyr, so that all rows with the same sc16+pubyr combination are grouped together, in ascending order.
What I'm trying to do is get the average of each 'dim_' column for each unique combination of sc16 & pubyr, then output the row to a new dataframe and save the final result to a new text file.
The problem is that in my script below, the processing gradually gets slower and slower until it's just creeping along by around row 5,000,000. I'm working on a machine with 96GB of RAM, and I'm not used to working with a file so large that I can't simply load it into memory. This is my first attempt at working with something like itertools, so no doubt I'm being really inefficient. Any help you can provide would be much appreciated!
import itertools
import pandas as pd

# Step 1: create an empty dataframe to store the mean values
mean_df = pd.DataFrame(columns=['sc16', 'pubyr'] + [f"dim_{i}" for i in range(768)])

# Step 2: open the file and iterate through the rows
with open('C:\Python_scratch\scibert_embeddings_sorted.txt') as f:
    counter = 0
    total_lines = sum(1 for line in f)
    f.seek(0)
    for key, group in itertools.groupby(f, key=lambda x: (x.split(';')[0], x.split(';')[3])):  # group by the first (sc16) and fourth (pubyr) column
        sc16, pubyr = key
        rows = [row.strip().split(';') for row in group]
        columns = rows[0]
        rows = rows[1:]

        # Step 3: convert the group of rows to a dataframe
        group_df = pd.DataFrame(rows, columns=columns)

        # Step 4: calculate the mean for the group
        mean_row = {'sc16': sc16, 'pubyr': pubyr}
        for col in group_df.columns:
            if col.startswith('dim_'):
                mean_row[col] = group_df[col].astype(float).mean()

        # Step 5: append the mean row to the mean dataframe
        mean_df = pd.concat([mean_df, pd.DataFrame([mean_row])], ignore_index=True)
        counter += len(rows)
        print(f"{counter} of {total_lines}")

# Step 6: save the mean dataframe to a new file
mean_df.to_csv('C:\Python_scratch\scibert_embeddings_mean.txt', sep=';', index=False)
You might not want to use Pandas at all, since your data is already neatly pre-sorted and all.
Try something like this; it uses numpy to make the dim-wise averaging fast, but is plain Python otherwise. It processes a 43,000-line example file I generated in about 7.6 seconds on my machine, and I don't see a reason why this should slow down over time. (If you know your file won't have a header line or empty lines, you could get rid of those checks.)
Your original code also spent extra time parsing each read line over and over again (the groupby key alone splits every 772-field line twice, and the group body splits it a third time); this version uses a generator that parses each line only once. The gradual slowdown you saw most likely comes from growing mean_df with pd.concat inside the loop, which re-copies the whole accumulated dataframe on every iteration.
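To see the parse-once idea in isolation before the full script below, here's a toy-scale sketch: the rows are made up and have only 3 dims instead of 768, but the pattern is the same, so you can see exactly what groupby receives.

import itertools
import operator

def parsed(lines):
    # Each line is split exactly once; groupby then only compares
    # the ready-made (sc16, pubyr) keys.
    for line in lines:
        sc16, _cpid, _type, pubyr, *dims = line.rstrip("\n").split(";")
        yield (sc16, pubyr), dims

sample = [
    "1;10;article;2001;1;2;3\n",
    "1;11;article;2001;3;4;5\n",
    "2;12;article;2002;5;6;7\n",
]
for key, grp in itertools.groupby(parsed(sample), key=operator.itemgetter(0)):
    print(key, [dims for _, dims in grp])
# ('1', '2001') [['1', '2', '3'], ['3', '4', '5']]
# ('2', '2002') [['5', '6', '7']]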
import itertools
import operator

import numpy as np


def read_embeddings_file(filename):
    # Read the (pre-sorted) embeddings file,
    # yielding ((sc16, pubyr), [dim_0, ..., dim_767]) tuples.
    with open(filename) as in_file:
        for line in in_file:
            if not line.strip() or line.startswith("sc16"):  # Header or empty line
                continue
            line = line.split(";")
            sc16, cpid, type_, pubyr, *dims = line
            # list(map(...)) is faster than the equivalent listcomp
            yield (sc16, pubyr), list(map(int, dims))


def main():
    output_name = "scibert_embeddings_mean.txt"
    input_name = "scibert_embeddings_sorted.txt"

    with open(output_name, "w") as out_f:
        # Write the header row first
        print("sc16", "pubyr", *[f"dim_{i}" for i in range(768)], sep=";", file=out_f)
        counter = 0
        for group, group_contents in itertools.groupby(
            read_embeddings_file(input_name),
            key=operator.itemgetter(0),  # Group by (sc16, pubyr)
        ):
            dims = [d[1] for d in group_contents]
            # Calculate the mean of each dimension
            mean_dims = np.mean(np.array(dims).astype(float), axis=0)
            # Write group to output
            print(*group, *mean_dims, sep=";", file=out_f)
            # Print progress
            counter += len(dims)
            print(f"Processed: {counter}; group: {group}, entries in group: {len(dims)}")


if __name__ == "__main__":
    main()
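If you want to sanity-check the script before pointing it at the 100GB file, here's a rough sketch of the kind of test harness I'd use. The group sizes are made up and it reuses the script's relative file names (so run it in a scratch directory), but the column layout (sc16;cpid;type;pubyr;dim_0..dim_767) matches the data described above; it writes a small pre-sorted input file and, after you've run the script on it, compares the result against pandas' own groupby().mean() on the same toy data.

import random

import numpy as np
import pandas as pd

N_DIMS = 768        # matches the header the script above writes
ROWS_PER_GROUP = 5  # arbitrary small group size


def make_test_file(path, n_sc16=20):
    # Write a small semicolon-delimited file, already sorted by sc16 and pubyr.
    with open(path, "w") as f:
        print("sc16", "cpid", "type", "pubyr",
              *[f"dim_{i}" for i in range(N_DIMS)], sep=";", file=f)
        cpid = 0
        for sc16 in range(n_sc16):              # sorted by construction
            for pubyr in (2020, 2021):
                for _ in range(ROWS_PER_GROUP):
                    dims = [random.randint(-5, 5) for _ in range(N_DIMS)]
                    print(sc16, cpid, "article", pubyr, *dims, sep=";", file=f)
                    cpid += 1


def check(input_path, output_path):
    # Compare the script's per-group means against pandas on the toy data.
    dim_cols = [f"dim_{i}" for i in range(N_DIMS)]
    raw = pd.read_csv(input_path, sep=";")
    expected = (raw.groupby(["sc16", "pubyr"], as_index=False)
                   .mean(numeric_only=True)
                   .sort_values(["sc16", "pubyr"]))
    got = pd.read_csv(output_path, sep=";").sort_values(["sc16", "pubyr"])
    assert np.allclose(expected[dim_cols].to_numpy(), got[dim_cols].to_numpy())
    print("Means match pandas' groupby().mean()")


if __name__ == "__main__":
    make_test_file("scibert_embeddings_sorted.txt")
    # ...run the script above on the toy file, then:
    # check("scibert_embeddings_sorted.txt", "scibert_embeddings_mean.txt")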