Line shuffling multi-terabyte text file

Problem: shuffle the lines of a T-terabyte text file containing n lines (the same line can appear multiple times in the file) given Z terabytes of RAM, where T = Z * 100. Quasi-shuffling is fine.

Presently I'm using this Python implementation, which performs a quasi-shuffle, but it's somewhat slow. The algorithm is O(n), so I believe the slowness is caused by Python. I was thinking about re-implementing it in C, but before doing that I wanted to ask whether anyone knows of an existing solution.

Things that DO NOT work: GNU shuf (loads the entire file to be shuffled into memory) and GNU sort -R (hashes each line, so identical lines end up adjacent in the output).
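To illustrate why hash-based ordering is a problem for files with repeated lines, here is a minimal Python sketch. It only mimics the idea of sorting by a salted hash; the function name `hash_order` and the choice of MD5 are assumptions for illustration, not a description of how GNU sort is implemented.

```python
import hashlib
import os


def hash_order(lines, salt=None):
    """Order lines by a salted hash of their content (hypothetical helper)."""
    if salt is None:
        salt = os.urandom(16)  # a fresh salt gives a different order per run
    # Identical lines always get identical keys, so sorting groups them.
    return sorted(
        lines,
        key=lambda s: hashlib.md5(salt + s.encode("utf-8")).digest(),
    )


lines = ["a", "b", "a", "c", "b", "a"]
out = hash_order(lines)
print(out)  # every copy of "a" sits next to the other copies, same for "b"
```

Whatever salt is drawn, duplicates can never be separated, because equal lines map to equal sort keys.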

Solution

I solved the problem with the following C++ implementation, which is significantly faster: https://github.com/alexandres/terashuf
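For readers who want to see the general shape of a limited-memory shuffle, here is a minimal Python sketch of one common approach: shuffle memory-sized chunks, write them to temporary files, then interleave the chunks at random. It is not necessarily what the linked tool does. The function name `quasi_shuffle` and the `max_lines_in_memory` parameter are assumptions for illustration; a real tool would more likely budget memory in bytes than in lines.

```python
import os
import random
import tempfile


def quasi_shuffle(src_path, dst_path, max_lines_in_memory, seed=None):
    """Shuffle the lines of src_path into dst_path using bounded memory."""
    rng = random.Random(seed)
    chunks = []  # each entry: [open temp file, number of unread lines]

    with tempfile.TemporaryDirectory() as tmp_dir:

        def flush(buffer):
            # Shuffle one memory-sized chunk and spill it to a temp file.
            rng.shuffle(buffer)
            path = os.path.join(tmp_dir, f"chunk{len(chunks)}")
            f = open(path, "w+b")
            f.writelines(buffer)
            f.seek(0)
            chunks.append([f, len(buffer)])

        try:
            # Pass 1: read the input in chunks that fit the line budget.
            with open(src_path, "rb") as src:
                buffer = []
                for line in src:
                    if not line.endswith(b"\n"):
                        line += b"\n"  # normalise a missing final newline
                    buffer.append(line)
                    if len(buffer) >= max_lines_in_memory:
                        flush(buffer)
                        buffer = []
                if buffer:
                    flush(buffer)

            # Pass 2: repeatedly pick a chunk with probability proportional
            # to its unread line count and emit its next line.
            with open(dst_path, "wb") as dst:
                remaining = sum(n for _, n in chunks)
                while remaining:
                    r = rng.randrange(remaining)
                    for chunk in chunks:
                        if r < chunk[1]:
                            break
                        r -= chunk[1]
                    dst.write(chunk[0].readline())
                    chunk[1] -= 1
                    remaining -= 1
        finally:
            for f, _ in chunks:
                f.close()
```

Calling `quasi_shuffle("in.txt", "out.txt", max_lines_in_memory=1_000_000)` would keep at most one million lines in memory at a time (the file names here are placeholders). With T = Z * 100 there would be on the order of 100 chunk files or more, so the linear scan over chunks for each output line stays cheap, and duplicate lines are handled like any other line.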
