Problem: shuffle the lines of a T-terabyte text file containing n lines (the same line can appear multiple times), given Z terabytes of RAM, where T = Z * 100. A quasi-shuffle is fine.
Presently I'm using this Python implementation, which performs a quasi-shuffle, but it's somewhat slow. The algorithm is O(n), so I believe the slowness is caused by Python. I was thinking about re-implementing it in C, but before doing that I wanted to ask whether anyone knows of an existing solution.
Things that DO NOT work: GNU shuf (loads the entire file into memory before shuffling), GNU sort -R (hashes each line and sorts by the hash, so identical lines end up adjacent in the output).
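To make the sort -R issue concrete, here is a small Python sketch of ordering lines by a randomly salted hash. It is only an illustration of the idea, not GNU sort's actual code; the function name `sort_by_random_hash` and the choice of SHA-256 are my own assumptions.

```python
import hashlib
import os

def sort_by_random_hash(lines):
    """Order lines by a salted hash of their content (hypothetical helper)."""
    salt = os.urandom(16)  # fresh random salt, so the order differs per run

    def key(line):
        return hashlib.sha256(salt + line.encode("utf-8")).digest()

    # Identical lines get identical hashes, so they always sort together.
    return sorted(lines, key=key)

lines = ["a", "b", "a", "c", "b", "a"]
shuffled = sort_by_random_hash(lines)
print(shuffled)  # the order of the groups varies, but duplicates stay adjacent
```

The groups come out in a random order, but every copy of "a" sits next to the other copies, which is not what you want from a shuffle of a file with repeated lines.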
I solved the problem with the following C++ implementation that is significantly faster: https://github.com/alexandres/terashuf
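For anyone who wants the general idea without reading the C++ source, below is a minimal Python sketch of one way to shuffle a file that does not fit in RAM. It is my own simplified illustration and not necessarily how the linked tool is implemented; the names `chunked_shuffle` and `max_lines_per_chunk` are hypothetical. It reads as many lines as fit in memory, shuffles them, writes them to a temporary file, and repeats. It then builds the output by repeatedly picking a temporary file with probability proportional to the number of lines it still holds.

```python
import random
import tempfile

def chunked_shuffle(in_path, out_path, max_lines_per_chunk, rng=None):
    """Shuffle the lines of in_path into out_path using bounded memory."""
    rng = rng or random.Random()
    chunk_files = []  # one open temp file per shuffled chunk
    remaining = []    # lines not yet emitted from each chunk

    def flush(buf):
        rng.shuffle(buf)                       # in-memory shuffle of one chunk
        f = tempfile.TemporaryFile(mode="w+b")
        f.writelines(buf)
        f.seek(0)
        chunk_files.append(f)
        remaining.append(len(buf))

    # Phase 1: split the input into shuffled chunks on disk.
    with open(in_path, "rb") as src:
        buf = []
        for line in src:
            if not line.endswith(b"\n"):       # last line may lack a newline
                line += b"\n"
            buf.append(line)
            if len(buf) >= max_lines_per_chunk:
                flush(buf)
                buf = []
        if buf:
            flush(buf)

    # Phase 2: interleave chunks, weighting each by its remaining lines.
    total = sum(remaining)
    with open(out_path, "wb") as dst:
        while total > 0:
            r = rng.randrange(total)
            for i, count in enumerate(remaining):
                if r < count:
                    break
                r -= count
            dst.write(chunk_files[i].readline())
            remaining[i] -= 1
            total -= 1

    for f in chunk_files:
        f.close()
```

Usage would be something like `chunked_shuffle("big.txt", "big.shuffled.txt", max_lines_per_chunk=10_000_000)`, with the chunk size picked so one chunk fits comfortably in RAM. Caveats of this sketch: it keeps every temporary file open at once, so with T = Z * 100 you need room for 100 or more open file descriptors, and the linear scan over chunks on every output line is fine for a few hundred chunks but would need a smarter structure for many more.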