I have big svmlight files that I'm using for machine learning purposes. I'm trying to see if subsampling those files would lead to good enough results.
I want to extract random lines from my files to feed into my models, but I want to load as little data as possible into RAM.
I saw here (Read a number of random lines from a file in Python) that I could use linecache, but all the solutions there end up loading everything into memory.
Could someone give me some hints? Thank you.
EDIT: I forgot to say that I know the number of lines in my files beforehand.
You can use heapq.nlargest to select n records, using a random number as the key, e.g.:
import heapq
import random

SIZE = 10

with open('yourfile') as fin:
    # Assign each line an independent uniform random key and keep
    # the SIZE lines with the largest keys.
    sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())
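Because each line gets an independent uniform random key, every line is equally likely to land among the SIZE largest, so this yields a uniform sample without replacement in a single pass.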
This is remarkably memory-efficient: the heap stays at a fixed size of SIZE, no pre-scan of the data is required, and lines get swapped out as lines with larger keys arrive, so at most SIZE lines are held in memory at once.
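Since you mention you know the number of lines beforehand, here's a minimal sketch of another single-pass approach that exploits that: pre-select SIZE distinct line numbers, then stream the file and keep only those lines. The filename 'yourfile' and the n_lines value are placeholders to adapt to your setup.

import random

SIZE = 10
n_lines = 1_000_000  # hypothetical: replace with your known line count

# Choose SIZE distinct target line numbers up front.
wanted = set(random.sample(range(n_lines), SIZE))

sample = []
with open('yourfile') as fin:
    for i, line in enumerate(fin):
        if i in wanted:
            sample.append(line)
            if len(sample) == SIZE:  # stop early once all targets are found
                break

This also keeps at most SIZE lines in memory, and it can finish before reaching the end of the file if the last chosen line number is early.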