
Powershell random shuffle/split large text file


Is there a fast way in PowerShell to randomly shuffle a text file with 15 million rows and split it 15%/85%?

Many sources show how to do this with Get-Content, but combining Get-Content with Get-Random is slow for large files:

Get-Content "largeFile.txt" | Sort-Object { Get-Random } | Out-File "shuffled.txt"

I was looking for solutions using StreamReader and StreamWriter, but I'm not sure whether that's possible. Bash on Linux seems to do this extremely fast for my file of 15 million rows: How can I shuffle the lines of a text file on the Unix command line or in a shell script?


Solution

  • I wanted to use a stream reader/writer to keep memory usage down, since some of these files are over 300 MB. I could not avoid using memory completely, but instead of loading the file into memory, I create an array of random numbers between 0 and the total number of lines. The array indicates which rows go into the sample file.

    Create Stream Reader for Data

    $reader = New-Object -TypeName System.IO.StreamReader("data.txt");
    

    Create Stream Writer for Test Population

    $writer_stream = New-Object -TypeName System.IO.FileStream(
        ("test_population.txt"),
        [System.IO.FileMode]::Create,
        [System.IO.FileAccess]::Write);
    $writer= New-Object -TypeName System.IO.StreamWriter(
        $writer_stream,
        [System.Text.Encoding]::ASCII);
    

    Create Stream Writer for Control Group

    $writer_stream_control = New-Object -TypeName System.IO.FileStream(
        ("control.txt"),
        [System.IO.FileMode]::Create,
        [System.IO.FileAccess]::Write);
    $writer_control= New-Object -TypeName System.IO.StreamWriter(
        $writer_stream_control,
        [System.Text.Encoding]::ASCII);
    

    Determine the control size and randomly choose numbers between 0 and the total number of rows in the file.

    $line_count = 10000000
    $control_percent = 0.15
    $control_size = [math]::round($control_percent*$line_count)
    

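    Create an index of random row numbers to determine which rows should go to the sample file. Make sure to pipe it through sort at the end: the loop below walks the index in ascending order, so an unsorted index would skip rows.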

    $idx = Get-Random -count $control_size -InputObject(0..($line_count-1))|sort -Unique
    

    Let $i be the current line number and $idx[$j] the next row that should go to the sample file.

    $i = 0; $j = 0
    while ($reader.Peek() -ge 0) {
        $line = $reader.ReadLine() # read the next line
        if ($idx[$j] -eq $i) {
            # this row was picked for the control group
            $writer_control.WriteLine($line)
            $j++
        }
        else {
            $writer.WriteLine($line)
        }
        $i++ # advance the line counter on every row
    }
    
    $reader.Close();
    $reader.Dispose();
    
    $writer.Flush();
    $writer.Close();
    $writer.Dispose();
    
    $writer_control.Flush();
    $writer_control.Close();
    $writer_control.Dispose();
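
The steps above can be put together as a rough sketch. The function name Split-TextFileRandomly and its parameters are hypothetical and not part of the original answer. The sketch also assumes it is acceptable to read the file twice: once to count the lines instead of hard-coding $line_count, and once to split it. Like the loop above, it keeps the rows in their original order in both output files, and it writes ASCII as the original writers do, so non-ASCII characters would be replaced.

```powershell
# Sketch only: the function and parameter names are hypothetical.
function Split-TextFileRandomly {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$ControlPath,
        [Parameter(Mandatory)][string]$TestPath,
        [double]$ControlPercent = 0.15
    )

    # .NET resolves relative paths against the process working directory,
    # which can differ from the PowerShell location, so use full paths.
    $pathApi     = $ExecutionContext.SessionState.Path
    $source      = (Resolve-Path -LiteralPath $Path).ProviderPath
    $controlFull = $pathApi.GetUnresolvedProviderPathFromPSPath($ControlPath)
    $testFull    = $pathApi.GetUnresolvedProviderPathFromPSPath($TestPath)

    # Pass 1: count the lines by streaming, without keeping them in memory.
    $lineCount = 0
    $reader = New-Object System.IO.StreamReader -ArgumentList $source
    try { while ($null -ne $reader.ReadLine()) { $lineCount++ } }
    finally { $reader.Dispose() }

    $controlSize = [int][math]::Round($ControlPercent * $lineCount)

    # Distinct random row numbers, sorted ascending for the single pass below.
    $idx = @()
    if ($controlSize -gt 0) {
        $idx = @(Get-Random -Count $controlSize -InputObject (0..($lineCount - 1)) | Sort-Object)
    }

    # Pass 2: route each line to the control file or the test file.
    $reader = New-Object System.IO.StreamReader -ArgumentList $source
    $controlWriter = $null
    $testWriter = $null
    try {
        $ascii = [System.Text.Encoding]::ASCII
        $controlWriter = New-Object System.IO.StreamWriter -ArgumentList $controlFull, $false, $ascii
        $testWriter    = New-Object System.IO.StreamWriter -ArgumentList $testFull, $false, $ascii
        $i = 0; $j = 0
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($j -lt $idx.Count -and $idx[$j] -eq $i) {
                $controlWriter.WriteLine($line)
                $j++
            }
            else {
                $testWriter.WriteLine($line)
            }
            $i++
        }
    }
    finally {
        # Dispose flushes and closes the streams even if an error occurs.
        $reader.Dispose()
        if ($controlWriter) { $controlWriter.Dispose() }
        if ($testWriter) { $testWriter.Dispose() }
    }

    # Report how many rows went where.
    [pscustomobject]@{ Total = $lineCount; Control = $controlSize; Test = $lineCount - $controlSize }
}
```

It could be called as Split-TextFileRandomly -Path "data.txt" -ControlPath "control.txt" -TestPath "test_population.txt". The index array still holds one number per selected row, so memory use grows with the control size rather than with the file size.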