Split huge CSV by columns with Miller


I need to split huge (>1 GB) CSV files, each containing 50K+ columns, on a daily basis.

Miller looks like an interesting and performant tool for this task.

But I'm stuck on its documentation.

How can I split one CSV into N smaller CSV files, where N is a number of rows in my source file?


Solution

  • Try this script:

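    # Tag every 10000th record with its record number, fill that tag down over
    # the blank values that follow, label the leading records "0", then write
    # each record to a file named after its tag.
    # --only-if-blank makes fill-down treat the empty rule values as missing;
    # without it, only absent fields are filled.
    mlr --csv put -S 'if (NR % 10000 == 0) {$rule=NR} else {$rule = ""}' \
    then fill-down --only-if-blank -f rule \
    then put -S 'if ($rule=="") {$rule="0"}' \
    then put -q 'tee > $rule.".csv", $*' input.csv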
    

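    Make a copy of your CSV in a new folder and run the script there, since it writes its output files into the current directory. It produces one CSV file per block of 10000 rows, named after the value of the helper rule column (0.csv, 10000.csv, 20000.csv, ...). Each output file also contains that rule column.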
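
    To see which file a given row ends up in, the naming rule can be sketched in plain shell arithmetic. This is only an illustration of the intended bucketing, not something Miller provides: chunk_file is a hypothetical helper, and the chunk size of 10000 is taken from the script.

```shell
# Hypothetical helper, not part of Miller: prints the output file name
# that record number $1 lands in, for a chunk size of $2 (default 10000).
chunk_file() {
  nr=$1
  size=${2:-10000}
  # Integer division rounds down to the nearest multiple of the chunk size.
  echo "$(( nr / size * size )).csv"
}

chunk_file 1       # 0.csv
chunk_file 9999    # 0.csv
chunk_file 10000   # 10000.csv
chunk_file 25000   # 20000.csv
```

    Note that record 10000 already starts the second file, so 0.csv holds 9999 rows while every later full file holds 10000.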