Tags: python, file, sorting, backup, packing

Programmatically packing 2TB of various sized files into folders of 25GB? (I used python, any language will be acceptable)


I have around 4,000 files of wildly different sizes that I am trying to back up as efficiently as is reasonably possible. I know compressing them all into a giant tarball and splitting it evenly is a solution, but since I am using Blu-ray discs, if I scratch one section I risk losing the whole disc's contents.

I wrote a Python script that puts all the files (paired with their sizes) into a list. I take the biggest file first, then either add the next biggest (if the running total stays under 25GB) or move down the list until I find one that fits. I keep adding files this way until I hit the size limit, then start over with the next biggest remaining file.

This works reasonably well, but it gets really ragged at the end and I will end up using about 15 more discs than the theoretical minimum.
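For reference, that theoretical minimum is just the total size divided by the disc capacity, rounded up. A quick sketch of the lower-bound calculation (the 25 GB constant matches the script in the question; the sizes here are made up for illustration):

```python
import math

def min_discs(sizes, capacity=25_000_000_000):
    """Lower bound on discs needed: total bytes / capacity, rounded up.

    No packing can do better than this, so it is the baseline to
    compare any grouping algorithm against.
    """
    return math.ceil(sum(sizes) / capacity)

# Example: 30 GB + 20 GB + 10 GB = 60 GB -> ceil(2.4) = 3 discs minimum
print(min_discs([30e9, 20e9, 10e9]))  # -> 3
```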

Anyone have a better method I'm not aware of? (This seems like a Google coding interview question lol.) I don't need it to be perfect, I just want to make sure I'm not doing this stupidly before I run through this giant stack of non-cheap BD-Rs. I've included my code for reference.

#!/usr/bin/env python3
import os
import sys

# Max size per disc
pmax = 25000000000

# Walk dir
walkdir = os.path.realpath(sys.argv[1])
flist = []
for root, directories, filenames in os.walk( walkdir ):
    for filename in filenames:
        f = os.path.join(root,filename)
        fsize = os.path.getsize(f)
        flist.append((fsize,f))
flist.sort()
flist.reverse()

running_total = 0
running_list = []
groups = []

while flist:
    # Collect files that don't fit this disc instead of calling
    # flist.remove() mid-iteration, which skips elements.
    remaining = []
    for fsize, f in flist:
        if running_total + fsize < pmax:
            running_list.append(f)
            running_total += fsize
        else:
            remaining.append((fsize, f))
    if not running_list:
        # A single file exceeds pmax; give it its own disc so the
        # loop can't spin forever appending empty groups.
        fsize, f = remaining.pop(0)
        running_list.append(f)
    flist = remaining
    groups.append(running_list)  # was groups.append(l): NameError
    running_list = []
    running_total = 0
print('This will take {} discs.'.format(len(groups)))
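One standard improvement on this kind of greedy grouping is first-fit decreasing: instead of filling one disc completely before opening the next, keep all partially filled discs open and drop each file (largest first) into the first disc that still has room. It is guaranteed to use at most 11/9 × OPT + 1 bins. A minimal sketch, working on sizes only rather than the `(size, path)` pairs from the script above:

```python
def first_fit_decreasing(sizes, capacity):
    """Greedy first-fit decreasing bin packing.

    Sorts sizes largest-first and places each one into the first
    open bin with enough remaining room, opening a new bin only
    when nothing fits.
    """
    bins = []  # each entry: [remaining_capacity, [sizes placed]]
    for size in sorted(sizes, reverse=True):
        for b in bins:
            if b[0] >= size:
                b[0] -= size
                b[1].append(size)
                break
        else:
            bins.append([capacity - size, [size]])
    return [contents for _, contents in bins]

sizes = [9, 8, 7, 6, 5, 4, 3, 2, 1]
groups = first_fit_decreasing(sizes, 10)
print(len(groups))  # -> 5, which matches the lower bound ceil(45/10)
```

To adapt this to the script above, carry the `(size, path)` tuples through and compare on `pair[0]`; the structure is otherwise identical.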

Solution

  • I just brute-forced it by going down the list, adding smaller and smaller files until I was out of files or the disc filled up, then repeating. By "mathematically required" I just meant total size of all files / 25GB = ideal # of discs. I can post the resulting arrays on