Tags: python, json, out-of-memory, generator

Python: make a list generator JSON serializable


How can I concatenate a list of JSON files into one huge JSON array? I have 5,000 files and 550,000 list items in total.

My first try was to use jq, but it looks like jq -s is not optimized for a large input:

jq -s -r '[.[][]]' *.js 

This command works, but it takes far too long to complete, and I would really like to solve this in Python.

Here is my current code:

import json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(listGenerator(), f)

I'm getting:

TypeError: <generator object listGenerator at 0x7f94dc2eb3c0> is not JSON serializable

Any attempt to load all of the files into RAM triggers Linux's OOM killer. Do you have any ideas?


Solution

  • You should derive from list and override the __iter__ method.

    import json
    
    def gen():
        yield 20
        yield 30
        yield 40
    
    class StreamArray(list):
        def __iter__(self):
            return gen()
    
        # Report a non-zero length so json's encoder does not treat
        # the (actually empty) underlying list as empty and emit "[]"
        # without ever calling __iter__.
        def __len__(self):
            return 1
    
    a = [1,2,3]
    b = StreamArray()
    
    print(json.dumps([1,a,b]))
    

    The result is [1, [1, 2, 3], [20, 30, 40]].
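
    Applied back to the question, here is a minimal sketch of the same trick. It extends the answer's StreamArray with a constructor (my addition, so it can wrap an arbitrary generator) and assumes each input file's array fits in memory on its own, with only the combined output being too large:

    import json

    class StreamArray(list):
        def __init__(self, generator):
            self.generator = generator

        def __iter__(self):
            return self.generator

        def __len__(self):
            # Non-zero so the encoder does not emit "[]" for the
            # apparently empty underlying list.
            return 1

    def concatFiles(outName, inFileNames):
        def listGenerator():
            # Holds at most one input file's items in memory at a time.
            for inName in inFileNames:
                with open(inName, 'r') as f:
                    for item in json.load(f):
                        yield item

        with open(outName, 'w') as f:
            json.dump(StreamArray(listGenerator()), f)

    Note that a generator-backed StreamArray can be iterated only once, and that json.dump writes its output in chunks via the encoder's iterencode, so the full array is never built in memory.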