
Create a 350,000-column CSV file by merging smaller CSV files



I have about 350,000 one-column CSV files, each essentially 200–2000 numbers printed one under another. The numbers are formatted like this: "-1.32%" (no quotes). I want to merge the files to create a monster of a CSV file where each file becomes a separate column. The merged file will have at most 2000 rows (each column may have a different length) and 350,000 columns.

I thought of doing it with MySQL, but there is a 30000-column limit. An awk or sed script could do the job, but I don't know them all that well and I am afraid it would take a very long time. I could use a server if the solution requires it. Any suggestions?


Solution

  • This Python script will do what you want:

    #!/usr/bin/env python2
    
    import sys
    import codecs
    
    # Open every file named on the command line; remember how many fields
    # each one contributes so an exhausted file can still emit the right
    # number of empty columns (needed when merging intermediate files).
    fhs = [codecs.open(filename, 'r', 'utf-8') for filename in sys.argv[1:]]
    widths = [1] * len(fhs)
    live = len(fhs)          # number of files that still have lines left
    
    while live > 0:
        row = []
        for i, fh in enumerate(fhs):
            line = fh.readline() if fh is not None else ''
            if fh is not None and not line:
                # This file just ran dry: close it and stop counting it.
                fh.close()
                fhs[i] = None
                live -= 1
            if line:
                widths[i] = line.count(',') + 1
                row.append(line.rstrip())
            else:
                # Pad with empty fields so later columns stay aligned.
                row.append(',' * (widths[i] - 1))
        if live > 0:         # skip the final all-empty row
            sys.stdout.write(','.join(row) + '\n')
    

    Call it with all the CSV files you want to merge as arguments and it will print the merged file to stdout.
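
    For instance, assuming the script is saved as merge.py (a placeholder name) and given two hypothetical inputs a.csv (containing -1.32% and 0.50%) and b.csv (containing 2.10%):

    python2 merge.py a.csv b.csv > merged.csv

    merged.csv would then contain the shorter column padded with an empty field:

    -1.32%,2.10%
    0.50%,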

    Note that you can't merge all the files in one go: first, you can't pass 350,000 file names as arguments to a single process, and second, a process can typically only open around 1024 files at once (the default limit on most Linux systems).

    So you'll have to do it in several passes, i.e. merge files 1–1000, then 1001–2000, and so on. After that you should be able to merge the 350 resulting intermediate files in a single final pass.

    Or you could write a wrapper script which uses os.listdir() to get the names of all the files and calls this script several times.
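
    A minimal sketch of such a wrapper, assuming the merge script above is saved as merge.py and the one-column files live in a directory named columns/ (both names, and the batch size, are placeholders):

    #!/usr/bin/env python2
    
    import os
    import subprocess
    
    SRC_DIR = 'columns'   # assumed directory holding the one-column files
    BATCH = 1000          # stays below the argv and open-file limits
    
    files = sorted(os.listdir(SRC_DIR))
    intermediates = []
    
    # Pass 1: merge the inputs in batches of 1000 columns each.
    for i in range(0, len(files), BATCH):
        batch = [os.path.join(SRC_DIR, f) for f in files[i:i + BATCH]]
        out_name = 'merged_%04d.csv' % (i // BATCH)
        with open(out_name, 'w') as out:
            subprocess.check_call(['python2', 'merge.py'] + batch, stdout=out)
        intermediates.append(out_name)
    
    # Pass 2: merge the ~350 intermediate files into the final CSV.
    with open('monster.csv', 'w') as out:
        subprocess.check_call(['python2', 'merge.py'] + intermediates, stdout=out)

    Because the merge script pads an exhausted file out to its full field width, the second pass keeps the columns aligned even if the intermediates end up with different row counts.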