
Concatenate all files whose names are values under the same dictionary key


I have a dictionary that groups different patterns:

dico_cluster = {'cluster_1': ['CUX2', 'CUX1'], 'cluster_2': ['RFX3', 'RFX2'], 'cluster_3': ['REST']}

Then I have files in a folder:

"/path/to/test/files/CUX1.txt"
"/path/to/test/files/CUX2.txt"
"/path/to/test/files/RFX3.txt"
"/path/to/test/files/RFX2.txt"
"/path/to/test/files/REST.txt"
"/path/to/test/files/ZEB.txt"
"/path/to/test/files/TEST.txt"

I'm trying to concatenate the files that belong to the same cluster. The output file name should be the cluster's pattern names joined by an underscore "_" (for example, CUX2_CUX1.txt for cluster_1).

I tried this:

import glob
import shutil

filenames = glob.glob('/path/to/test/files/*.txt')

for clee in dico_cluster.keys():
    fname='_'.join(dico_cluster[clee])
    outfilename ='/path/to/test/outfiles/'+ fname + ".txt"
    for file in filenames:
        tf_file=file.split('/')[-1].split('.')[0]
        if tf_file in dico_cluster[clee]:
            with open(outfilename, 'wb') as outfile:
                for filename in filenames:
                    if filename == outfilename:
                        # don't want to copy the output into the output
                        continue
                    with open(filename, 'rb') as readfile:
                        shutil.copyfileobj(readfile, outfile) 

But it's not working: it just concatenates all the files. I want to concatenate only the files that are in the same cluster.


Solution

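  • I would recommend using the os module (os.path) to handle the paths; it's easier than splitting on "/" by hand.

    If I understood your problem correctly, the inner loop over filenames in your code copies every file into the output as soon as one name matches, which is why everything ends up concatenated. I would instead collect the content of the matching files for each cluster first, and then write it out once.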

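    import glob
    import os

    filenames = glob.glob("/path/to/test/files/*.txt")

    for clee in dico_cluster:
        # Drop duplicates but keep the original order, so the output
        # file name is the same on every run (a set has no stable order)
        my_clusters = list(dict.fromkeys(dico_cluster[clee]))
        fname = "_".join(my_clusters)
        outfilename = os.path.join("/path/to/test/outfiles", fname + ".txt")
        data = []
        for file in filenames:
            # "/path/to/test/files/CUX1.txt" -> "CUX1"
            tf_file = os.path.splitext(os.path.basename(file))[0]
            if tf_file in my_clusters:
                with open(file, "rb") as f1:
                    data.extend(f1.readlines())

        # Write once per cluster, after all matching files have been read
        with open(outfilename, "wb") as _output_file:
            _output_file.writelines(data)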
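
If the files are large, the same idea can be written without holding everything in memory. The following is only a sketch under a few assumptions: the function name `concat_clusters` and its parameters are made up for illustration, every input file is assumed to be named `<pattern>.txt`, and patterns without a file on disk are silently skipped. It builds each path from the cluster members directly (so the files are concatenated in the order given in the dictionary) and streams them with `shutil.copyfileobj`, as in the question.

```python
import os
import shutil


def concat_clusters(clusters, in_dir, out_dir):
    """Write one concatenated file per cluster and return the output paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for members in clusters.values():
        # Build each member's path directly instead of scanning a glob result
        sources = [os.path.join(in_dir, name + ".txt") for name in members]
        # Assumption: members without a file on disk are simply skipped
        sources = [path for path in sources if os.path.isfile(path)]
        if not sources:
            continue  # nothing to concatenate, so no output file is created
        out_path = os.path.join(out_dir, "_".join(members) + ".txt")
        # Open the output once per cluster so earlier copies are not truncated
        with open(out_path, "wb") as outfile:
            for path in sources:
                with open(path, "rb") as readfile:
                    shutil.copyfileobj(readfile, outfile)
        written.append(out_path)
    return written
```

With the data from the question, it could be called as `concat_clusters(dico_cluster, "/path/to/test/files", "/path/to/test/outfiles")`. Files such as ZEB.txt and TEST.txt are never touched, because they are not listed in any cluster.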