python · csv · export-to-csv

Add columns to a CSV file from existing CSV files


I have multiple CSV files, each representing a single feature. I need to create a new CSV file that takes each of these individual files and maps it to a column in the new file.

I have tried something like this:

import csv

file = open("House_2_X.csv", 'a')

writer = csv.writer(file)
with open("House_2_TS copy/channel_1(TimeStamp).csv") as f:
    for line in f:
        
        with open("House_2_BOTH copy/channel_2(BOTH).csv", 'r') as f1:
            for line1 in f1:
                a = line1[12:]
                line1 = a[:4]
                writer.writerow([line, line1])

file.close()

but this does not work. Any suggestions?


Solution

  • It's best to use pandas. Since you did not show an example file, I'm assuming each file has a single column; something like this could work.

    import pandas as pd

    # read each single-column CSV as a Series, then place them side by side
    cols = [pd.read_csv(f, squeeze=True) for f in file_paths]
    df = pd.concat(cols, axis=1)
    df.to_csv("newfile.csv")
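
    Note: the squeeze keyword was deprecated and later removed in pandas 2.0. If you are on a recent version, a minimal equivalent sketch (still assuming file_paths is a list of your single-column CSV paths):

    import pandas as pd

    # pandas >= 2.0: read each file as a one-column DataFrame, then squeeze it to a Series
    cols = [pd.read_csv(f).squeeze("columns") for f in file_paths]
    df = pd.concat(cols, axis=1)
    df.to_csv("newfile.csv")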
    

    EDIT:

    Since there has been a comment about large files, here's an approach with dask.

    With 3 CSVs that look like this:

    $ for f in *.csv; do cat $f; echo; done
    abc
    def
    ghi
    
    ABC
    DEF
    GHI
    
    a_b_c
    d_e_f
    g_h_i
    

    The following:

    import dask.dataframe as dd
    
    df = dd.read_csv("*.csv", header=None).squeeze()
    df = dd.concat(list(df.partitions), axis=1) # see note below
    df.compute() # this should return you to pandas
    

    Gives me this:

         0    0      0
    0  abc  ABC  a_b_c
    1  def  DEF  d_e_f
    2  ghi  GHI  g_h_i
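
    If you also want the merged result written out as a single CSV, a small follow-up sketch (assuming the computed result fits in memory on one machine):

    # bring the dask result back into pandas and write it out as one file
    merged = df.compute()
    merged.to_csv("newfile.csv", index=False, header=False)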
    

    You can scale this to a cluster. However, if you cannot deploy a cluster then, as I said in the comments, you should look at a solution using Apache Arrow.

    NOTE: the concat step will generate a UserWarning. To make sure you can safely ignore it, you have to do some groundwork and ensure all your CSVs align (as in, they have the same number of rows). If that's not the case, you need to pre-process them before merging; a quick row-count sanity check is sketched below.
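
    For example, a minimal sanity check (just counting data rows per file; file_paths is assumed to be the same list of paths as above):

    # every input CSV should have the same number of rows before merging columns
    row_counts = {}
    for path in file_paths:
        with open(path) as fh:
            row_counts[path] = sum(1 for _ in fh)
    assert len(set(row_counts.values())) == 1, f"row counts differ: {row_counts}"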

    Any help related to either the pre-processing or the Arrow should be separate questions.

    EDIT2:

    Here's an alternate pandas-based approach that is more amenable to large files:

    import pandas as pd

    # one chunked reader per input file; each yields Series of <size> rows at a time
    chunked_readers = [
        pd.read_csv(f, chunksize=<size>, header=None, squeeze=True)
        for f in file_paths
    ]
    # zip the readers so the i-th chunk of every file is merged side by side
    for i, dfs in enumerate(zip(*chunked_readers)):
        df = pd.concat(dfs, axis=1)
        df.to_csv(f"merged_{i}.csv")
    

    Choose the chunksize as per your resource constraints. To be sure not to run out of memory, you could also call gc.collect() within the loop.
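
    If you want a single output file rather than one file per chunk, a small variation on the loop above (reusing the same chunked_readers; a sketch, adjust to your data) is to append each merged chunk to one CSV and free memory as you go:

    import gc

    # append every merged chunk to a single output file instead of writing merged_{i}.csv files
    for dfs in zip(*chunked_readers):
        df = pd.concat(dfs, axis=1)
        df.to_csv("newfile.csv", mode="a", header=False, index=False)
        del df
        gc.collect()  # free memory between chunks, as suggested above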