I have multiple CSV files, each representing a single feature. I need to create a new CSV file that takes each of these individual files and maps it to its own column in the new file.
I have tried something like this:
import csv

file = open("House_2_X.csv", 'a')
writer = csv.writer(file)
with open("House_2_TS copy/channel_1(TimeStamp).csv") as f:
    for line in f:
        with open("House_2_BOTH copy/channel_2(BOTH).csv", 'r') as f1:
            for line1 in f1:
                a = line1[12:]
                line1 = a[:4]
                writer.writerow([line, line1])
file.close()
but this does not work. Any suggestions?
It's best to use pandas. Since you did not show an example file, assuming each file has a single column, something like this could work.
import pandas as pd

# read each single-column file as a Series and line them up side by side
cols = [pd.read_csv(f, header=None).squeeze("columns") for f in file_paths]
df = pd.concat(cols, axis=1)
df.to_csv("newfile.csv", index=False, header=False)
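Here, file_paths is just a list of your input files. A minimal sketch of building it (the glob pattern is an assumption based on the directory names in your question; adjust it to your layout):

import glob

# hypothetical pattern matching the per-feature CSVs shown in the question
file_paths = sorted(glob.glob("House_2_*/channel_*.csv"))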
EDIT:
Since there has been a comment about large files, here's an approach with dask.
With 3 CSVs that look like this:
$ for f in *.csv; do cat $f; echo; done
abc
def
ghi

ABC
DEF
GHI

a_b_c
d_e_f
g_h_i
The following:
import dask.dataframe as dd
df = dd.read_csv("*.csv", header=None).squeeze()
df = dd.concat(list(df.partitions), axis=1) # see note below
df.compute() # this should return you to pandas
Gives me this:
     0    0      0
0  abc  ABC  a_b_c
1  def  DEF  d_e_f
2  ghi  GHI  g_h_i
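If you want to end up with a file rather than an in-memory frame, here's a minimal follow-up sketch (the column names are made up; use whatever fits your features):

result = df.compute()
# give the duplicate "0" columns distinct, hypothetical names
result.columns = [f"feature_{i}" for i in range(result.shape[1])]
result.to_csv("newfile.csv", index=False)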
You can scale this to a cluster. However, if you cannot deploy a cluster then, as I mentioned in my comment below, you should look at a solution using Apache Arrow.
NOTE: the concat step will generate a UserWarning. To be sure you can safely ignore it, you have to do some groundwork and ensure all your CSVs align (as in, have the same number of rows). If that's not the case, you need to pre-process them before merging.
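A quick sanity check along those lines could look like this (plain Python, assuming the same file_paths list as above):

# count rows per file and flag any mismatch before merging
with open(file_paths[0]) as f:
    expected = sum(1 for _ in f)
for path in file_paths[1:]:
    with open(path) as f:
        rows = sum(1 for _ in f)
    if rows != expected:
        print(f"{path}: {rows} rows, expected {expected}")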
Any help related to either the pre-processing or the Arrow should be separate questions.
EDIT2:
Here's an alternative pandas-based approach that is more amenable to large files:
chunked_readers = [
    pd.read_csv(f, chunksize=<size>, header=None)
    for f in file_paths
]
for i, dfs in enumerate(zip(*chunked_readers)):
    # each dfs is a tuple with one chunk per input file
    df = pd.concat(dfs, axis=1)
    df.to_csv(f"merged_{i}.csv", index=False, header=False)
Choose the chunksize as per your resource constraints. To be sure not to run out of memory, you could also call gc.collect() within the loop.
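For example, a sketch of the same loop with explicit cleanup between chunks:

import gc

for i, dfs in enumerate(zip(*chunked_readers)):
    df = pd.concat(dfs, axis=1)
    df.to_csv(f"merged_{i}.csv", index=False, header=False)
    del df  # drop the reference before collecting
    gc.collect()  # reclaim memory between chunks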