I have multiple CSV files, each representing a single feature. I need to create a new CSV file that takes each of these individual files and maps it to its own column in the new file.
I have tried something like this:
import csv

file = open("House_2_X.csv", 'a')
writer = csv.writer(file)
with open("House_2_TS copy/channel_1(TimeStamp).csv") as f:
    for line in f:
        with open("House_2_BOTH copy/channel_2(BOTH).csv", 'r') as f1:
            for line1 in f1:
                a = line1[12:]
                line1 = a[:4]
                writer.writerow([line, line1])
file.close()
but this does not work. Any suggestions?
It's best to use pandas. Since you did not show an example file, assuming each file has a single column, something like this could work.
import pandas as pd

# read each single-column file as a Series and line them up side by side
cols = [pd.read_csv(f, header=None).squeeze("columns") for f in file_paths]
df = pd.concat(cols, axis=1)
df.to_csv("newfile.csv", index=False, header=False)
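Here, file_paths is just a list of your input files. A minimal sketch of building it (the glob pattern is an assumption based on the directory names in your question; adjust it to your layout):

import glob

# hypothetical pattern matching the per-feature CSVs shown in the question
file_paths = sorted(glob.glob("House_2_*/channel_*.csv"))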
EDIT:
Since there has been a comment about large files, here's an approach with dask.
With 3 CSVs that look like this:
$ for f in *.csv; do cat $f; echo; done
abc
def
ghi

ABC
DEF
GHI

a_b_c
d_e_f
g_h_i
The following:
import dask.dataframe as dd
df = dd.read_csv("*.csv", header=None).squeeze()
df = dd.concat(list(df.partitions), axis=1) # see note below
df.compute() # this should return you to pandas
Gives me this:
     0    0      0
0  abc  ABC  a_b_c
1  def  DEF  d_e_f
2  ghi  GHI  g_h_i
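If you want to end up with a file rather than an in-memory frame, here's a minimal follow-up sketch (the column names are made up; use whatever fits your features):

result = df.compute()
# give the duplicate "0" columns distinct, hypothetical names
result.columns = [f"feature_{i}" for i in range(result.shape[1])]
result.to_csv("newfile.csv", index=False)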
You can scale this to a cluster. However, if you cannot deploy a cluster then, as I mentioned in my comment below, you should look at a solution using Apache Arrow.
NOTE: the concat step will generate a UserWarning. To be sure you can safely ignore it, you have to do some groundwork and ensure all your CSVs align (as in, have the same number of rows). If that's not the case, you need to pre-process them before merging.
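A quick sanity check along those lines could look like this (plain Python, assuming the same file_paths list as above):

# count rows per file and flag any mismatch before merging
with open(file_paths[0]) as f:
    expected = sum(1 for _ in f)
for path in file_paths[1:]:
    with open(path) as f:
        rows = sum(1 for _ in f)
    if rows != expected:
        print(f"{path}: {rows} rows, expected {expected}")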
Any help related to either the pre-processing or the Arrow should be separate questions.
EDIT2:
Here's an alternative pandas-based approach that is more amenable to large files:
chunked_readers = [
    pd.read_csv(f, chunksize=<size>, header=None)
    for f in file_paths
]
for i, dfs in enumerate(zip(*chunked_readers)):
    # each dfs is a tuple with one chunk per input file
    df = pd.concat(dfs, axis=1)
    df.to_csv(f"merged_{i}.csv", index=False, header=False)
Choose the chunksize as per your resource constraints. To be sure not to run out of memory, you could also call gc.collect() within the loop.
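For example, a sketch of the same loop with explicit cleanup between chunks:

import gc

for i, dfs in enumerate(zip(*chunked_readers)):
    df = pd.concat(dfs, axis=1)
    df.to_csv(f"merged_{i}.csv", index=False, header=False)
    del df  # drop the reference before collecting
    gc.collect()  # reclaim memory between chunks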