Search code examples
pythonpandasrows

Python - Combining Columns in a CSV file


I'm trying to create code that will take data form certain columns in a CSV file and combine them into a new CSV file. I was directed to use Pandas but I'm not sure if I'm even on the right track. I'm fairly new to Python so prepare yourselves for potentially awful code.

I want to use data.csv:

Customer_ID,Date,Time,OtherColumns,A,B,C,Cost
1003,January,2:00,Stuff,1,5,2,519
1003,January,2:00,Stuff,1,3,2,530
1003,January,2:00,Stuff,1,3,2,530
1004,Feb,2:00,Stuff,1,1,0,699

and create a new CSV that looks like this:

Customer_ID,ABC
1003,152
1003,132
1003,132
1004,110

What I have so far is:

import csv
import pandas as pd

df = pd.read_csv('test.csv', delimiter = ',')
custID = df.customer_ID
choiceA = df.A
choiceB = df.B
choiceC = df.C

ofile  = open('answer.csv', "wb")
writer = csv.writer(ofile, delimiter = ',')
writer.writerow(custID + choiceA + choiceB + choiceC)

Unfortunately all that does is add each row together, then create a CSV of each row summed together as one row. My true end goal would be to find the most occurring values in columns A-C and combine each customer into the same row, using the most occurring values. I'm awful at explaining. I'd want something that takes data.csv and makes this:

Customer_ID,ABC
1003,132
1004,110

Solution

  • You can sum the columns you're interested in, if their type is string:

    In [11]: df = pd.read_csv('data.csv', index_col='Customer_ID')
    
    In [12]: df
    Out[12]:
                    Date  Time OtherColumns  A  B  C  Cost
    Customer_ID
    1003         January  2:00        Stuff  1  5  2   519
    1003         January  2:00        Stuff  1  3  2   530
    1003         January  2:00        Stuff  1  3  2   530
    1004             Feb  2:00        Stuff  1  1  0   699
    
    In [13]: res = df[list('ABC')].astype(str).sum(1)  # cols = list('ABC')
    
    In [14]: res
    Out[14]:
    Customer_ID
    1003           152
    1003           132
    1003           132
    1004           110
    dtype: float64
    

    To get the csv, you can first use to_frame to add the desired column name:

    In [15]: res.to_frame(name='ABC')  # ''.join(cols)
    Out[15]:
                 ABC
    Customer_ID
    1003         152
    1003         132
    1003         132
    1004         110
    
    In [16]: res.to_frame(name='ABC').to_csv('new.csv')