Search code examples
pythonfilecsvrecords

How to search for the common records of one csv file to another csv file without using for loop


What I want to do is that I have a big .csv file. I want to break down this big csv file into many small files based on the common records in BB column that also contain 1 in the HH column, and all uncommon records that contain 0 in HH column.

As a result, all files will contain common records in BB column that contain 1 in the HH column, and all uncommon records that has no records in BB column and contain 0 in the HH column. The file name should be based on the common record of column 2 (BB). Please take a look below for the scenario.

bigFile.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      53     115       10     3       3     186     1
   12      53     01e       23     3       2             1
   12      53     0ce       65     1       3             1
   12      53     173       73     4       2             1
   12      59     115       0      3       3     186     1
   12      59     125       0      3       3     186     1
   12      61     01e       23     3       2             1
   12      61     b6f       0      1       1             1
   12      61     b1b       0      6       5     960     1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

My expected results files woud be as follows:

53.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      53     115       10     3       3     186     1
   12      53     01e       23     3       2             1
   12      53     0ce       65     1       3             1
   12      53     173       73     4       2             1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

59.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      59     115       0      3       3     186     1
   12      59     125       0      3       3     186     1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

61.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      61     01e       23     3       2             1
   12      61     b6f       0      1       1             1
   12      61     b1b       0      6       5    960      1
   12             68b       95     3       5    334      0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4    215      0
   12             a10       36     5       1             0      

Solution

  • The following script will produce your expected outcome. It filters the file based on the contents of the first row:

    import csv
    
    with open('123.csv', 'r') as f_input, open('output.csv', 'w', newline='') as f_output:
        csv_input = csv.reader(f_input)
        csv_output = csv.writer(f_output)
    
        header = next(csv_input)
        first_row = next(csv_input)
    
        csv_output.writerow(header)
        csv_output.writerow(first_row)
    
        for row in csv_input:
            if row[0] == first_row[0]:
                if row[1] == first_row[1] or row[7] == '0':
                    csv_output.writerow(row)
    

    To do this operation on all CSV files in the same folder, the following could be used:

    import glob             
    import csv
    import os
    
    file_list = glob.glob('*.csv')
    
    output_folder = 'temp'  # make sure this folder exists
    
    for file in file_list:
        file_name = os.path.join(output_folder, os.path.basename(file))
        with open(file, 'r') as f_input, open(file_name, 'w', newline='') as f_output:
            csv_input = csv.reader(f_input)
            csv_output = csv.writer(f_output)
    
            header = next(csv_input)
            first_row = next(csv_input)
    
            csv_output.writerow(header)
            csv_output.writerow(first_row)
    
            for row in csv_input:
                if row[0] == first_row[0]:
                    if row[1] == first_row[1] or row[7] == '0':
                        csv_output.writerow(row)
    

    This will write all of the output files to a temp folder.