Search code examples
csvselectrecords

Select records based on the specific index string value and then remove subsequent fields by python


I have a .csv file named file01.csv that contains many records. Some records are required and some are not. I find that the required records has a string variable “Mi”, but it is not exist into the unnecessary records. So, I want to select the required records based on string value “Mi” in the field for every records.

Finally I want to delete the subsequent fields of each record from the field that contains value “Mi”. Any suggestion and advice is appreciated.

Optional:

  1. In addition, I want to delete the first column.

  2. Split column BB into two column named as a_id, and c_id. Separate the value by _ (underscore) and left side will go to a_id, and right side will go to c_id.

My fileO.csv is as follows:

   AA      BB       CC       DD     EE      FF    GG
   1       1_1.csv  (=0      =10"   27"     =57   "Mi"
   0.97    0.9      0.8      NaN    0.9     od    0.2
   2       1_3.csv  (=0      =10"   27"     "Mi"  0.5
   0.97    0.5      0.8      NaN    0.9     od    0.4
   3       1_6.csv  (=0      =10"   "Mi"     =53  cnt
   0.97    0.9      0.8      NaN    0.9     od    0.6
   4       2_6.csv  No Bi    000    000     000   000
   5       2_8.csv  No Bi    000    000     000   000
   6       6_9.csv  less     000    000     000   000
   7       7_9.csv  s(=0     =26"   =46"    "Mi"  121     

My Expected results files (outFile.csv):

a_id    b_id    CC    DD    EE    FF    GG             
1       1       0     10    27    57              
1       3       0     10    27
1       6       0     10 
7       9       0     26    46  

Solution

  • The following approach should work fine using Python csv module:

    import csv
    import re
    import string
    
    output_header = ['a_id', 'b_id', 'CC', 'DD', 'EE', 'FF', 'GG']
    
    sanitise_table = string.maketrans("","")
    nodigits_table = sanitise_table.translate(sanitise_table, string.digits)
    
    def find_mi(row):
        for index, col in enumerate(row):
            if col.find('Mi') != -1:
                return index
        return -1
    
    def sanitise_cell(cell):
        return cell.translate(sanitise_table, nodigits_table)       # Keep digits
    
    f_input = open('fileO.csv', 'rb')
    f_output = open('outFile.csv', 'wb')
    
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    
    input_header = next(f_input)
    csv_output.writerow(output_header)
    
    for row in csv_input:
        #print '%2d  %s' % (len(row), row)  
        if len(row) >= 2:
            bb = re.match(r'(\d+)__(\d+).0\.csv', row[1])
            mi = find_mi(row)
    
            if bb and mi != -1:
                row[:] = row[:mi] + [''] * (len(row) - mi)
                row[:] = [sanitise_cell(col) for col in row]
                row[0] = bb.group(1)
                row[1] = bb.group(2)
                csv_output.writerow(row)
    
    f_input.close()
    f_output.close()
    

    outFile.csv will contain the following:

    a_id,b_id,CC,DD,EE,FF,GG
    1,1,0,10,27,57,
    1,3,0,10,27,,
    1,6,0,10,,,
    7,9,0,26,46,,
    

    Tested using Python 2.6.6