Search code examples
pythoncsv

Compare two CSV files and output lines not found in first file into a third file in Python


I have two CSV files which are file1.csv, and file2.csv. Both have multiple columns as follows:

a. file1.csv

username,user id,access hash,name,group,group id
SreyTey1998,963229606,7854138709318981862,Smaradey Chan,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
Srey_Tey_1,2079816779,6921382059939144796,Srey tey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
sreytey123,5316691604,668712126044928206,Phat SreyTey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
Sreytey168,5455045488,-714912998136226691,Vong Soksreytey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
SreyTey99,5653783510,-2575791274366210473,Oun Tey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
sreytey1919,5819100400,3174041461521242292,Tey Tey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
Sreytey6666,6001252515,1586106578669001327,Srey Tey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
SreyTey7777,6026179841,5596849859821333867,Srey Tey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
Ahh_Nak86,5637888996,-1267155033181296023,Yìì Ng,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461

b. file2.csv

username,user id,access hash,name,group,group id
SreyTey1998,963229606,7854138709318981862,Smaradey Chan,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
Srey_Tey_1,2079816779,6921382059939144796,Srey tey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
sreytey123,5316691604,668712126044928206,Phat SreyTey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
Sreytey168,5455045488,-714912998136226691,Vong Soksreytey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
SreyTey99,5653783510,-2575791274366210473,Oun Tey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
sreytey1919,5819100400,3174041461521242292,Tey Tey,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
AhhLyn1213,808888756,2482753619838480608,Ly-លី🌈â¤ï¸,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
ahhly09,938983724,-8302570306911018211,方塔莉,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
ahh_vong,873218908,1743989214734522713,Mek Sreyvong,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
ahhnitaccd,5420585351,-6331445989210603589,NITA CCD,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461

c. The output file as file2-nodups.csv, should be:

username,user id,access hash,name,group,group id
AhhLyn1213,808888756,2482753619838480608,Ly-លី🌈â¤ï¸,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
ahhly09,938983724,-8302570306911018211,方塔莉,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
ahh_vong,873218908,1743989214734522713,Mek Sreyvong,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
ahhnitaccd,5420585351,-6331445989210603589,NITA CCD,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461

I have tried the following codes:

with open('file1.csv', 'r', encoding="utf8") as t1:
    fileone = t1.readlines()
with open('file2.csv', 'r', encoding="utf8") as t2:
    filetwo = t2.readlines()

# scans through the two files and writes differences to new csv
with open('file2-nodups.csv', 'w', encoding="utf8") as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)
                

The above does not work - because the output file (file2-nodups.csv) has the same content as file2.csv

Very appreciate any advice.

I would like to confirm that the above codes are Working for the following data:

    file1.csv:
    username,user id,access hash,name,group,group id
    asgie2,19933,29kd982hi4hh6h443,
    47uuha,491920,kdsagku5kkajgjag,
    james_sing,4002899,4asg37yragdh300asgdlk,
    joe_naro,4989222,hgjhe84jkaglagjj,
    48700245,hlvkiiwej8njnnrk320kc,

    file2.csv:
    username,user id,access hash,name,group,group id
    misschue,87340a,hgeikka83llagea,
    james_sing,4002899,4asg37yragdh300asgdlk,
    michell22,4883140,cn2ukkfhiigakgd3yhg,

    Output file:
    username,user id,access hash,name,group,group id
    misschue,87340a,hgeikka83llagea,
    michell22,4883140,cn2ukkfhiigakgd3yhg,

But they dont work for data with more columns above. Very appreciate any advice.

UPDATE: Python does not read strange characters correctly; that is why all suggested codes dont work. Those characters appear in the fourth column, for example,

ž‰ ឆាំងឣីុ 💗11 ""H1""",Zisy Ly បោះដុំនឹងលក់រាយកម្មងáŸ

After replacing them with English or numbers or space characters, the codes work as expected.

The best solution to this is compare one column (second) which is unique to every line, and if its different, output those lines in file2.csv to the third file.

Here are my files:

    a. file1.csv

    username,user id,access hash,name,group,group id
    Manithbun,892557579,-8406919653747059917,Manith Bun.,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,1128046905,-541590019988886083,M.r rithy phone shope ♥ï¸â™¥ï¸â™¥ï¸,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    PET_KIMHEANG,1209296259,-3836506661281845754,áž–áŸáž គឹមហ៊ាង៚,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,1215736274,-3314364090375867636,"ម៉ាញ ឆាំងឣីុ 💗11 ""H1""",Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,1296059051,289341865810496004,កូន ប៉ាប៉ា ក្រុមឡានបែនសំបូស្នáŸáž áŸ,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    oun_Rorng,1330317717,-1815409784220647876,Mss Rorng,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,1373051922,4933577293541799006,Soroth Phe,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    Veasnalove9999,1721055020,-7406583441966710751,Veasna Love,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461


    b. file2.csv

    username,user id,access hash,name,group,group id
    Manithbun,892557579,-8406919653747059917,Manith Bun.,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,1128046905,-541590019988886083,M.r rithy phone shope ♥ï¸â™¥ï¸â™¥ï¸,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    PET_KIMHEANG,1209296259,-3836506661281845754,áž–áŸáž គឹមហ៊ាង៚,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,1215736274,-3314364090375867636,"ម៉ាញ ឆាំងឣីុ 💗11 ""H1""",Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,1296059051,289341865810496004,កូន ប៉ាប៉ា ក្រុមឡានបែនសំបូស្នáŸáž áŸ,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    oun_Rorng,1330317717,-1815409784220647876,Mss Rorng,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,1373051922,4933577293541799006,Soroth Phe,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    Veasnalove9999,1721055020,-7406583441966710751,Veasna Love,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    Thithseda,5185517950,-6420126539020087321,Thith Seda,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,5708325040,7960869157828131294,Tong Meang,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    takoma_05,396559485,4653963679316835279,V P,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461


    c. nodups.csv (the lines in file2.csv which are not found in file1.csv)

    Thithseda,5185517950,-6420126539020087321,Thith Seda,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    ,5708325040,7960869157828131294,Tong Meang,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461
    takoma_05,396559485,4653963679316835279,V P,Zisy Ly បោះដុំនឹងលក់រាយកម្មង់ពីរោងចក្រផ្ទាល់📥លáŸážážœáŸážšáž›áž»áž™0967699965,1806798461

I have tried the following codes, but nothing is written to the output file:

      import csv
      reader1 = csv.reader(open("file1.csv", 'r', encoding="utf8"))
      row1 = next(reader1)
      reader2 = csv.reader(open("file2.csv", 'r', encoding="utf8"))
      row2 = next (reader2)

      with open("nodups.csv", 'w', encoding="utf8") as f:
           for row in reader2:
                if (row1[1] != row2[1]):
                     f.write(row)

Please advise how to make that work. Very appreciate any help.


Solution

  • RESOLVED: Compare two CSV files based on second column (user id) and print out the differences found in second file to third file:

        with open('file1.csv', 'r', encoding="utf8") as check_file:
            check_set = set([row.split(',')[1] for row in check_file])
    
        with open('file2.csv', 'r', encoding="utf8") as filetwo, open('file3.csv', 'w', encoding="utf8") as out_file:
            for line in filetwo:
                if line.split(',')[1] not in check_set:
                    out_file.write(line)
    

    Or use the followings with headers:

        with open('file1.csv', 'r', encoding="utf8") as f1:
            fileone = set([row.split(',')[1] for row in f1]) # Retrieve data in 2nd column for all rows in first CSV file
    
        with open('file2.csv', 'r', encoding="utf8") as filetwo, open('nodups.csv', 'w', encoding="utf8") as out_file:
            out_file.write('username,user id,access hash,name,group,group id' + '\n')
            for line in filetwo:   # Scan each line in second CSV file
                if line.split(',')[1] not in fileone: # If the data of that line is not found in first file, write it into third file
                    out_file.write(line)