Search code examples
pythoncsvdata-analysis

Python: write CSV with matching data from multiple CSV files


I need to match data from multiple CSV files. I wrote a script that works with simple data, but it is slow to analize 4000 rows. I have already tried with set(a) & set(b) but can't return the matching data from every file. The output file must have the matching data from all files.

Script:

    for file_1 in files:
        with open(file_1, 'rt') as f1,open(saved_file, 'w') as f3:
            reader1 = csv.reader(f1, delimiter = ';')
            writer = csv.writer(f3, delimiter = ';', lineterminator = '\n')

            for row1 in reader1:
                for row1 in reader1:
                    for file_2 in files:
                        with open(file_2, 'rt') as f2:
                            reader2 = csv.reader(f2, delimiter = ';')
                            if row1 in reader2:
                                writer.writerow(row1)

The data I am trying to match looks like this:

File_1:

May 22, 2017;12,615.50;12,650.50;12,665.00;12,567.00;-;-0.18%
May 19, 2017;12,638.69;12,612.30;12,658.55;12,596.72;121.95M;0.39%
May 18, 2017;12,590.06;12,608.19;12,634.26;12,489.95;123.48M;-0.33%
May 17, 2017;12,631.61;12,700.12;12,786.89;12,587.45;108.95M;-1.35%
May 15, 2017;12,807.04;12,824.05;12,832.29;12,729.49;87.08M;0.29%

File_2:

May 22, 2017;1.1238;1.1200;1.1265;1.1160;0.28%
May 19, 2017;1.1207;1.1100;1.1214;1.1094;0.94%
May 17, 2017;1.1159;1.1082;1.1163;1.1078;0.69%
May 16, 2017;1.1082;1.0975;1.1098;1.0971;0.97%
May 15, 2017;1.0975;1.0924;1.0991;1.0920;0.40%

Output: saved_file_1:

May 22, 2017;12,615.50;12,650.50;12,665.00;12,567.00;-;-0.18%
May 19, 2017;12,638.69;12,612.30;12,658.55;12,596.72;121.95M;0.39%
May 17, 2017;12,631.61;12,700.12;12,786.89;12,587.45;108.95M;-1.35%
May 15, 2017;12,807.04;12,824.05;12,832.29;12,729.49;87.08M;0.29%

saved_file_2:

May 22, 2017;1.1238;1.1200;1.1265;1.1160;0.28%
May 19, 2017;1.1207;1.1100;1.1214;1.1094;0.94%
May 17, 2017;1.1159;1.1082;1.1163;1.1078;0.69%
May 15, 2017;1.0975;1.0924;1.0991;1.0920;0.40%

Solution

  • Without resorting to Pandas you can do this, which may be along the lines you were thinking of.

    First go through each file, just collecting dates in separate lists. Then find the intersection of these lists, treated as sets. Now go through each of the files again, writing out each record whose date is in the intersection set.

    def get_dates(one_file):
        one_file_dates = []
        with open(one_file) as the_file:
            for line in the_file.readlines():
                the_date = line[:line.find(';')]
                if not the_date in one_file_dates:
                    one_file_dates.append(the_date) 
        return one_file_dates
    
    common_dates = set(get_dates('file_1.csv')).intersection(set(get_dates('file_2.csv')))
    
    print ('*** processing file_1')
    with open('file_1.csv') as the_file:
        for line in the_file.readlines():
            if line[:line.find(';')] in common_dates:
                print(line.strip())
    
    print ('*** processing file_2')
    with open('file_2.csv') as the_file:
        for line in the_file.readlines():
            if line[:line.find(';')] in common_dates:
                print(line.strip())
    

    Result:

    *** processing file_1
    May 22, 2017;12,615.50;12,650.50;12,665.00;12,567.00;-;-0.18%
    May 19, 2017;12,638.69;12,612.30;12,658.55;12,596.72;121.95M;0.39%
    May 17, 2017;12,631.61;12,700.12;12,786.89;12,587.45;108.95M;-1.35%
    May 15, 2017;12,807.04;12,824.05;12,832.29;12,729.49;87.08M;0.29%
    *** processing file_2
    May 22, 2017;1.1238;1.1200;1.1265;1.1160;0.28%
    May 19, 2017;1.1207;1.1100;1.1214;1.1094;0.94%
    May 17, 2017;1.1159;1.1082;1.1163;1.1078;0.69%
    May 15, 2017;1.0975;1.0924;1.0991;1.0920;0.40%
    

    Edit: new code in response to comment.

    def get_dates(one_file):
        one_file_dates = []
        with open(one_file) as the_file:
            for line in the_file.readlines():
                the_date = line[:line.find(';')]
                if not the_date in one_file_dates:
                    one_file_dates.append(the_date) 
        return one_file_dates
    
    file_list = ['file_1.csv', 'file_2.csv'] # add more file names here
    
    common_dates = set(get_dates(file_list[0]))
    for file in file_list[1:]:
        common_dates = common_dates.intersection(set(get_dates(file)))
    
    for file in file_list:
        print ('*** processing ', file)
        with open(file) as the_file:
            for line in the_file.readlines():
                if line[:line.find(';')] in common_dates:
                    print(line.strip())