Search code examples
pythonjsoncsvmergeword-embedding

Merging Two CSV Files Based on Many Criteria


I have two CSV files. They have a same column but each of rows in the same column are not unique, like this:

gpo_full.csv:
    Date           hearing_sub_type   topic      Specific_Date
    January,1997   Oversight          weather    January 12,1997
    June,2000      General            life       June 5,2000
    January,1997   General            forest     January 1,1997
    April,2001     Oversight          people     NaN 
    June,2000      Oversight          depressed  June 6,2000
    January,1997   General            weather    January 1,1997
    June,2000      Oversight          depressed  June 5,2000

CAP_cols.csv:
    majortopic   id     Chamber   topic           Date           Specific_Date
    21           79846  1         many forest     January,1997   January 1,1997
    4            79847  2         emotion         June,2000      June 6,2000
    13           79848  1         NaN             May,2001       NaN
    7            79849  2         good life       June,2000      June 5,2000
    21           79850  1         good weather    January,1997   January 1,1997
    25           79851  1         rain & cloudy   January,1997   January 12,1997
    6            79852  2         sad & depressed June,2000      June 5,2000

I want to use three criteria to match these data: Specific_Date, Date, and topic.
First, I want to use "Date" column to group these data. Next, I try to use "Specific_Date" column to narrow down the scope since some data are lost in this column. Finally, I want to use "topic" column by similar words like word-embedding to make sure which rows in gpo_full can be corresponding with a unique row in CAP_cols.
I have tried to use "Date" column to group the data and merge them into JSON file. However, I am trapped in achieving the next step to narrow down the scope by specific date and topic.
My thought for this output would be like:

{
"Date": "January,1997",
"Specific_Date": "January 12,1997"
"Topic": {"GPO": "weather", "CAP": "rain & cloudy"}
"GPO": {
    "hearing_sub_type": "Oversight",
    and other columns
}
"CAP": {
    "majortopic": "25",
    "id": "79851",
    "Chamber": "1"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 1,1997"
"Topic": {"GPO": "forest", "CAP": "many forest"}
"GPO": {
    "hearing_sub_type": "General",
    and other columns
}
"CAP": {
    "majortopic": "21",
    "id": "79846",
    "Chamber": "1"
}
and similar for others}

I have been thinking for three days and have no idea. Any idea for achieving this would be very helpful! Greatly appreciated!


Solution

  • There's a couple of issues with the matching of topics, so you'll need to expand the match_topic() method I used, but I added some logic to see what didn't match at the end.

    The results variable contains a list of dict which you can easily save as a JSON file.

    Check the inline comments for the reasoning of the logic I used.

    Sidenote:

    I would slightly restructure the JSON if I were you. Putting the topic as a key/value pair under the GPO and CAP keys makes more sense to me than having a Topic key with a separate GPO and CAP key/value pair...

    import csv
    from pprint import pprint
    import json
    
    
    # load gpo_full.csv into a list of dict using
    # csv.DictReader & list comprehension
    with open("path/to/file/gpo_full.csv") as infile:
        gpo_full = [item for item in csv.DictReader(infile)]
    
    
    # do the same for CAP_cols.csv
    with open("path/to/file/CAP_cols.csv") as infile:
        cap_cols = [item for item in csv.DictReader(infile)]
    
    
    def match_topic(gpo_topic: str, cap_topic: str) -> bool:
        """We need a function as some of the mapping is not simple
    
        Args:
            gpo_topic (str): gpo topic
            cap_topic (str): CAP topic
    
        Returns:
            bool: True if topics match
        """
        # this one is simple
        if gpo_topic in cap_topic:
            return True
        # you need to repeat the below conditional check
        # for each custom topic matching
        elif gpo_topic == "weather" and cap_topic == "rain & cloudy":
            return True 
        # example secondary topic matching
        elif gpo_topic == "foo" and cap_topic == "bar":
            return True 
        # finally return false for no matches
        return False
    
    
    # we need this later
    gpo_length = len(gpo_full)
    results = []
    cap_left_over = []
    # do the actual mapping
    # this could've been done above, but I separated it intentionally
    for cap in cap_cols:
        found = False
        # first find the corresponding gpo
        for index, gpo in enumerate(gpo_full):
            if (
                gpo["Specific_Date"] == cap["Specific_Date"] # check by date
                and match_topic(gpo["topic"], cap["topic"]) # check if topics match
            ):
                results.append({
                    "Date": gpo["Date"],
                    "Specific_Date": gpo["Specific_Date"],
                    "Topic": {
                        "GPO": gpo["topic"],
                        "CAP": cap["topic"]
                    },
                    "GPO": {
                        "hearing_sub_type": gpo["hearing_sub_type"]
                    },
                    "CAP": {
                        "majortopic": cap["majortopic"],
                        "id": cap["id"],
                        "Chamber": cap["Chamber"]
                    }
                })
                # pop & break to remove the gpo item
                # this is so you're left over with a list of
                # gpo items that didn't match
                # it also speeds up further matches
                gpo_full.pop(index)
                found = True
                break
        # this is to check if there's stuff left over
        if not found:
            cap_left_over.append(cap)
    
    
    with open('path/to/file/combined_json.json', 'w') as outfile:
        json.dump(results, outfile, indent=4)
    
    
    pprint(results)
    print(f'\nLength:\n  Results: {len(results)}\n  CAP: {len(cap)}\n  GPO: {gpo_length}')
    print('\nLeftover GPO:')
    pprint(gpo_full)
    print('\nLeftover CAP:')
    pprint(cap_left_over)
    

    OUTPUT
    I've removed the pprint(results) from the output, see the JSON further down

    Length:
      Results: 5
      CAP: 6
      GPO: 7
    
    Leftover GPO:
    [{'Date': 'April,2001',
      'Specific_Date': 'NaN ',
      'hearing_sub_type': 'Oversight',
      'topic': 'people'},
     {'Date': 'June,2000',
      'Specific_Date': 'June 6,2000',
      'hearing_sub_type': 'Oversight',
      'topic': 'depressed'}]
    
    Leftover CAP:
    [{'Chamber': '2',
      'Date': 'June,2000',
      'Specific_Date': 'June 6,2000',
      'id': '79847',
      'majortopic': '4',
      'topic': 'emotion'},
     {'Chamber': '1',
      'Date': 'May,2001',
      'Specific_Date': 'NaN',
      'id': '79848',
      'majortopic': '13',
      'topic': 'NaN'}]
    

    path/to/file/gpo_full.csv

    Date,hearing_sub_type,topic,Specific_Date
    "January,1997",Oversight,weather,"January 12,1997"
    "June,2000",General,life,"June 5,2000"
    "January,1997",General,forest,"January 1,1997"
    "April,2001",Oversight,people,NaN 
    "June,2000",Oversight,depressed,"June 6,2000"
    "January,1997",General,weather,"January 1,1997"
    "June,2000",Oversight,depressed,"June 5,2000"
    

    path/to/file/CAP_cols.csv

    majortopic,id,Chamber,topic,Date,Specific_Date
    21,79846,1,many forest,"January,1997","January 1,1997"
    4,79847,2,emotion,"June,2000","June 6,2000"
    13,79848,1,NaN,"May,2001","NaN"
    7,79849,2,good life,"June,2000","June 5,2000"
    21,79850,1,good weather,"January,1997","January 1,1997"
    25,79851,1,rain & cloudy,"January,1997","January 12,1997"
    6,79852,2,sad & depressed,"June,2000","June 5,2000"
    

    path/to/file/combined_json.json

    [
        {
            "Date": "January,1997",
            "Specific_Date": "January 1,1997",
            "Topic": {
                "GPO": "forest",
                "CAP": "many forest"
            },
            "GPO": {
                "hearing_sub_type": "General"
            },
            "CAP": {
                "majortopic": "21",
                "id": "79846",
                "Chamber": "1"
            }
        },
        {
            "Date": "June,2000",
            "Specific_Date": "June 5,2000",
            "Topic": {
                "GPO": "life",
                "CAP": "good life"
            },
            "GPO": {
                "hearing_sub_type": "General"
            },
            "CAP": {
                "majortopic": "7",
                "id": "79849",
                "Chamber": "2"
            }
        },
        {
            "Date": "January,1997",
            "Specific_Date": "January 1,1997",
            "Topic": {
                "GPO": "weather",
                "CAP": "good weather"
            },
            "GPO": {
                "hearing_sub_type": "General"
            },
            "CAP": {
                "majortopic": "21",
                "id": "79850",
                "Chamber": "1"
            }
        },
        {
            "Date": "January,1997",
            "Specific_Date": "January 12,1997",
            "Topic": {
                "GPO": "weather",
                "CAP": "rain & cloudy"
            },
            "GPO": {
                "hearing_sub_type": "Oversight"
            },
            "CAP": {
                "majortopic": "25",
                "id": "79851",
                "Chamber": "1"
            }
        },
        {
            "Date": "June,2000",
            "Specific_Date": "June 5,2000",
            "Topic": {
                "GPO": "depressed",
                "CAP": "sad & depressed"
            },
            "GPO": {
                "hearing_sub_type": "Oversight"
            },
            "CAP": {
                "majortopic": "6",
                "id": "79852",
                "Chamber": "2"
            }
        }
    ]