I have two CSV files. They have a same column but each of rows in the same column are not unique, like this:
gpo_full.csv:
Date hearing_sub_type topic Specific_Date
January,1997 Oversight weather January 12,1997
June,2000 General life June 5,2000
January,1997 General forest January 1,1997
April,2001 Oversight people NaN
June,2000 Oversight depressed June 6,2000
January,1997 General weather January 1,1997
June,2000 Oversight depressed June 5,2000
CAP_cols.csv:
majortopic id Chamber topic Date Specific_Date
21 79846 1 many forest January,1997 January 1,1997
4 79847 2 emotion June,2000 June 6,2000
13 79848 1 NaN May,2001 NaN
7 79849 2 good life June,2000 June 5,2000
21 79850 1 good weather January,1997 January 1,1997
25 79851 1 rain & cloudy January,1997 January 12,1997
6 79852 2 sad & depressed June,2000 June 5,2000
I want to use three criteria to match these data: Specific_Date, Date, and topic.
First, I want to use "Date" column to group these data. Next, I try to use "Specific_Date" column to narrow down the scope since some data are lost in this column. Finally, I want to use "topic" column by similar words like word-embedding to make sure which rows in gpo_full can be corresponding with a unique row in CAP_cols.
I have tried to use "Date" column to group the data and merge them into JSON file. However, I am trapped in achieving the next step to narrow down the scope by specific date and topic.
My thought for this output would be like:
{
"Date": "January,1997",
"Specific_Date": "January 12,1997"
"Topic": {"GPO": "weather", "CAP": "rain & cloudy"}
"GPO": {
"hearing_sub_type": "Oversight",
and other columns
}
"CAP": {
"majortopic": "25",
"id": "79851",
"Chamber": "1"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 1,1997"
"Topic": {"GPO": "forest", "CAP": "many forest"}
"GPO": {
"hearing_sub_type": "General",
and other columns
}
"CAP": {
"majortopic": "21",
"id": "79846",
"Chamber": "1"
}
and similar for others}
I have been thinking for three days and have no idea. Any idea for achieving this would be very helpful! Greatly appreciated!
There's a couple of issues with the matching of topics, so you'll need to expand the match_topic()
method I used, but I added some logic to see what didn't match at the end.
The results
variable contains a list of dict which you can easily save as a JSON file.
Check the inline comments for the reasoning of the logic I used.
Sidenote:
I would slightly restructure the JSON if I were you. Putting the topic
as a key/value pair under the GPO
and CAP
keys makes more sense to me than having a Topic
key with a separate GPO
and CAP
key/value pair...
import csv
from pprint import pprint
import json
# load gpo_full.csv into a list of dict using
# csv.DictReader & list comprehension
with open("path/to/file/gpo_full.csv") as infile:
gpo_full = [item for item in csv.DictReader(infile)]
# do the same for CAP_cols.csv
with open("path/to/file/CAP_cols.csv") as infile:
cap_cols = [item for item in csv.DictReader(infile)]
def match_topic(gpo_topic: str, cap_topic: str) -> bool:
"""We need a function as some of the mapping is not simple
Args:
gpo_topic (str): gpo topic
cap_topic (str): CAP topic
Returns:
bool: True if topics match
"""
# this one is simple
if gpo_topic in cap_topic:
return True
# you need to repeat the below conditional check
# for each custom topic matching
elif gpo_topic == "weather" and cap_topic == "rain & cloudy":
return True
# example secondary topic matching
elif gpo_topic == "foo" and cap_topic == "bar":
return True
# finally return false for no matches
return False
# we need this later
gpo_length = len(gpo_full)
results = []
cap_left_over = []
# do the actual mapping
# this could've been done above, but I separated it intentionally
for cap in cap_cols:
found = False
# first find the corresponding gpo
for index, gpo in enumerate(gpo_full):
if (
gpo["Specific_Date"] == cap["Specific_Date"] # check by date
and match_topic(gpo["topic"], cap["topic"]) # check if topics match
):
results.append({
"Date": gpo["Date"],
"Specific_Date": gpo["Specific_Date"],
"Topic": {
"GPO": gpo["topic"],
"CAP": cap["topic"]
},
"GPO": {
"hearing_sub_type": gpo["hearing_sub_type"]
},
"CAP": {
"majortopic": cap["majortopic"],
"id": cap["id"],
"Chamber": cap["Chamber"]
}
})
# pop & break to remove the gpo item
# this is so you're left over with a list of
# gpo items that didn't match
# it also speeds up further matches
gpo_full.pop(index)
found = True
break
# this is to check if there's stuff left over
if not found:
cap_left_over.append(cap)
with open('path/to/file/combined_json.json', 'w') as outfile:
json.dump(results, outfile, indent=4)
pprint(results)
print(f'\nLength:\n Results: {len(results)}\n CAP: {len(cap)}\n GPO: {gpo_length}')
print('\nLeftover GPO:')
pprint(gpo_full)
print('\nLeftover CAP:')
pprint(cap_left_over)
OUTPUT
I've removed the pprint(results)
from the output, see the JSON further down
Length:
Results: 5
CAP: 6
GPO: 7
Leftover GPO:
[{'Date': 'April,2001',
'Specific_Date': 'NaN ',
'hearing_sub_type': 'Oversight',
'topic': 'people'},
{'Date': 'June,2000',
'Specific_Date': 'June 6,2000',
'hearing_sub_type': 'Oversight',
'topic': 'depressed'}]
Leftover CAP:
[{'Chamber': '2',
'Date': 'June,2000',
'Specific_Date': 'June 6,2000',
'id': '79847',
'majortopic': '4',
'topic': 'emotion'},
{'Chamber': '1',
'Date': 'May,2001',
'Specific_Date': 'NaN',
'id': '79848',
'majortopic': '13',
'topic': 'NaN'}]
path/to/file/gpo_full.csv
Date,hearing_sub_type,topic,Specific_Date
"January,1997",Oversight,weather,"January 12,1997"
"June,2000",General,life,"June 5,2000"
"January,1997",General,forest,"January 1,1997"
"April,2001",Oversight,people,NaN
"June,2000",Oversight,depressed,"June 6,2000"
"January,1997",General,weather,"January 1,1997"
"June,2000",Oversight,depressed,"June 5,2000"
path/to/file/CAP_cols.csv
majortopic,id,Chamber,topic,Date,Specific_Date
21,79846,1,many forest,"January,1997","January 1,1997"
4,79847,2,emotion,"June,2000","June 6,2000"
13,79848,1,NaN,"May,2001","NaN"
7,79849,2,good life,"June,2000","June 5,2000"
21,79850,1,good weather,"January,1997","January 1,1997"
25,79851,1,rain & cloudy,"January,1997","January 12,1997"
6,79852,2,sad & depressed,"June,2000","June 5,2000"
path/to/file/combined_json.json
[
{
"Date": "January,1997",
"Specific_Date": "January 1,1997",
"Topic": {
"GPO": "forest",
"CAP": "many forest"
},
"GPO": {
"hearing_sub_type": "General"
},
"CAP": {
"majortopic": "21",
"id": "79846",
"Chamber": "1"
}
},
{
"Date": "June,2000",
"Specific_Date": "June 5,2000",
"Topic": {
"GPO": "life",
"CAP": "good life"
},
"GPO": {
"hearing_sub_type": "General"
},
"CAP": {
"majortopic": "7",
"id": "79849",
"Chamber": "2"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 1,1997",
"Topic": {
"GPO": "weather",
"CAP": "good weather"
},
"GPO": {
"hearing_sub_type": "General"
},
"CAP": {
"majortopic": "21",
"id": "79850",
"Chamber": "1"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 12,1997",
"Topic": {
"GPO": "weather",
"CAP": "rain & cloudy"
},
"GPO": {
"hearing_sub_type": "Oversight"
},
"CAP": {
"majortopic": "25",
"id": "79851",
"Chamber": "1"
}
},
{
"Date": "June,2000",
"Specific_Date": "June 5,2000",
"Topic": {
"GPO": "depressed",
"CAP": "sad & depressed"
},
"GPO": {
"hearing_sub_type": "Oversight"
},
"CAP": {
"majortopic": "6",
"id": "79852",
"Chamber": "2"
}
}
]