I have a bunch of large datasets.
DS_1 (contains all unique IDs and names):
ID Name
1 Apple
2 Banana
3 Cherry
DS_2:
ID Observation
1 Apple detail
1 Apple detail
1 Apple detail
2 Banana detail
2 Banana detail
3 Cherry detail
3 Cherry detail
3 Cherry detail
DS_3:
ID Observation
2 Banana detail
2 Banana detail
3 Cherry detail
I'm looking to create a new dataset that shows frequency counts for each ID across the datasets (and, lastly, calculates Total_Obs). The output would look something like this:
ID Name DS_2 DS_3 Total_Obs
1 Apple 3 0 3
2 Banana 2 2 4
3 Cherry 3 1 4
The datasets are fairly large. Is there a more efficient way to do this than concatenating the datasets and doing a frequency table, or creating a bunch of sorted frequency tables and then merging by ID across all the datasets?
You can do something like the below:
select t1.id as id
     , t1.name as name
     , coalesce(DS_2_obs, 0) as DS_2_obs
     , coalesce(DS_3_obs, 0) as DS_3_obs
     , coalesce(DS_2_obs, 0) + coalesce(DS_3_obs, 0) as Total_Obs
from DS_1 t1
/* per-ID counts from DS_2 */
left join (select id, count(1) as DS_2_obs from DS_2 group by id) t2
    on t1.id = t2.id
/* per-ID counts from DS_3 */
left join (select id, count(1) as DS_3_obs from DS_3 group by id) t3
    on t1.id = t3.id;
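The left joins keep every ID from DS_1 even when it has no rows in one of the observation tables (e.g. Apple in DS_3), and coalesce turns the resulting NULL counts into 0 so Total_Obs still adds up correctly.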
Also, you should always tag which database you are using.
If the above SQL takes a lot of time, then instead of using t2 and t3 as inline queries you can consider creating aggregate observation tables that hold the frequency/counts per id, with an index on id. That way, when you join the observation aggregates to the primary table, the join can be faster because it can use the indexes. A rough sketch of that idea is below.
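This is only a minimal sketch of that approach: the table and index names (DS_2_agg, DS_3_agg, ix_ds2_agg_id, ix_ds3_agg_id) are illustrative, and the exact create table ... as select syntax varies by database (some use select ... into instead).

-- Pre-aggregate each observation table once (illustrative names).
create table DS_2_agg as
select id, count(1) as DS_2_obs from DS_2 group by id;

create table DS_3_agg as
select id, count(1) as DS_3_obs from DS_3 group by id;

-- Index the join key so lookups against DS_1 can use it.
create index ix_ds2_agg_id on DS_2_agg (id);
create index ix_ds3_agg_id on DS_3_agg (id);

-- Join the small aggregate tables instead of the raw detail tables.
select t1.id
     , t1.name
     , coalesce(t2.DS_2_obs, 0) as DS_2_obs
     , coalesce(t3.DS_3_obs, 0) as DS_3_obs
     , coalesce(t2.DS_2_obs, 0) + coalesce(t3.DS_3_obs, 0) as Total_Obs
from DS_1 t1
left join DS_2_agg t2 on t1.id = t2.id
left join DS_3_agg t3 on t1.id = t3.id;

You pay the aggregation cost once when building the aggregate tables, and subsequent joins only touch one row per id instead of scanning all the detail rows.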