I have three tables and I want to join them together one by one by some common columns and the end results always have massive amount of null values in them although the data is present (in the second left join operation) in both tables.
Here is the code I ran:
SELECT T1.INNERCODE AS CODE,
T2.DATESTAMP,
T1.REPORTDATE,
t1.FUNDINNERCODE,
T1.MARKETVALUE,
T1.SHARESHOLDING,
T1.RATIOINNV,
T3.UNIT_NAV,
T3.VALUE,
T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE AS CALENDAR_DATE,
CLOSEST_DATETIME_BEFORE AS DATESTAMP
FROM APP_RQA.T_CALENDAR_SS
WHERE EXCHMARKET = '83') T2
ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN (SELECT CODE,
DATESTAMP,
UNIT_NAV,
VALUE,
RETURNRATE
FROM APP_RQA.T_FUND_NAV_SS) T3
ON T1.FUNDINNERCODE = T3.CODE
AND T2.DATESTAMP = T3.DATESTAMP
Below is the end results. As you can see, column unit_nav, returnrate and value which are from the last table are mostly empty
For example, when fundinnercode = 4082, datestamp = 2010-06-30 (highlighted in the picture), unit_nav, returnrate and value are all null. but this row is present in the table t_fund_nav_ss as shown below:
what is even weirder is when I use the where clause to specifically locate the row in the end results, the three missing columns appear to have data in them
SELECT T1.INNERCODE AS CODE,
T2.DATESTAMP,
T1.REPORTDATE,
t1.FUNDINNERCODE,
T1.MARKETVALUE,
T1.SHARESHOLDING,
T1.RATIOINNV,
T3.UNIT_NAV,
T3.VALUE,
T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
LEFT JOIN (SELECT CALENDAR_DATE AS CALENDAR_DATE,
CLOSEST_DATETIME_BEFORE AS DATESTAMP
FROM APP_RQA.T_CALENDAR_SS
WHERE EXCHMARKET = '83') T2
ON T1.REPORTDATE = T2.CALENDAR_DATE
LEFT JOIN (SELECT CODE,
DATESTAMP,
UNIT_NAV,
VALUE,
RETURNRATE
FROM APP_RQA.T_FUND_NAV_SS) T3
ON T1.FUNDINNERCODE = T3.CODE
AND T2.DATESTAMP = T3.DATESTAMP
WHERE T1.FUNDINNERCODE = 4082
AND T2.DATESTAMP = '2010-06-30'
I could not wrap my head around it and any help or suggestions are greatly appreciated.
Thanks to everybody for commenting. I stumbled across a big data engineer today and asked him about the issue. For those who are curious, it turns out that the issue is about the size of T3 table. T1 and T2 have around 20-30 k rows whereas T3 has 30 mil. And the left join operation causes data loss. So I first filter T3 to around 2 mil rows and results look normal now.