Search code examples
sqlhadoophivehql

Left Join in Hive Produces Peculiar Results


I have three tables and I want to join them together one by one by some common columns and the end results always have massive amount of null values in them although the data is present (in the second left join operation) in both tables.

Here is the code I ran:

SELECT T1.INNERCODE                     AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       t1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
         LEFT JOIN (SELECT CALENDAR_DATE           AS CALENDAR_DATE,
                           CLOSEST_DATETIME_BEFORE AS DATESTAMP
                    FROM APP_RQA.T_CALENDAR_SS
                    WHERE EXCHMARKET = '83') T2
                   ON T1.REPORTDATE = T2.CALENDAR_DATE
         LEFT JOIN (SELECT CODE,
                           DATESTAMP,
                           UNIT_NAV,
                           VALUE,
                           RETURNRATE
                    FROM APP_RQA.T_FUND_NAV_SS) T3
                   ON T1.FUNDINNERCODE = T3.CODE
                       AND T2.DATESTAMP = T3.DATESTAMP

Below is the end results. As you can see, column unit_nav, returnrate and value which are from the last table are mostly empty enter image description here

For example, when fundinnercode = 4082, datestamp = 2010-06-30 (highlighted in the picture), unit_nav, returnrate and value are all null. but this row is present in the table t_fund_nav_ss as shown below: enter image description here

what is even weirder is when I use the where clause to specifically locate the row in the end results, the three missing columns appear to have data in them

SELECT T1.INNERCODE                     AS CODE,
       T2.DATESTAMP,
       T1.REPORTDATE,
       t1.FUNDINNERCODE,
       T1.MARKETVALUE,
       T1.SHARESHOLDING,
       T1.RATIOINNV,
       T3.UNIT_NAV,
       T3.VALUE,
       T3.RETURNRATE
FROM JUYUAN_DB.MF_FUNDPORTIFOLIODETAIL T1
         LEFT JOIN (SELECT CALENDAR_DATE           AS CALENDAR_DATE,
                           CLOSEST_DATETIME_BEFORE AS DATESTAMP
                    FROM APP_RQA.T_CALENDAR_SS
                    WHERE EXCHMARKET = '83') T2
                   ON T1.REPORTDATE = T2.CALENDAR_DATE
         LEFT JOIN (SELECT CODE,
                           DATESTAMP,
                           UNIT_NAV,
                           VALUE,
                           RETURNRATE
                    FROM APP_RQA.T_FUND_NAV_SS) T3
                   ON T1.FUNDINNERCODE = T3.CODE
                       AND T2.DATESTAMP = T3.DATESTAMP
WHERE T1.FUNDINNERCODE = 4082
  AND T2.DATESTAMP = '2010-06-30'

enter image description here

I could not wrap my head around it and any help or suggestions are greatly appreciated.


Solution

  • Thanks to everybody for commenting. I stumbled across a big data engineer today and asked him about the issue. For those who are curious, it turns out that the issue is about the size of T3 table. T1 and T2 have around 20-30 k rows whereas T3 has 30 mil. And the left join operation causes data loss. So I first filter T3 to around 2 mil rows and results look normal now.