Search code examples
hivehiveqlhadoop2

hive counting same field many times


I need to count how many students are from which college but when i am using below query

select college ,COUNT(*) from students group by college ;

i am getting this result

enter image description here

result showing different counts for same colleges what should i do here so i can get the proper count of colleges


Solution

  • It seems like you have many different names of the same college, like these

    JIIT
    "JIIT
    jiit
    

    Try to normalize them (convert to uppercase and remove '"'), so it will be the same JIIT after group by:

     select case when college = 'BSA' then 'BSA College of Technology'
            --add other cases
            else --rule for others
                trim(upper(regexp_replace(college,'"',''))) 
             end as college 
           ,COUNT(*)                                    as cnt 
       from students 
      group by 
            case when college = 'BSA' then 'BSA College of Technology'
            --add other cases
            else --rule for others
                trim(upper(regexp_replace(college,'"',''))) 
             end --the same sentence should be in group by, or use subquery instead
    ;
    

    Apply case to convert more complex strings like MJP ROHILKHAND and M J P ROHILKHAND to the same.

    And this happens because the database is not normalized and input is not restricted for College column by College dimension.