Create an Impala text table where rows meet a condition


I am trying to create a table in Impala (SQL) that takes rows from a Parquet table. The data represents bike rides in a city. A row should be imported into the new table if its starting station code (a string, e.g. '6100') shows up more than 100 times in the first table. Here's what I have so far:

-- I am using Apache Impala via the Hue Editor

invalidate metadata;
set compression_codec=none;

invalidate metadata;
Set compression_codec=gzip;

create table bixirides_parquet (
start_date string, start_station_code string, 
end_date string, end_station_code string, 
duration_sec int, is_member int)
stored as parquet;


Insert overwrite table bixirides_parquet select * from bixirides_avro;


invalidate metadata;
set compression_codec=none;

create table impala_out stored as textfile as select start_date, start_station_code, end_date, end_station_code, duration_sec, is_member, count(start_station_code) as count
from bixirides_parquet
having count(start_station_code)>100;

For some reason the statement runs, but no rows are inserted into the new table. It should import a row into the new table if that row's starting code shows up more than 100 times in the original table. I think my select statement is written incorrectly, but I'm not sure exactly how.


Solution

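  • HAVING filters aggregated groups, not individual rows, so it cannot pick out rows by how often their start_station_code occurs. A window function attaches the per-station count to every row, and you can then filter on that count. I think the final query you want is: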

    select start_date, start_station_code, end_date, 
           end_station_code, duration_sec, is_member, cnt
    from (select bp.*,
                 count(*) over (partition by start_station_code) as cnt
          from bixirides_parquet bp
         ) bp
    where cnt > 100;
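
To get the result into the text table, the query above can be wrapped in the same create-table-as-select form used in the question. This is only a sketch: it assumes the table and column names from the question, and it assumes an empty impala_out left over from the earlier attempt can safely be dropped first.

```sql
-- Assumption: impala_out from the earlier attempt is empty and can be dropped.
drop table if exists impala_out;

-- Same windowed query as above, written out as a text table.
-- cnt is the number of rows sharing this row's start_station_code.
create table impala_out stored as textfile as
select start_date, start_station_code, end_date,
       end_station_code, duration_sec, is_member, cnt
from (select bp.*,
             count(*) over (partition by start_station_code) as cnt
      from bixirides_parquet bp
     ) bp
where cnt > 100;
```

Note that count(*) counts every row in the partition, while count(start_station_code) skips rows where the code is NULL. The two only differ if the station code can be NULL, in which case those rows form their own partition.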