Search code examples
hadoophivedata-warehousedimensional-modelingstar-schema

Creating Star Schema (Dimensions and Fact Table) in Hadoop Hive from a single csv file


Im completely new to Datawarehouse, OLAP and hive. I do have a single csv file containing training data about a online market like ebay (see Column Data). My task is to create a star schema in hive (via Data Analytics Studio, Spark, or whatever).

I scaffolded the dimension as followed:

Fact and Dimensions

How can i create those dimensions tables and especially generate a new ID for each line? My SELECT New_Guid(), listing_title, listing_subtitle, listing_type_code, start_price, buy_it_now_price, buy_it_now_listed_flag, qty_available_per_listing From auctions, which i found on another tutorial. But New_Guid() does not work at all in my Data Analytics Studio.

Thanks a lot!


Solution

  • Assuming no history in the dimensions, and leaving aside if good dimension design or not:

    1. For each required Dimension:

      • read the csv and extract relevant fields with distinct applied to temp_table
      • add a sequence number to each row using select (row_number() over()), Col1, Col2, col3, col4 from temp_table & persist to dimension_table
    2. For the fact table:

      • read the csv and extract relevant fields with applied to temp_table
      • add a sequence number to each row using select (row_number() over()), Col1, Col2, col3, col4 from temp_table and store in another temp_table_2
      • JOIN with appropriate statement from temp_table_2 to dimensuions looking up / selecting the sequence number giving temp_table_3
      • insert the keys only from temp_table_3 to fact_table