Tags: apache-spark, dataframe, pyspark, transpose

How to use incremental data to create dataframes in pyspark


I have some tables in Hive that get data appended to them incrementally.

Today I created a dataframe in PySpark from one of these Hive tables, transposed it, and saved the transposed dataframe as a new table in Hive.

Say tomorrow the Hive table receives 100 new rows of incremental data. I want to build a dataframe from only those 100 new rows, transpose it, and append the result to the existing transposed Hive table.

How can I achieve that using PySpark?
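
For context, the transpose-and-append part of the pipeline might look roughly like this; the table names and the transpose helper are placeholders for whatever you already have, and the open question is how to restrict `source_df` to only the new rows:

```python
from pyspark.sql import SparkSession

# Hive support is needed so that spark.table / saveAsTable go through the metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

source_df = spark.table("source_table")      # placeholder table name
transposed_df = my_transpose(source_df)      # your existing transpose logic

# Append to the transposed Hive table instead of overwriting it
transposed_df.write.mode("append").saveAsTable("transposed_table")
```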


Solution

  • Hive's semantics by themselves are not enough to provide this functionality. The data has to be identifiable by content, by file, or by a metadata process.

    Identifiable by content: the data contains a date or timestamp column, which lets you query the table and keep only the rows of interest (see the first sketch after this list).

    Identifiable by file: skip the Hive interface and locate the data directly on HDFS/POSIX, for example by using the Modify or Change timestamps on individual files, then load those files directly as a new dataframe (see the second sketch after this list).

    Identifiable by metadata process: In the architecture I've built, I use Apache NiFi, Kafka and Cloudera Navigator to provide metadata lineage regarding file and data ingestion. If your architecture contains metadata about ingested data, you may be able to leverage that to identify the files/records you need.
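
A sketch of the content-based approach, assuming the source table has (or can be given) an ingest-date column, here called `load_date`: filter on it before the transpose step, so each run only picks up the rows that arrived since the last run.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Assumed column: load_date, stamped on every row at ingest time.
# Keep only today's rows; a stored "last processed" date works the same way.
incremental_df = (
    spark.table("source_table")
         .filter(F.col("load_date") == F.current_date())
)
```

The resulting `incremental_df` is what you would feed into your existing transpose and append steps.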
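A sketch of the file-based variant, assuming the table's files sit under a known warehouse path on HDFS and that you record the timestamp of your previous run somewhere (`last_run_ts` here is a placeholder, and ORC is only an example storage format):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholders: the table's HDFS location and the epoch-millis time of the last load
table_path = "hdfs:///user/hive/warehouse/source_table"
last_run_ts = 1700000000000

# Reach the Hadoop FileSystem API through the JVM gateway and list the table's files
hadoop = spark.sparkContext._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark.sparkContext._jsc.hadoopConfiguration())
statuses = fs.listStatus(hadoop.fs.Path(table_path))

# Keep only files modified after the previous run
new_files = [str(s.getPath()) for s in statuses
             if s.getModificationTime() > last_run_ts]

# Load just those files as a new dataframe; the reader must match the table's format
if new_files:
    incremental_df = spark.read.orc(*new_files)
```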