palantir-foundry foundry-code-repositories

How can I create an empty dataset from a PySpark schema in Palantir Foundry?


I have a PySpark schema that describes columns and their types for a dataset (which I could write by hand, or get from an existing dataset by going to the 'Columns' tab, then 'Copy PySpark schema').

I want an empty dataset with this schema, for example to use as the backing dataset for a writeback-only ontology object. How can I create this in Foundry?


Solution

  • In a Python transform, you can use the Spark session from the transform context to create a DataFrame that has the schema but no rows, and return it as the output, for example:

    from pyspark.sql import types as T
    from transforms.api import transform_df, configure, Output
    
    SCHEMA = T.StructType([
        T.StructField('entity_name', T.StringType()),
        T.StructField('thing_value', T.IntegerType()),
        T.StructField('created_at', T.TimestampType()),
    ])
    
    
    # Given there is no work to do, save on compute by running it on the driver
    @configure(profile=["KUBERNETES_NO_EXECUTORS_SMALL"])
    @transform_df(
        Output("/some/dataset/path/or/rid"),
    )
    def compute(ctx):
        return ctx.spark_session.createDataFrame([], schema=SCHEMA)