Since the Foundry documentation is rather patchy and didn't really provide an answer: is it somehow possible to use a Foundry code repository (the python-docx library is available and used) and a dataframe as input to produce Word documents (.docx) as output? I thought that a composition of the transform input/output and the python-docx document.save() functionality might work, but I couldn't come up with a proper solution.
    from pyspark.sql import functions as F
    from transforms.api import transform, Input, Output
    import docx
    import pandas as pd

    @transform(
        output=Output("some_folder/"),
        source_df=Input(""),
    )
    def compute(source_df, output):
        df = source_df.dataframe()
        test = df.toPandas()
        document = docx.Document()
        document.add_paragraph(str(test.loc[1, 1]))
        document.save('test.docx')
        output.write_dataframe(df)
This code of course doesn't work, but I would appreciate a working solution (in an ideal world it would be possible to produce multiple .docx files as output).
Your best bet is to use Spark to distribute the file generation over the executors. The transform below generates a Word document for each row and stores it in a dataset container, which is recommended over using Compass (Foundry's folder system). Browse to the dataset to download the underlying files.
    # from pyspark.sql import functions as F
    from transforms.api import transform, Output
    import pandas as pd
    import docx

    '''
    # ====================================================== #
    # === [DISTRIBUTED GENERATION OF FILESYSTEM OUTPUTS] === #
    # ====================================================== #

    Description
    -----------
    Generates .docx files from strings contained in a source Spark dataframe
    and writes them to the output dataset's filesystem.

    Strategy
    --------
    1. Create a dummy Spark dataframe with a primary key and some text
    2. Apply a function over the RDD that opens the output filesystem and
       writes a .docx file with the contents of the text column
    '''

    @transform(
        output=Output("ri.foundry.main.dataset.7e0f243f-e97f-4e05-84b3-ebcc4b4a2a1c")
    )
    def compute(ctx, output):
        # generate dummy data
        pdf = pd.DataFrame({'name': ['docx_1', 'docx_2'], 'content': ['doc1 content', 'doc2 content']})
        data = ctx.spark_session.createDataFrame(pdf)

        # write one file per row; generate_files runs on the executors
        def strings_to_doc(df, transform_output):
            rdd = df.rdd

            def generate_files(row):
                filename = row['name'] + '.docx'
                # open a file handle directly in the output dataset's filesystem
                with transform_output.filesystem().open(filename, 'wb') as worddoc:
                    doc = docx.Document()
                    doc.add_heading(row['name'])
                    doc.add_paragraph(row['content'])
                    doc.save(worddoc)

            rdd.foreach(generate_files)

        return strings_to_doc(data, output)
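To match your original setup, the dummy dataframe can be swapped for a real dataset input. A minimal sketch of that variant, assuming the source dataset has 'name' and 'content' string columns (the Input/Output paths here are placeholders, not real ones):

    from transforms.api import transform, Input, Output
    import docx

    @transform(
        output=Output("/path/to/output_dataset"),     # placeholder
        source_df=Input("/path/to/source_dataset"),   # placeholder
    )
    def compute(source_df, output):
        df = source_df.dataframe()

        def generate_files(row):
            # assumes 'name' and 'content' string columns in the source dataset
            filename = row['name'] + '.docx'
            with output.filesystem().open(filename, 'wb') as worddoc:
                doc = docx.Document()
                doc.add_heading(row['name'])
                doc.add_paragraph(row['content'])
                doc.save(worddoc)

        df.rdd.foreach(generate_files)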
A pandas UDF will also work if you prefer the input as a pandas dataframe, but you are forced to define an output schema, which is inconvenient for your use case.
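For reference, a minimal sketch of that variant, replacing the strings_to_doc call inside the compute function above (assumes Spark 3.x for applyInPandas; the schema argument is the mandatory part, even though we only care about the file-writing side effect):

    def docs_from_pandas(pdf):
        # pdf is a pandas DataFrame holding one group of rows
        for _, row in pdf.iterrows():
            with output.filesystem().open(row['name'] + '.docx', 'wb') as worddoc:
                doc = docx.Document()
                doc.add_heading(row['name'])
                doc.add_paragraph(row['content'])
                doc.save(worddoc)
        return pdf[['name']]

    # an output schema must be declared even for a side-effecting job,
    # and an action (here count) is needed to force execution
    data.groupBy('name').applyInPandas(docs_from_pandas, schema='name string').count()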