
Output .docx document using repository in palantir foundry


Since the Foundry documentation is rather patchy and didn't really provide an answer: is it possible to use a Foundry code repository (the python-docx library is available and in use) with a dataframe as input to produce Word documents (.docx) as output? I thought that combining the transform input/output with python-docx's document.save() might work, but I couldn't come up with a proper solution.

    from pyspark.sql import functions as F
    from transforms.api import transform, transform_df, Input, Output
    import os, docx
    import pandas as pd
    
    @transform(
        output = Output("some_folder/"),
        source_df = Input(""),
    )
    
    def compute(source_df, output):
        df = source_df.dataframe()
        test = df.toPandas()
        document = docx.Document()
        document.add_paragraph(str(test.loc[1, 1]))
        document.save('test.docx')
        output.write_dataframe(df)

This code of course doesn't work, but I would appreciate a working solution (in an ideal world it would be possible to have multiple .docx files as output).


Solution

  • Your best bet is to use Spark to distribute the file generation over the executors. This transform generates a Word document for each row and stores it in a dataset, which is recommended over writing to Compass (Foundry's folder system). Browse to the dataset to download the underlying files.

    # from pyspark.sql import functions as F
    from transforms.api import transform, Output
    import pandas as pd
    import docx
    
    '''
    # ====================================================== #
    # === [DISTRIBUTED GENERATION OF FILESYSTEM OUTPUTS] === #
    # ====================================================== #
    
    Description
    -----------
    Generates docx files in the output dataset from the strings contained in a source Spark dataframe
    
    Strategy
    --------
    1. Create a dummy Spark dataframe with a name column and sample text
    2. Apply a function to each row (via rdd.foreach) that opens the output filesystem and writes a docx with the contents of the text column above
    
    '''
    
    
    @transform(
        output=Output("ri.foundry.main.dataset.7e0f243f-e97f-4e05-84b3-ebcc4b4a2a1c")
    )
    def compute(ctx, output):
        # gen data
        pdf = pd.DataFrame({'name': ['docx_1', 'docx_2'], 'content': ['doc1 content', 'doc2 content']})
        data = ctx.spark_session.createDataFrame(pdf)
    
        # function to write files
        def strings_to_doc(df, transform_output):
            rdd = df.rdd
    
            def generate_files(row):
                filename = row['name'] + '.docx'
    
                with transform_output.filesystem().open(filename, 'wb') as worddoc:
                    doc = docx.Document()
                    doc.add_heading(row['name'])
                    doc.add_paragraph(row['content'])
                    doc.save(worddoc)
    
            rdd.foreach(generate_files)
    
        return strings_to_doc(data, output)
    

    A pandas UDF will also work if you prefer the input to be a pandas dataframe, but you are then forced to define a schema, which is inconvenient for your use case.