Since the Foundry documentation is rather patchy and didn't really provide an answer: is it somehow possible to use a Foundry code repository (the python-docx library is available and used) and a dataframe as input to produce Word documents (.docx) as output? I thought that a composition of the transform input/output and the python-docx document.save() functionality might work, but I couldn't come up with a proper solution.
    from pyspark.sql import functions as F
    from transforms.api import transform, Input, Output
    import docx
    import pandas as pd

    @transform(
        output=Output("some_folder/"),
        source_df=Input(""),
    )
    def compute(source_df, output):
        df = source_df.dataframe()
        test = df.toPandas()
        document = docx.Document()
        document.add_paragraph(str(test.loc[1, 1]))
        document.save('test.docx')
        output.write_dataframe(df)
This code of course doesn't work, but I would appreciate a working solution (in an ideal world it would be possible to produce multiple .docx files as output).
Your best bet is to use Spark to distribute the file generation over the executors. The transform below generates a Word document for each row and stores it in a dataset container, which is recommended over using Compass (Foundry's folder system). Browse to the dataset to download the underlying files.
    # from pyspark.sql import functions as F
    from transforms.api import transform, Output
    import pandas as pd
    import docx

    '''
    # ====================================================== #
    # === [DISTRIBUTED GENERATION OF FILESYSTEM OUTPUTS] === #
    # ====================================================== #

    Description
    -----------
    Generates .docx files from strings contained in a source Spark dataframe
    and writes them to the output dataset's filesystem.

    Strategy
    --------
    1. Create a dummy Spark dataframe with a primary key and some text
    2. Apply a function over the RDD that opens the output filesystem and
       writes a .docx file with the contents of the text column
    '''

    @transform(
        output=Output("ri.foundry.main.dataset.7e0f243f-e97f-4e05-84b3-ebcc4b4a2a1c")
    )
    def compute(ctx, output):
        # generate dummy data
        pdf = pd.DataFrame({'name': ['docx_1', 'docx_2'], 'content': ['doc1 content', 'doc2 content']})
        data = ctx.spark_session.createDataFrame(pdf)

        # write one file per row; generate_files runs on the executors
        def strings_to_doc(df, transform_output):
            rdd = df.rdd

            def generate_files(row):
                filename = row['name'] + '.docx'
                # open a file handle directly in the output dataset's filesystem
                with transform_output.filesystem().open(filename, 'wb') as worddoc:
                    doc = docx.Document()
                    doc.add_heading(row['name'])
                    doc.add_paragraph(row['content'])
                    doc.save(worddoc)

            rdd.foreach(generate_files)

        return strings_to_doc(data, output)
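To match your original setup, the dummy dataframe can be swapped for a real dataset input. A minimal sketch of that variant, assuming the source dataset has 'name' and 'content' string columns (the Input/Output paths here are placeholders, not real ones):

    from transforms.api import transform, Input, Output
    import docx

    @transform(
        output=Output("/path/to/output_dataset"),     # placeholder
        source_df=Input("/path/to/source_dataset"),   # placeholder
    )
    def compute(source_df, output):
        df = source_df.dataframe()

        def generate_files(row):
            # assumes 'name' and 'content' string columns in the source dataset
            filename = row['name'] + '.docx'
            with output.filesystem().open(filename, 'wb') as worddoc:
                doc = docx.Document()
                doc.add_heading(row['name'])
                doc.add_paragraph(row['content'])
                doc.save(worddoc)

        df.rdd.foreach(generate_files)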
A pandas UDF will also work if you prefer the input as a pandas dataframe, but you are forced to define an output schema, which is inconvenient for your use case.
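For reference, a minimal sketch of that variant, replacing the strings_to_doc call inside the compute function above (assumes Spark 3.x for applyInPandas; the schema argument is the mandatory part, even though we only care about the file-writing side effect):

    def docs_from_pandas(pdf):
        # pdf is a pandas DataFrame holding one group of rows
        for _, row in pdf.iterrows():
            with output.filesystem().open(row['name'] + '.docx', 'wb') as worddoc:
                doc = docx.Document()
                doc.add_heading(row['name'])
                doc.add_paragraph(row['content'])
                doc.save(worddoc)
        return pdf[['name']]

    # an output schema must be declared even for a side-effecting job,
    # and an action (here count) is needed to force execution
    data.groupBy('name').applyInPandas(docs_from_pandas, schema='name string').count()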