I am using a Python script that imports CSV data, filters and sorts it, converts it to HTML and then PDF. I'd like to find a way to add page breaks after specific rows.
Assume the following example:
The data is sorted by column col1, forming "groups". I'd like to add a page break after every group (each new value in col1):
Input data (CSV table)
col1 | col2 | col3 |
---|---|---|
A | x | a |
A | y | b |
B | x | a |
B | x | a |
B | y | b |
B | x | a |
C | x | a |
C | y | b |
Output data (table in PDF)
(page breaks added, column headings repeated every page)
col1 | col2 | col3 |
---|---|---|
A | x | a |
A | y | b |
pagebreak | ||
col1 | col2 | col3 |
B | x | a |
B | x | a |
B | y | b |
B | x | a |
pagebreak | ||
col1 | col2 | col3 |
C | x | a |
C | y | b |
My workflow briefly looks as follows:

    import pandas as pd
    import pdfkit

    df = pd.read_csv(input_filename, encoding="")
    filtered_df = df[some_condition]
    filtered_df = filtered_df.sort_values(some_other_condition)
    html_table = filtered_df.to_html(index=False)
    html_string = html_head + html_something + html_table + html_something_else + html_foot
    pdfkit.from_string(html_string, outfile_name, options=pdfkit_options)

I see the following alternative approaches (but don't have a clue how to implement them yet, and I don't like any of them):
A pandas.DataFrame.to_html option I don't know about.
I don't know all the values col1 holds in advance, but it's probably easy to find them out once and reuse them for further processing.
Any help is very much appreciated.
I use the following approach (#2 of OP):
Split the single table into smaller tables ("sub tables"), convert them to HTML one after another, and join the resulting HTML tables with page breaks in between.
Get unique values of the group_column (in my example that would be col1).
Iterate over the groups and filter the pandas.DataFrame to select only rows that match the group.
Add page break after every group/table (except after the last one).

    html_page_break = '<div style="page-break-after: always;"></div>'
    groups = filtered_df[group_column].unique()
    html_string = html_head + html_note
    for i, group in enumerate(groups):
        filtered_df_subtable = filtered_df[(filtered_df[group_column] == group)]
        html_table = filtered_df_subtable.to_html(index=False)
        html_string += html_table
        if i < len(groups) - 1:
            html_string += html_page_break
    html_string += html_foot

The conversion from HTML to PDF is untouched.
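For anyone who wants to try the idea without the rest of my script, here is a minimal self-contained sketch of the same approach. It is a variant, not what I run in production: it uses groupby(sort=False) instead of unique() plus filtering, and str.join instead of the index check. The function name tables_with_page_breaks and the sample frame are made up for illustration; the sample data is the table from the question.

```python
import pandas as pd

# Marker that tells the HTML-to-PDF converter to start a new page.
HTML_PAGE_BREAK = '<div style="page-break-after: always;"></div>'


def tables_with_page_breaks(df, group_column):
    """Render one HTML table per group, joined by page breaks.

    Hypothetical helper name, used only for this illustration.
    """
    # sort=False keeps the groups in their order of first appearance,
    # so an already sorted frame is not reordered.
    tables = [
        sub_df.to_html(index=False)
        for _, sub_df in df.groupby(group_column, sort=False)
    ]
    # str.join puts the break only between tables, never after the last one.
    return HTML_PAGE_BREAK.join(tables)


# Sample data taken from the example table in the question.
sample_df = pd.DataFrame({
    "col1": list("AABBBBCC"),
    "col2": list("xyxxyxxy"),
    "col3": list("abaabaab"),
})

html_body = tables_with_page_breaks(sample_df, "col1")
```

The returned string would take the place of what the loop above appends, i.e. something like html_head + html_note + html_body + html_foot. Since every sub table is rendered by its own to_html call, each one carries its own header row, which is what makes the column headings show up again on every page.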
Thanks @notarealgreal for html_page_break + enumerate.