
How can I create a single hyperlinked PDF containing content from 2 different Websites?


I have a particular use case and, after some online searching, suspect there is no pre-built solution, so I'm curious what you would recommend for implementing it.

I have a Table of Contents on domain A (https://true-freedom.net/) and each entry of the TOC links to exactly one post on domain B (https://www.quora.com/).

My goal is to create a single PDF containing both the TOC from domain A and the individual posts from domain B, with links inside the PDF from each TOC entry to its post.

Which tool, language, library, etc. would you use to do this, and why?


Solution

  • After posting this question I signed up for ChatGPT, and the first thing I asked it was the question above. I was impressed with the answer and the code example it provided.

    The basic approach was the following (using Python):

    With BeautifulSoup, get all the links from the table of contents page:

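    from bs4 import BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")  # response: the fetched TOC page
    # href=True skips <a> tags without an href, which would otherwise raise KeyError
    post_urls = [link["href"] for link in soup.find_all("a", href=True) if "quora.com" in link["href"]]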
    

    Then create a single HTML file containing both the TOC and the content of each post the TOC links to.

    In that file, replace the TOC's external links with anchor links that point to the corresponding posts.

    Then use wkhtmltopdf to convert the final HTML file to PDF:

    subprocess.run(["wkhtmltopdf", "all.html", "output.pdf"])
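
The middle step (one HTML file with anchor links from the TOC to each post) was not shown in the answer, so here is a minimal sketch of how it might look. It assumes the TOC has already been reduced to (title, url) pairs and that each post's HTML has already been fetched into a dict keyed by URL. The function name `build_combined_html`, the `post-N` anchor ids, and the shape of both inputs are hypothetical and not part of the original answer.

```python
from html import escape


def build_combined_html(toc_entries, posts):
    """Return one HTML document whose TOC links point at in-page anchors.

    toc_entries: list of (title, url) pairs in TOC order (assumed shape).
    posts: dict mapping each url to that post's already-fetched HTML fragment.
    """
    toc_items = []
    sections = []
    for index, (title, url) in enumerate(toc_entries, start=1):
        anchor = f"post-{index}"  # id derived from the entry's position in the TOC
        # The TOC link targets the in-page anchor instead of the external URL.
        toc_items.append(f'<li><a href="#{anchor}">{escape(title)}</a></li>')
        # Fall back to a placeholder if a post was not fetched.
        body = posts.get(url, "<p>(post not fetched)</p>")
        sections.append(
            f'<section id="{anchor}"><h2>{escape(title)}</h2>{body}</section>'
        )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'></head><body>"
        "<h1>Table of Contents</h1><ol>" + "".join(toc_items) + "</ol>"
        + "".join(sections)
        + "</body></html>"
    )
```

Writing the returned string to all.html gives the file that the wkhtmltopdf call above converts. Ids based on TOC position avoid having to turn post URLs or titles into valid anchor names.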