Tags: python, html, split, langchain, py-langchain

Splitting HTML file and saving chunks using LangChain


I'm very new to LangChain, and I'm working with around 100-150 HTML files on my local disk that I need to upload to a server for NLP model training. However, I have to divide my information into chunks because each file is only permitted to have a maximum of 20K characters. I'm trying to use the LangChain library to do so, but I'm not being successful in splitting my files into my desired chunks.

For reference, I'm using this URL: http://www.hadoopadmin.co.in/faq/, saved locally as HTML only.

It's a Hadoop FAQ page that I've downloaded as an HTML file onto my PC, and it contains many questions and answers. I've noticed that sometimes a chunk ends up being just a title, and the paragraph that follows that title lands in the next chunk. What I want instead is a chunk that keeps the title together with the paragraph (or following text) from the body of the page, with the title of the page as metadata.

I'm using this code:

from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_text_splitters import HTMLHeaderTextSplitter
# Same Example with the URL http://www.hadoopadmin.co.in/faq/ Saved Locally as HTML Only
dir_html_file='FAQ – BigData.html'

data_html = UnstructuredHTMLLoader(dir_html_file).load()

headers_to_split_on = [
    ("h1", "Header 1")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(str(data_html))

But it returns a bunch of weird escaped characters and doesn't split the document at all.

This is an output:

[Document(page_content='[Document(page_content=\'BigData\\n\\n"You can have data without information, but you cannot have information without Big data."\\n\\[email protected]\\n\\n+91-8147644946\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nToggle navigation\\n\\nHome\\n\\nBigData\\n\\n\\tOverview of BigData\\n\\tSources of BigData\\n\\tPros & Cons of BigData\\n\\tSolutions of BigData\\n\\nHadoop Admin\\n\\n\\tHadoop\\n\\t\\n\\t\\tOverview of HDFS\\n\\t\\tOverview of MapReduce\\n\\t\\tApache YARN\\n\\t\\tHadoop Architecture\\n\\t\\n\\n\\tPlanning of Hadoop Cluster\\n\\tAdministration and Maintenance\\n\\tHadoop Ecosystem\\n\\tSetup HDP cluster from scratch\\n\\tInstallation and Configuration\\n\\tAdvanced Cluster Configuration\\n\\tOverview of Ranger\\n\\tKerberos\\n\\t\\n\\t\\tInstalling kerberos/Configuring the KDC and Enabling Kerberos Security\\n\\t\\tConfigure SPNEGO Authentication for Hadoop\\n\\t\\tDisabled kerberos via ambari\\n\\t\\tCommon issues after Disabling kerberos via Ambari\\n\\t\\tEnable https for ambari Server\\n\\t\\tEnable SSL or HTTPS for Oozie Web UI\\n\\nHadoop Dev\\n\\n\\tSolr\\n\\t\\n\\t\\tSolr Installation\\n\\t\\tCommits and Optimizing in Solr and its use for NRT\\n\\t\\tSolr FAQ\\n\\t\\n\\n\\tApache Kafka\\n\\t\\n\\t\\tKafka QuickStart\\n\\t\\n\\n\\tGet last access time of hdfs files\\n\\tProcess hdfs data with Java\\n\\tProcess hdfs data with Pig\\n\\tProcess hdfs data with Hive\\n\\tProcess hdfs data with Sqoop/Flume\\n\\nBigData Architect\\n\\n\\tSolution Vs Enterprise Vs Technical Architect’s Role and Responsibilities\\n\\tSolution architect certification\\n\\nAbout me\\n\\nFAQ\\n\\nAsk Questions\\n\\nFAQ\\n\\nHome\\n\\nFAQ\\n\\nFrequently\\xa0Asked Questions about Big Data\\n\\nMany questions about big data have yet to be answered in a vendor-neutral way. With so many definitions, opinions run the gamut. Here I will attempt to cut to the heart of the matter by addressing some key questions I often get from readers, clients and industry analysts.\\n\\n1) What is Big Data?\\n\\n1) What is Big Data?\\n\\nBig data” is an all-inclusive term used to describe vast amounts of information. In contrast to traditional structured data which is typically stored in a relational database, big data varies in terms of volume, velocity, and variety.\\n\\nBig data\\xa0is characteristically generated in large volumes – on the order of terabytes or exabytes of data (starts with 1 and has 18 zeros after it, or 1 million terabytes) per individual data set.\\n\\nBig data\\xa0is also generated with high velocity – it is collected at frequent intervals – which makes it difficult to analyze (though analyzing it rapidly makes it more valuable).\\n\\nOr in simple words we can say “Big Data includes data sets whose size is beyond the ability of traditional software tools to capture, manage, and process the data in a reasonable time.”\\n\\n2) How much data does it take to be called Big Data?\\n\\nThis question cannot be easily answered absolutely. Based on the infrastructure on the market the lower threshold is at about 1 to 3 terabytes.\\n\\nBut using Big Data technologies can be sensible for smaller databases as well, for example if complex mathematiccal or statistical analyses are run against a database. Netezza offers about 200 built in functions and computer languages like Revolution R or Phyton which can be used in such cases.\\n\\

My expected output would look something like this:

One chunk:

Frequently Asked Questions about Big Data

Many questions about big data have yet to be answered in a vendor-neutral way. With so many definitions, opinions run the gamut. Here I will attempt to cut to the heart of the matter by addressing some key questions I often get from readers, clients and industry analysts.

1) What is Big Data?
“Big data” is an all-inclusive term used to describe vast amounts of information. In contrast to traditional structured data which is typically stored in a relational database, big data varies in terms of volume, velocity, and variety. Big data is characteristically generated in large volumes – on the order of terabytes or exabytes of data (starts with 1 and has 18 zeros after it, or 1 million terabytes) per individual data set. Big data is also generated with high velocity – it is collected at frequent intervals – which makes it difficult to analyze (though analyzing it rapidly makes it more valuable).
Or in simple words we can say “Big Data includes data sets whose size is beyond the ability of traditional software tools to capture, manage, and process the data in a reasonable time.”
2) How much data does it take to be called Big Data?
This question cannot be easily answered absolutely. Based on the infrastructure on the market the lower threshold is at about 1 to 3 terabytes.
But using Big Data technologies can be sensible for smaller databases as well, for example if complex mathematical or statistical analyses are run against a database. Netezza offers about 200 built in functions and computer languages like Revolution R or Phyton which can be used in such cases.

Metadata: FAQ


Another chunk:
7) Where is the big data trend going?
Eventually the big data hype will wear off, but studies show that big data adoption will continue to grow. With a projected $16.9B market by 2015 (Wikibon goes even further to say $50B by 2017), it is clear that big data is here to stay. However, the big data talent pool is lagging behind and will need to catch up to the pace of the market. McKinsey & Company estimated in May 2011 that by 2018, the US alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
The emergence of big data analytics has permanently altered many businesses’ way of looking at data. Big data can take companies down a long road of staff, technology, and data storage augmentation, but the payoff – rapid insight into never-before-examined data – can be huge. As more use cases come to light over the coming years and technologies mature, big data will undoubtedly reach critical mass and will no longer be labeled a trend. Soon it will simply be another mechanism in the BI ecosystem.
8) Who are some of the BIG DATA users?
From cloud companies like Amazon to healthcare companies to financial firms, it seems as if everyone is developing a strategy to use big data. For example, every mobile phone user has a monthly bill which catalogs every call and every text; processing the sheer volume of that data can be challenging. Software logs, remote sensing technologies, information-sensing mobile devices all pose a challenge in terms of the volumes of data created. The size of Big Data can be relative to the size of the enterprise. For some, it may be hundreds of gigabytes, for others, tens or hundreds of terabytes to cause consideration.
9) Data visualization is becoming more popular than ever.
In my opinion, it is absolutely essential for organizations to embrace interactive data visualization tools. Blame or thank big data for that and these tools are amazing. They are helping employees make sense of the never-ending stream of data hitting them faster than ever. Our brains respond much better to visuals than rows on a spreadsheet.
Companies like Amazon, Apple, Facebook, Google, Twitter, Netflix and many others understand the cardinal need to visualize data. And this goes way beyond Excel charts, graphs or even pivot tables. Companies like Tableau Software have allowed non-technical users to create very interactive and imaginative ways to visually represent information.

Metadata: FAQ  

My goal is to gather all the information and split it into chunks, but I don't want a title to be separated from the paragraphs that follow it, and I want each chunk to hold as much text as possible (up to the 20K-character limit) before a new chunk is started.

I would also like to save these chunks and their metadata. Is there a function in LangChain to do this?
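
For what it's worth, this is the rough shape of the pipeline I have in mind, but I'm not sure I'm combining the right LangChain pieces (an untested sketch; the chunks.json file name is just for illustration):

from langchain_text_splitters import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter
import json

# Read the raw HTML itself instead of stringifying the loaded Document list
with open('FAQ – BigData.html', encoding='utf-8') as f:
    raw_html = f.read()

# Split on headers so each title stays attached to the text that follows it
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
header_splits = html_splitter.split_text(raw_html)

# Enforce the 20K-character limit per chunk while keeping the header metadata
char_splitter = RecursiveCharacterTextSplitter(chunk_size=20000, chunk_overlap=0)
chunks = char_splitter.split_documents(header_splits)

# Save the chunks together with their metadata
with open('chunks.json', 'w', encoding='utf-8') as f:
    json.dump([{'text': d.page_content, 'metadata': d.metadata} for d in chunks], f, ensure_ascii=False, indent=2)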

I am open to hearing not to do this in LangChain for efficiency reasons.


Solution

  • Check out this HTML chunking package: pip install html_chunking

    • The html_chunking algorithm chunks and merges HTML content in several stages while respecting a token limit. This makes it a good fit when token limits are critical and accurate HTML parsing matters, for example in web automation or navigation tasks where HTML content serves as the model input.

    • For those of you who are interested, here's a demo:

    from html_chunking import get_html_chunks

    # Split the raw HTML into chunks of at most ~1000 tokens; clean the HTML and truncate attribute values longer than 25 characters
    merged_chunks = get_html_chunks(your_html_string_here, max_tokens=1000, is_clean_html=True, attr_cutoff_len=25)
    merged_chunks
    
    • The output consists of several HTML chunks, where each chunk is valid HTML with its structure and attributes preserved (from the root node all the way down to the current node) and any excessively long attribute values truncated to the specified length. (A short example of saving these chunks to disk follows at the end of this answer.)

    Check out the html_chunking PyPI page and our GitHub page for more demos!

    • If you are looking for the best way to chunk HTML for web automation or other web-agent tasks, you should definitely try html_chunking!

    • LangChain (HTMLHeaderTextSplitter & HTMLSectionSplitter) and LlamaIndex (HTMLNodeParser) split text at the element level and add metadata for each header relevant to the chunk. However, they extract only the text content and exclude the HTML structure, attributes, and other non-text elements, limiting their use for tasks requiring the full HTML context.

    • Check out our GitHub repo and give it a star: https://github.com/KLGR123/html_chunking
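
  • Since the question also asks about saving the chunks: assuming get_html_chunks returns a plain list of HTML strings (as the demo above suggests), writing them to disk is a simple loop. The file-name pattern below is only illustrative:

    # Write each HTML chunk to its own file: chunk_000.html, chunk_001.html, ...
    for i, chunk in enumerate(merged_chunks):
        with open(f'chunk_{i:03d}.html', 'w', encoding='utf-8') as out:
            out.write(chunk)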