TypeScript LangChain add field to document metadata


How should I add a field to the metadata of LangChain's Documents?

For example, using the CharacterTextSplitter gives a list of Documents:

const splitter = new CharacterTextSplitter({
  separator: " ",
  chunkSize: 7,
  chunkOverlap: 3,
});
// createDocuments returns a Promise, so await it to get the array
const documents = await splitter.createDocuments([text]);

A document will have the following structure:

{
  "pageContent": "blablabla",
  "metadata": {
    "name": "my-file.pdf",
    "type": "application/pdf",
    "size": 12012,
    "lastModified": 1688375715518,
    "loc": { "lines": { "from": 1, "to": 3 } }
  }
}

I want to add a field to the metadata.


Solution

  • The text splitter documentation does not currently show how to do this, but the second argument of createDocuments accepts an array of metadata objects, matched by index to the input texts. The properties of each object are copied into the metadata of every document produced from the corresponding text.

    const myMetaData = { url: "https://www.google.com" };
    // The third argument is optional; chunkHeader is a string defined elsewhere.
    const documents = await splitter.createDocuments([text], [myMetaData],
      { chunkHeader, appendChunkOverlapHeader: true });

    After this, documents is an array in which each element is an object with pageContent and metadata properties. The properties of myMetaData above appear under metadata, alongside loc. The text of chunkHeader is prepended to pageContent.

    {
      pageContent: <chunkHeader plus the chunk>,
      metadata: <all properties of myMetaData plus loc (text line numbers of chunk)>
    }
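
If the documents already exist (for example, they came from a loader or an earlier split), another option is to copy each one with the extra field merged into its metadata. The sketch below is only an illustration: the Doc interface is a local stand-in that mirrors the structure shown in the question, and addMetadata is a hypothetical helper. Neither is part of LangChain.

```typescript
// Local stand-in for the document shape shown in the question.
// Assumption: this is not imported from LangChain.
interface Doc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: returns new documents whose metadata includes
// the extra fields. The input documents are left unchanged.
function addMetadata(docs: Doc[], extra: Record<string, unknown>): Doc[] {
  return docs.map((doc) => ({
    ...doc,
    metadata: { ...doc.metadata, ...extra },
  }));
}

const originalDocs: Doc[] = [
  {
    pageContent: "blablabla",
    metadata: { name: "my-file.pdf", loc: { lines: { from: 1, to: 3 } } },
  },
];

const taggedDocs = addMetadata(originalDocs, { url: "https://www.google.com" });
console.log(taggedDocs[0].metadata);
```

Because the spread creates new objects, the originals stay untouched. If an extra field has the same key as an existing one, the extra value wins.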