Tags: csv, large-language-model, chunking, openai, embeddings

How to select chunk size of data for embedding with an LLM?


I have structured data (CSV) with a column of semantically rich text of variable length. I could preprocess the data so the CSV file has a maximum length per row by using an LLM to summarize the semantically rich text down to a maximum size. I'm using OpenAI GPT-3.5 Turbo.

Is it important to pick a chunk size that accommodates the maximum possible size of a row? Or does it matter very little, so I can work with variable row sizes, select a median chunk size for my data, and let the LLM deal with some records being split across separate chunks?
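
For reference, this is roughly how I'm inspecting the token-length distribution of the rows (the file name and the `description` column name are just placeholders for my actual data):

```python
# Sketch: measure the token length of each row's text column with tiktoken.
import csv
import statistics

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5 / ada-002

lengths = []
with open("data.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        lengths.append(len(enc.encode(row["description"])))

print("max tokens per row:   ", max(lengths))
print("median tokens per row:", statistics.median(lengths))
```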


Solution

  • For CSV data, it is best to fit each row of data on its own within a single chunk. This advice generalizes to any row-based (record-oriented) data, independent of the CSV format, but it may not apply to other kinds of data that are not record based.

    Background: Because the data is CSV, it is implied that the content within a row has a strong semantic relationship, while there is little to no semantic relationship with the previous or next row; i.e., row ordering can be random because the rows are independent of each other.

    So when generating embeddings for this kind of data, the goal is for each row of the CSV to become its own vector. When the LLM is queried, the retrieved context then consists of whole rows, and the answers it generates are oriented around the semantic content of the relevant rows rather than of fragments split across chunks (a short sketch of this row-per-chunk approach follows below).

    For more background, Chunking Strategies for LLM Applications is a good source.
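
As a concrete illustration of the row-per-chunk approach, here is a minimal sketch. It assumes the `openai` Python client (v1+), an `OPENAI_API_KEY` in the environment, the `text-embedding-ada-002` model, and placeholder file/column names:

```python
# Sketch: one CSV row -> one chunk -> one embedding vector.
import csv

from openai import OpenAI

client = OpenAI()

rows = []
with open("data.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Keep the whole row together so its semantic content stays in one vector.
        rows.append(", ".join(f"{k}: {v}" for k, v in row.items()))

response = client.embeddings.create(model="text-embedding-ada-002", input=rows)
vectors = [item.embedding for item in response.data]  # one vector per CSV row
```

Because each vector corresponds to exactly one record, a similarity search returns whole rows, and no record is ever split across chunk boundaries.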