google-cloud-storage google-cloud-vertex-ai

How to automate .txt ingestion from GCS into Vertex AI datastore?


I'm trying to simplify an AI project with RAG.

  1. The retrieval part of RAG (retrieval-augmented generation) is handled by Google Vertex AI Search
  2. The LLM is either a local model or the API of OpenAI or a similar provider

I'm struggling a bit with (1): I can of course upload my .txt files manually into my Vertex Datastore through the Google Cloud console UI, but I cannot manage to do it programmatically.

  1. I've tried running this sample code...
  2. but I got an error: "The provided GCS URI has invalid unstructured data format. Please provide a valid GCS path in either NDJSON(.ndjson) or JSON Lines(.jsonl) format." (of course, my data is unstructured raw text!).
  3. I then tried setting data_schema="document", without success.

Do you know if it's possible to upload .txt files programmatically into a Vertex Datastore?
Is there a simpler way to keep a GCS bucket in sync with a Vertex Datastore?
It seems no NodeJS library exists for importing data into a Vertex Datastore, which is weird...

Thanks


Solution

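  • You need to set data_schema to 'content', the schema for unstructured files such as raw .txt, rather than 'document', which expects NDJSON or JSON Lines metadata.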

    This notebook may help: https://github.com/GoogleCloudPlatform/generative-ai/blob/main/search/create_datastore_and_search.ipynb
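
As a rough illustration of the fix, here is a minimal sketch using the google-cloud-discoveryengine Python client, adapted from the shape of Google's import sample. The project ID, location, data store ID, and bucket name are placeholders, and gcs_txt_pattern is a hypothetical helper introduced only for this example. It assumes the data store already exists and that a wildcard URI is accepted for your files; treat it as a starting point rather than a definitive implementation.

```python
def gcs_txt_pattern(bucket: str, prefix: str = "") -> str:
    """Hypothetical helper: wildcard URI matching the .txt files under a prefix."""
    prefix = prefix.strip("/")
    return f"gs://{bucket}/{prefix}/*.txt" if prefix else f"gs://{bucket}/*.txt"


def import_txt_files(project_id: str, location: str, data_store_id: str, gcs_uri: str):
    """Import unstructured files from GCS into a Vertex AI Search data store."""
    # Imported inside the function so the helper above can be used
    # even where the client library is not installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import discoveryengine

    # Non-global locations need a regional API endpoint.
    client_options = (
        ClientOptions(api_endpoint=f"{location}-discoveryengine.googleapis.com")
        if location != "global"
        else None
    )
    client = discoveryengine.DocumentServiceClient(client_options=client_options)

    parent = client.branch_path(
        project=project_id,
        location=location,
        data_store=data_store_id,
        branch="default_branch",
    )

    request = discoveryengine.ImportDocumentsRequest(
        parent=parent,
        gcs_source=discoveryengine.GcsSource(
            input_uris=[gcs_uri],
            # "content" tells the importer the files are raw unstructured
            # documents, not NDJSON / JSON Lines metadata.
            data_schema="content",
        ),
        # INCREMENTAL adds or updates documents. The enum also has FULL,
        # which may delete documents missing from the source, so check the
        # docs before relying on it to mirror a bucket.
        reconciliation_mode=(
            discoveryengine.ImportDocumentsRequest.ReconciliationMode.INCREMENTAL
        ),
    )

    # import_documents returns a long-running operation; result() blocks
    # until the import finishes.
    operation = client.import_documents(request=request)
    return operation.result()
```

With placeholder values, a call would look like import_txt_files("my-project", "global", "my-datastore", gcs_txt_pattern("my-bucket", "docs")). Re-running the import on a schedule, or from a function triggered by bucket changes, is one possible way to approach the sync question, though that part is an assumption and not something the answer above confirms.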