I'm importing a text dataset to Google Vertex AI and got the following error:
Hello Vertex AI Customer,
Due to an error, Vertex AI was unable to import data into
dataset [dataset_name].
Additional Details:
Operation State: Failed with errors
Resource Name: [resource_link]
Error Messages: There are too many rows in the jsonl/csv file. Currently we
only support 1000000 lines. Please cut your files to smaller size and run
multiple import data pipelines to import.
I checked my dataset, which I generated from pandas, and the actual CSV file — it only has 600k lines.
Anyone got similar errors?
So it turns out to be an error in my CSV formatting.
I forgot to trim newlines and extra whitespace in my text dataset. Fixing that resolved the 1M line-count error. But after doing that, I got another error telling me I have too many labels, even though there were only 2:
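A minimal sketch of the trimming step, assuming your DataFrame has a text column named "text" (the column name and sample strings here are made up for illustration). Embedded newlines in a field become extra physical lines in the CSV unless the field is quoted, which is what inflated the line count:

```python
import pandas as pd

# Hypothetical dataset: text values with embedded newlines and stray whitespace.
df = pd.DataFrame(
    {"text": ["hello\nworld  ", "  foo\r\nbar"], "label": [0, 1]}
)

# Collapse all runs of whitespace (including embedded newlines) to a single
# space, then trim the ends, so each record occupies exactly one CSV line.
df["text"] = df["text"].str.replace(r"\s+", " ", regex=True).str.strip()

print(df["text"].tolist())  # ['hello world', 'foo bar']
```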
Error Messages: There are too many AnnotationSpecs in the dataset. Up to
5000 AnnotationSpecs are allowed in one Dataset.
And this is because I created the text dataset using the to_csv() method on a Pandas DataFrame. A CSV file created this way only gets quotes around a field when the text contains a "," (comma character). So the CSV file will look like:
"this is a sentence, with a comma", 0
this is a sentence without a comma, 1
Meanwhile, what Vertex AutoML Text wants the CSV is to look like this:
"this is a sentence, with a comma", 0
"this is a sentence without a comma", 1
i.e. you have to put quotes on every line.
You can achieve this by writing your own CSV formatter, or, if you want to keep using Pandas to_csv(), by passing csv.QUOTE_ALL to the quoting parameter. It will look like this:
import csv
df.to_csv("file.csv", index=False, quoting=csv.QUOTE_ALL, header=False)
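A quick way to confirm the quoting behavior without writing a file is to export to an in-memory buffer (the sample sentences below are placeholders, not from a real dataset). With csv.QUOTE_ALL, every field on every row gets quoted, including the label column:

```python
import csv
import io

import pandas as pd

# Two hypothetical rows mirroring the example above.
df = pd.DataFrame(
    {
        "text": [
            "this is a sentence, with a comma",
            "this is a sentence without a comma",
        ],
        "label": [0, 1],
    }
)

# Write to a string buffer instead of a file to inspect the output.
buf = io.StringIO()
df.to_csv(buf, index=False, quoting=csv.QUOTE_ALL, header=False)
print(buf.getvalue())
```

Note that QUOTE_ALL also quotes the label values, which Vertex accepted in my case.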