Vertex AI was unable to import data into dataset: it reports a 1M line maximum, but my dataset only has 600k lines

I'm importing a text dataset to Google Vertex AI and got the following error:

Hello Vertex AI Customer,

Due to an error, Vertex AI was unable to import data into 
dataset [dataset_name].
Additional Details:
Operation State: Failed with errors
Resource Name: [resource_link]
Error Messages: There are too many rows in the jsonl/csv file. Currently we 
only support 1000000 lines. Please cut your files to smaller size and run 
multiple import data pipelines to import.

I checked both the dataset I generated with pandas and the actual CSV file, and they only have 600k lines.

Has anyone run into a similar error?

Solution

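It turned out to be an error in my CSV formatting.

I forgot to trim newlines and extra whitespace in my text dataset. A newline inside a text field splits one row across several lines in the file, which is presumably how 600k rows ended up over the 1M line limit. Trimming them fixed the line count error. After that, however, I got another error telling me I had too many labels, even though I only had 2.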

Error Messages: There are too many AnnotationSpecs in the dataset. Up to 
5000 AnnotationSpecs are allowed in one Dataset.
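For the first fix (trimming newlines and extra whitespace), here is a minimal sketch of what the cleanup might look like. The sample rows, the column names "text" and "label", and the helper physical_line_count are made up for illustration. It also assumes the importer counts raw lines in the file rather than parsed CSV records, which matches the error I saw but is not something I have confirmed.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names "text" and "label" are assumptions.
df = pd.DataFrame({
    "text": ["  first line\nsecond line  ", "already   clean\ttext"],
    "label": [0, 1],
})


def physical_line_count(frame: pd.DataFrame) -> int:
    """Count the raw lines in the CSV output, as a line-based limit would."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False, header=False)
    return len(buffer.getvalue().splitlines())


# Two rows, but the newline inside the first text field adds an extra line.
lines_before = physical_line_count(df)

# Collapse every run of whitespace (spaces, tabs, newlines) into one space,
# then strip leading and trailing whitespace.
df["text"] = df["text"].str.replace(r"\s+", " ", regex=True).str.strip()

lines_after = physical_line_count(df)

print(lines_before, lines_after)
print(df["text"].tolist())
```

Comparing the raw line count with len(df) before uploading is a quick way to spot this kind of mismatch.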

The AnnotationSpecs error occurred because I created the text dataset with the pandas DataFrame.to_csv() method. By default, to_csv() uses minimal quoting (csv.QUOTE_MINIMAL): it only puts quotes around a field when the field needs them, for example when the text includes a "," (comma character). So the CSV file looks like this:

"this is a sentence, with a comma", 0
this is a sentence without a comma, 1

Meanwhile, Vertex AutoML Text wants the CSV to look like this:

"this is a sentence, with a comma", 0
"this is a sentence without a comma", 1

That is, the text on every line has to be quoted.

You can achieve this by writing your own CSV formatter or, if you want to keep using pandas to_csv(), by passing csv.QUOTE_ALL to the quoting parameter. It looks like this:

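import csv

# QUOTE_ALL puts quotes around every field; index=False and header=False
# leave out the index column and the header row.
df.to_csv("file.csv", index=False, quoting=csv.QUOTE_ALL, header=False)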
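
To see the difference before uploading anything, a small comparison like the one below may help. It is only a sketch: the sample rows and the helper to_csv_text are made up for illustration, and it writes to an in-memory buffer instead of a file.

```python
import csv
import io

import pandas as pd

# Hypothetical sample rows mirroring the two example sentences above.
df = pd.DataFrame({
    "text": [
        "this is a sentence, with a comma",
        "this is a sentence without a comma",
    ],
    "label": [0, 1],
})


def to_csv_text(frame: pd.DataFrame, quoting: int) -> str:
    """Render the frame as CSV text using the given csv quoting constant."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False, header=False, quoting=quoting)
    return buffer.getvalue()


# pandas default: only fields that need quoting (here, the one with a comma).
default_lines = to_csv_text(df, csv.QUOTE_MINIMAL).splitlines()

# QUOTE_ALL: every field on every line is quoted.
quoted_lines = to_csv_text(df, csv.QUOTE_ALL).splitlines()

# Simple sanity check: every line starts with a quoted text field.
all_quoted = all(line.startswith('"') for line in quoted_lines)

print(default_lines)
print(quoted_lines)
print(all_quoted)
```

Note that csv.QUOTE_ALL quotes the label column as well, so the lines end in "0" and "1" with quotes. That did not cause a problem in my import.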