python google-cloud-vertex-ai

Vertex AI was unable to import data into dataset. It reports a maximum of 1M lines, but my dataset only has 600k


I'm importing a text dataset to Google Vertex AI and got the following error:

Hello Vertex AI Customer,

Due to an error, Vertex AI was unable to import data into 
dataset [dataset_name].
Additional Details:
Operation State: Failed with errors
Resource Name: [resource_link]
Error Messages: There are too many rows in the jsonl/csv file. Currently we 
only support 1000000 lines. Please cut your files to smaller size and run 
multiple import data pipelines to import.

I checked both the pandas DataFrame I generated the dataset from and the actual CSV file, and the dataset only has 600k rows.

Has anyone run into a similar error?


Solution

  • It turned out to be an error in my CSV formatting.

    I had forgotten to trim newlines and extra whitespace in my text data, so presumably each embedded newline spread one record across several physical lines. Trimming them fixed the 1M-line error, but I then got a different error telling me I had too many labels, even though the dataset only has 2:

    Error Messages: There are too many AnnotationSpecs in the dataset. Up to 
    5000 AnnotationSpecs are allowed in one Dataset.
    

    This happened because I created the CSV file with the pandas DataFrame to_csv() method. By default, to_csv() only puts quotes around a field when it has to, for example when the text includes a "," (comma character). So the CSV file looks like this:

    "this is a sentence, with a comma", 0
    this is a sentence without a comma, 1
    

    Meanwhile, Vertex AutoML Text wants the CSV to look like this:

    "this is a sentence, with a comma", 0
    "this is a sentence without a comma", 1
    

    i.e. you have to put quotes around the text on every line.

    You can achieve this by writing your own CSV formatter or, if you want to keep using pandas to_csv(), by passing csv.QUOTE_ALL to the quoting parameter:

    import csv
    df.to_csv("file.csv", index=False, quoting=csv.QUOTE_ALL, header=False)
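
Putting both fixes together, a minimal sketch could look like the following. The sample DataFrame, the column names `text` and `label`, and the helper `clean_text` are hypothetical and only serve as illustration. Note that csv.QUOTE_ALL quotes every field, so the label column is quoted as well; the to_csv() call above behaves the same way.

```python
import csv
import io

import pandas as pd


def clean_text(series: pd.Series) -> pd.Series:
    """Collapse newlines, tabs, and repeated spaces into single spaces, then trim."""
    return series.astype(str).str.replace(r"\s+", " ", regex=True).str.strip()


# Hypothetical sample data: one text has an embedded newline, one has a comma.
df = pd.DataFrame(
    {
        "text": [
            "  first line\nsecond line ",
            "a sentence, with a comma",
            "plain sentence",
        ],
        "label": [0, 1, 0],
    }
)

df["text"] = clean_text(df["text"])

# Write to an in-memory buffer here; pass a file path instead for a real export.
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False, quoting=csv.QUOTE_ALL)
csv_text = buffer.getvalue()

# After cleaning, every record occupies exactly one physical line,
# so the line count of the file matches the row count of the DataFrame.
physical_lines = csv_text.splitlines()
assert len(physical_lines) == len(df)

print(csv_text)
```

Comparing the number of physical lines with len(df) before uploading is a cheap sanity check: if the two differ, some text field still contains a line break.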