Search code examples
pythongoogle-bigquerygoogle-cloud-platformgoogle-cloud-python

Should BigQuery throw an error using table.insert_data() with skip_invalid_rows=True?


I'm using the BigQuery API for Python to insert data in a table with table.insert_data().

Sometimes a row will contain a None value for a field that is set to REPEATED mode in the table schema. When this happens, the following error is returned by the API:

[{'index': 48, 
'errors': [
    {u'debugInfo': u'generic::invalid_argument: Field value cannot be empty.', 
     u'reason': u'invalid', 
     u'message': u'Field value cannot be empty.', 
     u'location': u'name_of_my_field'}]}]

I would like these rows to be ignored. This is the prototype of the function i am using, that can also be found here:

insert_data(rows, row_ids=None, skip_invalid_rows=None, ignore_unknown_values=None, template_suffix=None, client=None)

And this is the parameter I am interested in because it seems to perfectly fit my needs:

skip_invalid_rows (bool) – (Optional) skip rows w/ invalid data?

It mirrors this parameter in the main API:

skipInvalidRows boolean [Optional] Insert all valid rows of a request, even if invalid rows exist. The default value is false, which causes the entire request to fail if any invalid rows exist.

However by using the function with skip_invalid_rows=True, the same errors are thrown.

Does this parameter mean what I think it means?

Worth pointing out:

  • why is there a question mark in the doc at the end of the description of skip_invalid_rows in the doc?
  • ignore_unknown_values=True works fine for its own purpose.

Any help much appreciated. :)


Solution

  • Maybe it's confusing but the error should still be thrown as a way to tell the end user that something didn't process as expected.

    It does not mean though that correct rows are not being saved, they should be! If you check your table, you should see these rows being written normally there.

    You can run a simple test like:

    table.insert_data([('1', ['1', None, '2']), ('2', ['1', '2'])])
    

    In this case, the value ('2', ['1', '2']) should be saved normally.

    I agree though on the docstrings comments for this method not being that helpful. I myself was about to make a pull request adding some new features for the Table resource but it's expected some major refactoring of this module in the near future so for now you might find these type of things there (and some other things that also might be confusing such as job resource defined in the table one).