Tags: r, google-bigquery, bigrquery

Bigrquery forcefully coerces strings to integers (schema is a string)


I'm working with zip codes, which of course have leading zeros. I'm loading my data frame in R so that the leading zeros are preserved, but the upload step fails. Here's what I mean:

Here's my minimal.csv file:

zip,val
07030,10
10001,100
90210,1000
60602,10000

Here's the R code:

require("bigrquery")
filename <- "minimal.csv"
tablename <- "as_STRING"
ds <- bq_dataset(project="myproject", dataset="zips")

I am also correctly setting the type in my schema to expect strings, as shown further down.

# first pass
df <- read.csv(filename, stringsAsFactors=F)
# > df
#     zip   val
# 1  7030    10
# 2 10001   100
# 3 90210  1000
# 4 60602 10000

# uh oh!  Let's fix it!

cols <- unlist(lapply(df, class))
cols[[1]] <- "character" # make zipcode a character

# then reload
df2 <- read.csv(filename, stringsAsFactors=F, colClasses=cols)
# > df2
#     zip   val
# 1 07030    10
# 2 10001   100
# 3 90210  1000
# 4 60602 10000

# much better!  You can see my zips are now strings.
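
(Aside: the same result can be had in a single read by naming the column in colClasses, which read.csv matches against the header; a minimal sketch, assuming the column is named "zip".)

df2b <- read.csv(filename, stringsAsFactors=F,
                 colClasses=c(zip="character"))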

However, when I try to upload, bigrquery complains that I am uploading integers, which I am not. Here's the schema, which expects strings:

# create schema
bq_table_create(bq_table(ds, tablename), fields=df2) # using df2, which has strings

# now prove it got the strings right:
    > bq_table_meta(bq_table(ds, tablename))$schema$fields
    [[1]]
    [[1]]$name
    [1] "zip"

    [[1]]$type
    [1] "STRING"                # GOOD, ZIP IS A STRING!

    [[1]]$mode
    [1] "NULLABLE"


    [[2]]
    [[2]]$name
    [1] "val"

    [[2]]$type
    [1] "INTEGER"

    [[2]]$mode
    [1] "NULLABLE"
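
For what it's worth, the schema could also have been declared explicitly instead of inferred from df2; a sketch using bq_fields() and bq_field(), both exported by bigrquery (worth double-checking against your installed version):

# declare the schema field by field rather than inferring it from a data frame
bq_table_create(
  bq_table(ds, tablename),
  fields = bq_fields(list(
    bq_field("zip", "string"),
    bq_field("val", "integer")
  ))
)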

Now it's time to upload...

bq_table_upload(bq_table(ds, tablename), df2) # using df2, with STRINGS
Error: Invalid schema update. Field zip has changed type from STRING to INTEGER [invalid]

Huh? What is this invalid schema update, and how can I stop it from converting my zip codes to integers? The data contains strings and the schema declares STRING; neither side involves integers.

Is there some JSON serialization happening behind the scenes that's turning my strings back into integers?


Solution

  • That is because BigQuery auto-detects the schema when it is not specified. This can be solved by specifying the fields argument, like this (see this similar question for more details):

    bq_table_upload(
      bq_table(ds, tablename),
      df2,
      fields = list(
        bq_field("zip", "string"),
        bq_field("val", "integer")
      )
    )
    

    UPDATE:

    Looking into the code, bq_table_upload calls bq_perform_upload, which takes the fields argument as the schema. In the end, it serializes the data frame to a JSON file and uploads that to BigQuery.
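
    If you would rather not hand-write the field list, it can also be derived from the data frame itself; a minimal sketch using as_bq_fields(), which bigrquery exports (worth verifying against your installed version):

    # derive the schema from df2's column classes, so the upload
    # sends the same types the table was created with
    bq_table_upload(
      bq_table(ds, tablename),
      df2,
      fields = as_bq_fields(df2)
    )

    This keeps the uploaded schema in sync with the column classes of df2, so zip stays a STRING as long as it is character in R.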