Search code examples
sqlcsvobject-storage

How does IBM SQL Query know the schema I'm using in my CSV files?


I'm interested in using the IBM SQL Query service to ask questions of some files I have in object storage, but how does it know what schema I'm using in these files. How does it know the data types it should use for columns in my CSV files? I don't see any "CREATE TABLE" mechanism that would allow me to define field names and data types.


Solution

  • IBM SQL Query relies on schema at read. It performs a schema inference for each data set that is referenced by a SQLs query. IBM SQL Query uses Apache Spark as the execution engine, which also is used to do the schema inferencing.

    In case of CSV data (as well as when querying JSON data) this schema inference involved scanning the data values in order to derive the data types for each column. When your data volumes become larger and you want to optimize performance of your SQLs it is a good practice to store your data in a bit more elaborate formats, such as parquet, which combines meta data with the data values. Schema inference is very efficient and fast in this case because IBM SQL Query only needs to read the parquet footer to retrieve the schema of your data.

    This effect becomes even more evident when you want to query big data sets that are distributed across multiple partitions of objects in object storage. In this case IBM SQL query does indeed only have to retrieve one of the parquet objects (i.e. one of the partitions) and read it's footer in order to infer the schema for the entire data set.