For one of my classes, I have to analyze a "big data" dataset. I found the following dataset on the AWS Registry of Open Data that seems interesting:
https://registry.opendata.aws/openaq/
How exactly can I create a connection and load this dataset into Databricks? I've tried the following:
df = spark.read.format("text").load("s3://openaq-fetches/")
However, I receive the following error:
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
Also, it seems that this dataset has multiple folders. How do I access a particular folder in Databricks, and if possible, can I focus on a particular time range? Let's say, from 2016 to 2020?
Ultimately, I would like to perform various SQL queries in order to analyze the dataset and perhaps create some visualizations as well. Thank you in advance.
If you browse the bucket, you'll see that there are multiple datasets there, in different formats, that will require different access methods. So you need to point to the specific folder (and maybe its subfolder to load data). Like, to load daily
dataset you need to use CSV format:
df = spark.read.format("csv").option("inferSchema", "true")\
.option("header", "false").load("s3://openaq-fetches/daily/")
To load only subset of the data you can use path filters, for example. See Spark documentation on loading data.
P.S. the inferSchema
isn't very optimal from performance standpoint, so it's better to explicitly provide schema when reading.