
Issue with connecting data in Databricks from a data lake and reading JSON into Folium


I'm working on something based on this blog post:

https://python-visualization.github.io/folium/quickstart.html#Getting-Started, specifically part 13, using Choropleth maps.

The piece of code they use is the following:

import folium
import pandas as pd

url = (
    "https://raw.githubusercontent.com/python-visualization/folium/master/examples/data"
)
state_geo = f"{url}/us-states.json"
state_unemployment = f"{url}/US_Unemployment_Oct2012.csv"
state_data = pd.read_csv(state_unemployment)

m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.id",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Unemployment Rate (%)",
).add_to(m)

folium.LayerControl().add_to(m)

m

If I use this, I get the expected map.

Now I'm trying to do this with my own data. I work in Databricks, so I have a JSON file with the GeoJSON data (source_file1) and a CSV file (source_file2) with the data that needs to be plotted on the map.

source_file1 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"
state_geo = spark.read.json(source_file1,multiLine=True)

source_file2 = "dbfs:/mnt/sandbox/MAARTEN/TOPO/DATASVZ.csv"
df_2 = (
    spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("delimiter", ";")
    .load(source_file2)
)
state_data = df_2.toPandas()

I adjusted the code accordingly:

m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=state_geo,
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)

folium.LayerControl().add_to(m)

m

So when I pass the geo_data parameter as a Spark DataFrame, I get the following error:

ValueError: Cannot render objects with any missing geometries: DataFrame[features: array<struct<geometry:struct<coordinates:array<array<array<string>>>,type:string>,properties:struct<arr_fr:string,arr_nis:bigint,arr_nl:string,fill:string,fill-opacity:double,name_fr:string,name_nl:string,nis:bigint,population:bigint,prov_fr:string,prov_nis:bigint,prov_nl:string,reg_fr:string,reg_nis:string,reg_nl:string,stroke:string,stroke-opacity:bigint,stroke-width:bigint>,type:string>>, type: string]

I think that when the data is transformed from the blob format in the Azure data lake into a Spark DataFrame, something goes wrong with the format. I tested this in a Jupyter notebook on my desktop, loading the data straight from the file into Folium, and it all works.
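To illustrate what works locally (a minimal sketch; the file name is just an illustrative local copy), I load the GeoJSON with the json module into a plain Python dict, since folium.Choropleth accepts a dict, a file path, or a GeoJSON string for geo_data, but not a Spark DataFrame:

import json

# Load the GeoJSON straight from disk into a plain Python dict.
with open("Belgie_GEOJSON.JSON") as f:  # illustrative local copy of the file
    state_geo = json.load(f)

# state_geo can now be passed as geo_data to folium.Choropleth.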

If I load it directly from the source instead, like the example does with its web page, and adjust the 'geo_data' parameter of the Folium function:

m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=source_file1,  # this now points directly at the data lake path
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)

folium.LayerControl().add_to(m)

m

I get the error:

Use "/dbfs", not "dbfs:": the function expects a local file path; the error is caused by passing a path prefixed with "dbfs:".

So I started wondering what the difference is between my JSON file and the one from the blog post. The only thing I can imagine is that the Azure data lake doesn't store my JSON as a JSON but as a block blob file, and for some reason I am not converting it properly so that Folium can read it.
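One quick way to check the raw content (a sketch; dbutils is available in Databricks notebooks):

# Peek at the first bytes of the file as stored in the data lake.
# If this prints ordinary GeoJSON text ({"type": "FeatureCollection", ...}),
# the blob content itself is fine and only the way it is read is the problem.
print(dbutils.fs.head(source_file1, 500))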

[Screenshot: Azure Blob Storage (data lake)]

So can someone with Folium knowledge let me know: A. Is it not possible to load the geo_data directly from a data lake? B. In what format do I need to upload the data?

Any thoughts on this would be helpful!

Thanks in advance!


Solution

  • Solved this issue: I just had to replace "dbfs:" with "/dbfs". I had tried it many times before, but I used "/dbfs:" and got another error.

    Can't believe I was this stupid :-)
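For completeness, a sketch of the working call (only the path prefix changes; the rest is as in the question):

source_file1 = "/dbfs/mnt/sandbox/MAARTEN/TOPO/Belgie_GEOJSON.JSON"  # "/dbfs/..." instead of "dbfs:/..."

m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=source_file1,  # Folium opens the local file path itself
    name="choropleth",
    data=state_data,
    columns=["State", "Unemployment"],
    key_on="feature.properties.name_nl",
    fill_color="YlGn",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="% Marktaandeel CC",
).add_to(m)

folium.LayerControl().add_to(m)

m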