Search code examples
pythondatabricksnetcdfnetcdf4azure-data-lake-gen2

Read .nc files from Azure Datalake Gen2 in Azure Databricks


Trying to read .nc (netCDF4) files in Azure Databricks.

Never worked with .nc files

  1. All the required .nc files are in Azure Datalake Gen2
  2. Mounted above files into Databricks at "/mnt/eco_dailyRain"
  3. Can list the content of mount using dbutils.fs.ls("/mnt/eco_dailyRain") OUTPUT:

    Out[76]: [FileInfo(path='dbfs:/mnt/eco_dailyRain/2000.daily_rain.nc', name='2000.daily_rain.nc', size=429390127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc', name='2001.daily_rain.nc', size=428217143),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2002.daily_rain.nc', name='2002.daily_rain.nc', size=428218181),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2003.daily_rain.nc', name='2003.daily_rain.nc', size=428217139),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2004.daily_rain.nc', name='2004.daily_rain.nc', size=429390143),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2005.daily_rain.nc', name='2005.daily_rain.nc', size=428217137),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2006.daily_rain.nc', name='2006.daily_rain.nc', size=428217127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2007.daily_rain.nc', name='2007.daily_rain.nc', size=428217143),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2008.daily_rain.nc', name='2008.daily_rain.nc', size=429390137),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2009.daily_rain.nc', name='2009.daily_rain.nc', size=428217127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2010.daily_rain.nc', name='2010.daily_rain.nc', size=428217134),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2011.daily_rain.nc', name='2011.daily_rain.nc', size=428218181),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2012.daily_rain.nc', name='2012.daily_rain.nc', size=429390127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2013.daily_rain.nc', name='2013.daily_rain.nc', size=428217143),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2014.daily_rain.nc', name='2014.daily_rain.nc', size=428218104),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2015.daily_rain.nc', name='2015.daily_rain.nc', size=428217134),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2016.daily_rain.nc', name='2016.daily_rain.nc', size=429390127),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2017.daily_rain.nc', name='2017.daily_rain.nc', size=428217223),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2018.daily_rain.nc', name='2018.daily_rain.nc', size=418143765),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/2019.daily_rain.nc', name='2019.daily_rain.nc', size=370034113),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/Consignments.parquet', name='Consignments.parquet', size=237709917),
     FileInfo(path='dbfs:/mnt/eco_dailyRain/test.nc', name='test.nc', size=428217137)]
    

Just to test wether can read from mount.

spark.read.parquet('dbfs:/mnt/eco_dailyRain/Consignments.parquet')

confirms can read parquet file.

output

Out[83]: DataFrame[CONSIGNMENT_PK: int, CERTIFICATE_NO: string, ACTOR_NAME: string, GENERATOR_FK: int, TRANSPORTER_FK: int, RECEIVER_FK: int, REC_POST_CODE: string, WASTEDESC: string, WASTE_FK: int, GEN_LICNUM: string, VOLUME: int, MEASURE: string, WASTE_TYPE: string, WASTE_ADD: string, CONTAMINENT1_FK: int, CONTAMINENT2_FK: int, CONTAMINENT3_FK: int, CONTAMINENT4_FK: int, TREATMENT_FK: int, ANZSICODE_FK: int, VEH1_REGNO: string, VEH1_LICNO: string, VEH2_REGNO: string, VEH2_LICNO: string, GEN_SIGNEE: string, GEN_DATE: timestamp, TRANS_SIGNEE: string, TRANS_DATE: timestamp, REC_SIGNEE: string, REC_DATE: timestamp, DATECREATED: timestamp, DISCREPANCY: string, APPROVAL_NUMBER: string, TR_TYPE: string, REC_WASTE_FK: int, REC_WASTE_TYPE: string, REC_VOLUME: int, REC_MEASURE: string, DATE_RECEIVED: timestamp, DATE_SCANNED: timestamp, HAS_IMAGE: string, LASTMODIFIED: timestamp]

But trying to read netCDF4 files says No such file or directory

Code:

import datetime as dt  # Python standard library datetime  module
import numpy as np
from netCDF4 import Dataset  # http://code.google.com/p/netcdf4-python/
import matplotlib.pyplot as plt

rootgrp = Dataset("dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")

Error

FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'

Any clues.


Solution

  • According to the API reference of netCDF4 module for class Dataset, as the figure below.

    enter image description here

    The value of the path parameter for Dataset should be a path of unix directory format, but the path dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc is a format for PySpark as I known, so you got the error FileNotFoundError: [Errno 2] No such file or directory: b'dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc'.

    The solution to fix it is to change the path value dbfs:/mnt/eco_dailyRain/2001.daily_rain.nc with the equivalence unix path /dbfs/mnt/eco_dailyRain/2001.daily_rain.nc, the code as below.

    rootgrp = Dataset("/dbfs/mnt/eco_dailyRain/2001.daily_rain.nc","r", format="NETCDF4")
    

    You can check it via the code below to see it.

    %sh
    ls /dbfs/mnt/eco_dailyRain
    

    Ofcouse, you also can list your data files of netCDF4 format via dbutils.fs.ls('/mnt/eco_dailyRain') if you had mount it.