apache-spark, kubernetes, pyspark, azure-data-lake

How to "mount" Data Lake Gen 1 without Databricks


We have PySpark code that we want to run on Kubernetes. It should pick up data from a Data Lake Gen 1 storage account. I understand that in Databricks, data lake storage has to be mounted before its files can be accessed. I want to ask: a.) is this possible without Databricks, and b.) what is the approach?


Solution

  • The easiest way I found to do this is by following the Apache Hadoop documentation for the Azure Data Lake connector. Make sure you download the correct JARs and add them to your classpath.

    You will need to set several properties in the Hadoop core-site.xml file; an example follows, using ClientCredential and OAuth2 (I replaced private info with xxxx):

    <configuration>
      <property>
          <name>fs.adl.oauth2.access.token.provider.type</name>
          <value>ClientCredential</value>
      </property>
    
      <property>
          <name>fs.adl.oauth2.refresh.url</name>
          <value>https://login.microsoftonline.com/xxxx/oauth2/token</value>
      </property>
    
      <property>
          <name>fs.adl.oauth2.client.id</name>
          <value>xxxx</value>
      </property>
    
      <property>
          <name>fs.adl.oauth2.credential</name>
          <value>xxxx</value>
      </property>
    </configuration>
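
If editing core-site.xml inside the container image is awkward, the same properties can be passed as Spark configuration instead. Below is a minimal PySpark sketch (not from the original answer) assuming the Hadoop Azure Data Lake connector JARs are already on the classpath; the tenant ID, client ID, credential, store name, and file path are placeholders.

    # Minimal sketch: supply the ADLS Gen1 OAuth2 settings via "spark.hadoop.*"
    # configuration instead of core-site.xml. All xxxx values are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("adls-gen1-read")
        .config("spark.hadoop.fs.adl.oauth2.access.token.provider.type", "ClientCredential")
        .config("spark.hadoop.fs.adl.oauth2.refresh.url",
                "https://login.microsoftonline.com/xxxx/oauth2/token")
        .config("spark.hadoop.fs.adl.oauth2.client.id", "xxxx")
        .config("spark.hadoop.fs.adl.oauth2.credential", "xxxx")
        .getOrCreate()
    )

    # Read directly from the lake using the adl:// scheme; no mount is needed.
    df = spark.read.csv("adl://xxxx.azuredatalakestore.net/path/to/data.csv", header=True)
    df.show()

If the connector JARs are not baked into the Docker image used by the Kubernetes pods, they can typically be pulled at submit time with something like --packages org.apache.hadoop:hadoop-azure-datalake:&lt;your Hadoop version&gt;, which also brings in the Azure Data Lake Store SDK dependency.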