Tags: pyspark, databricks-community-edition

Cannot apply count() or collect() on RDD from textFile() (Spark)


I am new to Spark and I have a Databricks Community Edition account. Right now I'm doing a lab and encountered the following error:

!rm README.md* -f 
!wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md

textfile_rdd = sc.textFile("README.md")
textfile_rdd.count()

Output:

IllegalArgumentException: Path must be absolute: dbfs:/../dbfs/README.md

Solution

  • By default, wget downloads your file to /databricks/driver. To be able to read it, you have to store it in the Databricks File System (DBFS) instead, using wget's -P option (see the wget manual for reference). It also seems that the !wget magic creates a file that is not reachable via the dbfs:/ path; on Databricks Community Edition, !wget leads to a file-not-found error, as you mentioned.

    You can do the following in a %sh cell first:

    %sh
    rm README.md* -f 
    wget https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md -P /dbfs/downloads/
    

    And then, in a second Python cell, you can access the file through the Files API (note the path starting with file:/):

    textfile_rdd = sc.textFile("file:/dbfs/downloads/README.md")
    textfile_rdd.count()
    
    --2022-02-11 13:48:19--  https://raw.githubusercontent.com/carloapp2/SparkPOT/master/README.md
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 3624 (3.5K) [text/plain]
    Saving to: ‘/dbfs/FileStore/README.md.1’
    
    README.md.1         100%[===================>]   3.54K  --.-KB/s    in 0.001s  
    
    2022-02-11 13:48:19 (4.10 MB/s) - ‘/dbfs/FileStore/README.md.1’ saved [3624/3624]
    
    Out[25]: 98
    
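    As a sanity check on that result: `sc.textFile` produces one RDD element per line, so `count()` is simply the file's line count. A minimal plain-Python sketch of the same idea (the sample text here is made up, not the actual README contents):

    ```python
    # Plain-Python equivalent of sc.textFile(path).count():
    # textFile yields one element per line, so count() is the line count.
    from io import StringIO

    # Stand-in for an open file handle (hypothetical contents).
    sample = StringIO("# SparkPOT\n\nA small demo repo\n")
    line_count = sum(1 for _ in sample)
    print(line_count)  # 3
    ```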

    This solution has been tested on Databricks Community Edition with the 7.1 LTS ML and 9.1 LTS ML Databricks Runtimes.
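To clarify the path schemes involved, here is a sketch of a hypothetical helper (`to_spark_uris` is not a Databricks API, just an illustration): on Databricks, a driver-local FUSE path under `/dbfs/` refers to the same file as the corresponding `dbfs:/` URI, while `file:/` addresses the driver's local filesystem directly.

```python
# Hypothetical helper mapping a driver-local path to Spark path URIs.
# /dbfs/<x> (FUSE mount) and dbfs:/<x> name the same DBFS file;
# file:/<path> reads from the driver's local filesystem.
def to_spark_uris(local_path):
    uris = {"file": "file:" + local_path}
    if local_path.startswith("/dbfs/"):
        uris["dbfs"] = "dbfs:/" + local_path[len("/dbfs/"):]
    return uris

print(to_spark_uris("/dbfs/downloads/README.md"))
# {'file': 'file:/dbfs/downloads/README.md', 'dbfs': 'dbfs:/downloads/README.md'}
```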