Tags: python, azure, azure-synapse, azure-data-lake-gen2

Store a simple string as a text file in Azure Synapse (to Data Lake Gen2)


I am trying to store a simple string as a text file in Data Lake Gen2 with Python code written in a Synapse notebook, but it doesn't seem to be straightforward.

I tried to convert the text into an RDD and then save it:

from pyspark import SparkConf
from pyspark import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
str = "test string"

text_path = adls_path + 'test.xml'

rdd_text = sc.parallelize(list(str)).collect()
# type(rdd_text)

rdd_text.saveAsTextFile(text_path)

but it gives the following error:

Traceback (most recent call last):

AttributeError: 'list' object has no attribute 'saveAsTextFile'

Solution

  • In rdd_text = sc.parallelize(list(str)).collect(), the result is stored in rdd_text as a plain Python list, because collect() is an ordinary Python call that returns a list.

    An RDD is a distributed data structure and the basic abstraction in Spark, and it is immutable.

    For example, remove() and append() are methods of a Python list for removing or adding elements; in the same way, saveAsTextFile() is a method of an RDD for writing it out to a file.

    Just as a tuple has no append attribute because tuples are immutable, RDDs are immutable as well; a quick check after the solution code below illustrates the list-versus-RDD difference.

    Hence, instead of rdd_text = sc.parallelize(list(str)).collect(), use rdd_text = sc.parallelize(list(str)) so the result is not collected into a list (a caveat about list(string) follows the code below).

    from pyspark import SparkConf
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

    string = "test string"
    # ADLS Gen2 container path (account name masked)
    adls_path = "abfss://data@xxxxxxxx.dfs.core.windows.net/"

    text_path = adls_path + 'test.txt'

    # parallelize() without collect() keeps the data as an RDD,
    # and an RDD does have saveAsTextFile()
    rdd_text = sc.parallelize(list(string))

    rdd_text.saveAsTextFile(text_path)
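
    A quick way to see the difference described above (a minimal sketch that reuses the same sc): collect() hands back a plain Python list, which has no saveAsTextFile, while parallelize() returns an RDD, which does.

    # Sketch: list returned by collect() vs. RDD returned by parallelize()
    collected = sc.parallelize(list("test string")).collect()
    rdd = sc.parallelize(list("test string"))

    print(type(collected))                       # <class 'list'>
    print(hasattr(collected, "saveAsTextFile"))  # False
    print(type(rdd))                             # <class 'pyspark.rdd.RDD'>
    print(hasattr(rdd, "saveAsTextFile"))        # True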
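
    One caveat worth flagging: list(string) splits the string into its individual characters, so saveAsTextFile writes one character per line. If the goal is to keep the whole string on a single line, a minimal variant (assuming the same sc, and a text_path that does not already exist, since saveAsTextFile refuses to overwrite) is:

    # Wrap the string in a one-element list so the whole string
    # becomes a single record, i.e. a single output line
    rdd_text = sc.parallelize(["test string"])
    rdd_text.saveAsTextFile(text_path)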