I am trying to store a simple string as a text file in Data Lake Gen2 using Python code written in a Synapse notebook, but it doesn't seem to be straightforward.
I tried to convert the text into an RDD and then store it:
from pyspark import SparkConf
from pyspark import SparkContext
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
str = "test string"
text_path = adls_path + 'test.xml'
rdd_text = sc.parallelize(list(str)).collect()
# type(rdd_text)
rdd_text.saveAsTextFile(text_path)
but it fails with this error:
Traceback (most recent call last):
AttributeError: 'list' object has no attribute 'saveAsTextFile'
In `rdd_text = sc.parallelize(list(str)).collect()`, the result stored in `rdd_text` is a plain Python list, not an RDD, because `collect()` returns a list.
An RDD is Spark's basic abstraction: a distributed, immutable data structure.
For example, `append()` and `remove()` are methods of Python lists for adding or removing elements; in the same way, `saveAsTextFile()` is a method of RDDs for writing out a file.
Just as a `tuple` has no `append` attribute because tuples are immutable, a plain Python list has no `saveAsTextFile` attribute because it is not an RDD.
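The same immutability point can be checked in plain Python (a local sketch, no Spark needed):

```python
# Lists are mutable: append() modifies them in place.
nums = [1, 2, 3]
nums.append(4)
print(nums)  # [1, 2, 3, 4]

# Tuples are immutable: they have no append() method at all.
point = (1, 2, 3)
print(hasattr(point, "append"))  # False

# Trying it anyway raises the same kind of AttributeError
# as calling saveAsTextFile() on a plain list.
try:
    point.append(4)
except AttributeError as e:
    print(e)
```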
Hence, instead of `rdd_text = sc.parallelize(list(str)).collect()`,
use `rdd_text = sc.parallelize(list(str))`
so the result stays an RDD rather than being collected into a list.
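One side note, shown here in plain Python: `list()` on a string splits it into individual characters, so each character becomes a separate RDD element (and a separate line in the saved file). To keep the whole string as a single record, wrap it in a one-element list instead:

```python
text = "test string"

# list() splits a string into characters ...
chars = list(text)
print(chars[:4])  # ['t', 'e', 's', 't']

# ... so parallelize(list(text)) would write one character per line.
# To save the whole string as one line, use a one-element list:
records = [text]
print(records)  # ['test string']
```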
from pyspark import SparkConf
from pyspark import SparkContext

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
string = "test string"
adls_path = "abfss://data@xxxxxxxx.dfs.core.windows.net/"
text_path = adls_path + 'test.txt'
rdd_text = sc.parallelize(list(string))  # keep it as an RDD; do not call collect()
rdd_text.saveAsTextFile(text_path)
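If the goal is only to persist one small string, driver-side Python file I/O is simpler than an RDD. Here is a minimal sketch against a local temp directory (an assumption for illustration: an `abfss://` path would instead need Spark or the Azure Storage SDK, and the `test.txt` name is hypothetical):

```python
import os
import tempfile

text = "test string"

# Write the string with ordinary file I/O (works for local or mounted paths).
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "test.txt")
with open(out_path, "w") as f:
    f.write(text)

# Read it back to confirm the round trip.
with open(out_path) as f:
    print(f.read())  # test string
```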