>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> rdd = sqlContext.jsonFile("tmp.json")
>>> rdd_new = rdd.map(lambda x: (x.name, x.age))
It works properly. But I also have a list of field names, list1 = ["name", "age", "gene", "xyz", ...]. When I loop over it:

for each_value in list1:
    rdd_new = rdd.map(lambda x: x.each_value)

I get an error.
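I suspect this fails because `x.each_value` is looked up as a field literally named "each_value" instead of the loop variable. A minimal sketch of a workaround I have not verified, using getattr to resolve the field name dynamically (assuming each row really has the fields listed in list1):

for each_value in list1:
    # bind the current name as a default argument so the lambda does not
    # capture the loop variable lazily
    rdd_new = rdd.map(lambda x, field=each_value: getattr(x, field))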
I think what you need is to pass the names of the fields you want to select. In that case, see the following:
# ssc is a SQLContext; jsonFile loads the JSON file as a DataFrame
r1 = ssc.jsonFile("test.json")
r1.printSchema()
r1.show()

# select only the columns named in the list by unpacking it into select()
l1 = ['number', 'string']
s1 = r1.select(*l1)
s1.printSchema()
s1.show()

Running this prints:
root
|-- array: array (nullable = true)
| |-- element: long (containsNull = true)
|-- boolean: boolean (nullable = true)
|-- null: string (nullable = true)
|-- number: long (nullable = true)
|-- object: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- c: string (nullable = true)
| |-- e: string (nullable = true)
|-- string: string (nullable = true)
array boolean null number object string
ArrayBuffer(1, 2, 3) true null 123 [b,d,f] Hello World
root
|-- number: long (nullable = true)
|-- string: string (nullable = true)
number string
123 Hello World
This is done through a DataFrame. Note the way the argument list is passed: the list of column names is unpacked into select() with *. For more, you can see this link.
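Applied to the question's data, the same pattern would look roughly like this (a sketch, assuming tmp.json actually contains the fields named in list1 and sc is the usual SparkContext from the shell):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.jsonFile("tmp.json")      # DataFrame built from the question's JSON
list1 = ["name", "age", "gene", "xyz"]    # field names to keep, as in the question
selected = df.select(*list1)              # unpack the list into select()
selected.printSchema()
selected.show()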