apache-spark, apache-spark-sql, pyspark, sparklines

How to pass a list of values, JSON, PySpark


    >>> from pyspark.sql import SQLContext
    >>> sqlContext = SQLContext(sc)
    >>> rdd = sqlContext.jsonFile("tmp.json")
    >>> rdd_new = rdd.map(lambda x: (x.name, x.age))

It's working properly. But there is a list of values, list1 = ["name","age","gene","xyz",.....]. When I pass each value in a loop:

    for each_value in list1:
        rdd_new = rdd.map(lambda x: x.each_value)

I am getting an error.
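For context, the error comes from the attribute lookup itself: `x.each_value` asks the Row for a field literally named `each_value`, so the loop variable is never substituted. A minimal sketch of a fix, assuming the same `rdd` and `list1` as above, resolves each field by its string name with Python's built-in `getattr`:

    >>> # getattr(x, f) looks the field up by name at runtime,
    >>> # unlike x.each_value, which looks for a field called "each_value"
    >>> rdd_new = rdd.map(lambda x: [getattr(x, f) for f in list1])

This collects every field in `list1` from each row in one pass, instead of reassigning `rdd_new` on each loop iteration.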

Solution

  • I think what you need is to pass the names of the fields you want to select. In that case, see the following:

    r1 = ssc.jsonFile("test.json")
    r1.printSchema()
    r1.show()

    l1 = ['number', 'string']
    s1 = r1.select(*l1)   # *l1 unpacks the list into select()'s arguments
    s1.printSchema()
    s1.show()
    
    root
     |-- array: array (nullable = true)
     |    |-- element: long (containsNull = true)
     |-- boolean: boolean (nullable = true)
     |-- null: string (nullable = true)
     |-- number: long (nullable = true)
     |-- object: struct (nullable = true)
     |    |-- a: string (nullable = true)
     |    |-- c: string (nullable = true)
     |    |-- e: string (nullable = true)
     |-- string: string (nullable = true)
    
    array                boolean null number object  string     
    ArrayBuffer(1, 2, 3) true    null 123    [b,d,f] Hello World
    root
     |-- number: long (nullable = true)
     |-- string: string (nullable = true)
    
    number string     
    123    Hello World
    

    This is done through a DataFrame. Note the way the argument list is passed: `select(*l1)` unpacks the list into separate column-name arguments. For more, you can see this link.
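    For completeness, here is a sketch of the same selection on current Spark, where `jsonFile` is no longer available (`SQLContext.jsonFile` was deprecated in Spark 1.4 and removed in 2.0 in favor of `spark.read.json`). The file name and field names are reused from the question; the session setup is an assumption:

    # minimal sketch, assuming Spark 2.x+ with a standard SparkSession
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.json("tmp.json")   # modern replacement for sqlContext.jsonFile
    list1 = ["name", "age"]            # the field names from the question
    df.select(*list1).show()           # same *-unpacking of the list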