Tags: python, apache-spark, apache-zeppelin

Convert pandas DataFrame to RDD in Zeppelin


I'm new to Zeppelin and there are things I just don't understand.

I have downloaded a table from a DB with Python, and now I would like to convert it to an RDD, but I get an error saying the table is not found. I think there's a problem finding tables created with another interpreter, but I don't really know... I tried the approaches from this and this question, but they still don't work; they create the DataFrame directly with Spark. Any help would be very useful :)

%python
import pandas as pd
from sqlalchemy import create_engine

# pull the table from MySQL into a pandas DataFrame
engine = create_engine('mysql+mysqlconnector://...')
df = pd.read_sql(query, engine)

%spark
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._

// this is where it fails: `df` was created by the %python interpreter
// and is not visible here
df.registerTempTable("df")

val result = sqlContext.sql("SELECT * FROM df LIMIT 5")
result.collect().foreach(println)

Solution

  • Converting a Pandas DataFrame to a Spark DataFrame is quite straightforward:

    %python
    import pandas
    
    pdf = pandas.DataFrame([[1, 2]]) # this is a dummy dataframe
    
    # convert your pandas dataframe to a spark dataframe
    # (sqlContext is provided by Zeppelin's Spark interpreters)
    df = sqlContext.createDataFrame(pdf)
    
    # you can register the table to use it across interpreters
    df.registerTempTable("df")
    
    # you can get the underlying RDD without changing the interpreter 
    rdd = df.rdd
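
    If you are on Spark 2.x, the same steps look slightly different: Zeppelin exposes the SparkSession as `spark`, and `createOrReplaceTempView` replaces the deprecated `registerTempTable`. A minimal sketch, assuming a Spark 2.x interpreter:

    %python
    import pandas

    pdf = pandas.DataFrame([[1, 2]])

    # Spark 2.x entry point: the SparkSession, exposed as `spark` in Zeppelin
    df = spark.createDataFrame(pdf)

    # Spark 2.x replacement for registerTempTable
    df.createOrReplaceTempView("df")

    # the underlying RDD is reached the same way
    rdd = df.rdd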
    

    To fetch it with Scala Spark, you just need to do the following:

    %spark
    val df = sqlContext.sql("select * from df")
    df.show()
    // +---+---+
    // |  0|  1|
    // +---+---+
    // |  1|  2|
    // +---+---+
    

    You can also get the underlying RDD:

    val rdd = df.rdd
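
    Each element of that RDD is an org.apache.spark.sql.Row, so you can map over it as usual. A minimal sketch, assuming the two-column dummy table from above (pandas integers arrive as longs on the Spark side):

    %spark
    val rdd = df.rdd
    // extract the two long columns from each Row and sum them
    rdd.map(row => row.getLong(0) + row.getLong(1)).collect().foreach(println)
    // prints: 3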