Search code examples
pythonmysqlsqlapache-zeppelin

Zeppelin: What the best way to query data with SQL and work with it?


I want to use Zeppelin to query databases. I currently see two possibilities but none of them is sufficient for me:

  1. Configure a database connection as "interpreter", name it e.g. "sql1", use it in a paragraph, run a sql query and use the inbuilt nice plotting tools. It seems that all the tutorials and tips deal with it but then the documentation suddenly stops! But I want to do more with the data: I want to filter and process. If I want to plot it again (with other limitations), I have to do the query (that may last some seconds or minutes) again (see my other question Zeppelin SQL: reuse data of query without another interpreter or a new query)
  2. Use spark with python, scala or similar. But the documentation seems only to load csv data, put in into a dataframe and then accesses this dataframe with sql. There is no accessing the data with sql in the first place. How do I access the sql data the best way? Can I use a already configured "interpreter" (database connection)?

Solution

  • I cannot find a solution for 1. But I have made a short solution for 2. that works within zeppelin with python (2.7), sqlalchemy (sql wrapper), mysqldb (mysql implementation) and pandas (make sure that have these packages installed, all of them are in Debian 9). I wonder why I have not found such a solution before...

    %python
    from sqlalchemy import create_engine
    import pandas as pd
    
    sql = "select col1, col2 from table limit 10"
    df = pd.read_sql(sql,
    create_engine('mysql+mysqldb://user:password@host:3306/database').connect())
    
    z.show(df)
    

    If you want to connect to another database like db2 or oracle, you have to use other python packages and adjust the first part in the create_engine string.