Search code examples
pythonmysqlpysparkgoogle-colaboratory

How to read & write Local MySQL Server 8 from Google Colab with Pyspark?


I have been trying but failing to write/read tables from MySQL Server 8.0.19 on localhost on Windows 10 with pyspark from Google colab. There's also a lot of similar questions and with some suggested answers but none of the solutions seem to work here. Here is my code:

    <...installations  ...>

        from pyspark.sql import SparkSession

        spark = SparkSession\
        .builder\
        .appName("Word Count")\
        .config("spark.driver.extraClassPath", "/content/spark-2.4.5-bin-hadoop2.7/jars/mysql-connector-java-8.0.19.jar")\
        .getOrCreate()

An here is the connection string:

MyjdbcDF = spark.read.format("jdbc")\
                    .option("url", "jdbc:mysql://127.0.0.1:3306/mydb?user=testuser&password=pwtest")\
                    .option("dbtable", "collisions")\
                    .option("driver","com.mysql.cj.jdbc.Driver")\
                    .load()

I have as well used the .option("driver","com.mysql.jdbc.Driver") but still keep getting this error:

Py4JJavaError: An error occurred while calling o154.load.
com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.

...
...
...
Caused by: java.net.ConnectException: Connection refused (Connection refused)

From this, I guess that MySQL Sever is not reachable. I have Telnetted to port 3306 & it confirmed that MySQL Server is accepting connections from client machine. I have read that running: netsh advfirewall firewall add rule name="MySQL Server" action=allow protocol=TCP dir=in localport=3306 will permitting firewall rule for MySQL Server incase it was being blocked, yet no change.

Can somebody help outpy?


Solution

  • After about several days of trials I discovered a solution, that is why I am going to answer my own question. I was able to connect using a WAMP server (thanks to @Shubham Jain for suggesting) and as well as without a WAMP server. This answer is without WAMP server.

    Downloaded ngrok from https://ngrok.com/,
    Unzipped it,
    Saved it on my local windows,
    Authinticate with:
    ./ngrok authtoken xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    (pretty straight forward instructions available on the website)

    Still on my local Windows, I copied and ran ngrok tcp 3306 on the command line

    C:\Users\userMe> ngrok tcp 3306

    and it gave someting like:

    ngrok by @inconshreveable                                                             
    Session Status                online
    Account                       userMe (Plan: Free)
    Version                       2.3.35
    Region                        United States (us)
    Web Interface                 http://localhost:4041
    Forwarding                    tcp://0.tcp.ngrok.io:17992 -> localhost:3306
    
    Connections                   ttl     opn     rt1     rt5     p50     p90
                                  0       0       0.00    0.00    0.00    0.00
    

    Where 0.tcp.ngrok.io:17992 is the only thing I am interested in and where 3306 is MySQL's and only port I am interested at exposing to the internet to link with my Google Colab.

    So, at the end of the day, my PySpark READ connection will look like:

    jdbcDF = spark.read.format("jdbc")\
                        .option("url", "jdbc:mysql://0.tcp.ngrok.io:17992/mydb?user=testUser&password=pestpw")\
                        .option("dbtable", "pipeLineTable")\
                        .option("driver","com.mysql.cj.jdbc.Driver")\
                        .load(); 
    

    The WRITE connection will be:

    jdbcDF.write.mode("overwrite")\
        .format("jdbc")\
        .option("url",  f"jdbc:mysql://0.tcp.ngrok.io:17992/mydb")\
        .option("dbtable", "fromGcTable")\
        .option("user", "testUser")\
        .option("password", "testpw")\
        .option("driver","com.mysql.cj.jdbc.Driver")\
        .save()
    

    In both connection strings, note the 0.tcp.ngrok.io:17992 that replaces localhost:3306