I keep getting an error when calling randomSplit in PySpark.
I've added these dependencies:
#Step 1: Install Dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
!pip install -q findspark
#Step 2: Add environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"
#Step 3: Initialize Pyspark
import findspark
findspark.init()
Then I created the Spark session:
# Create the Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()
and added these imports:
# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
However, every time I run this:
final_df = output.select("features", "medv").show()
train_data, test_data = final_df.randomSplit([0.7, 0.3])
I get this:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-76-e27b8ca71b51> in <cell line: 1>()
----> 1 train_data, test_data = final_df.randomSplit([0.7, 0.3])
AttributeError: 'NoneType' object has no attribute 'randomSplit'
Any ideas? I searched around for what needs to be imported, and it seems I have everything, but the split still fails. Link to Github doc
The only line that matters here is this one:
final_df = output.select("features", "medv").show()
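show() prints the rows but returns None, so this line sets final_df to None rather than to a DataFrame. That is why the later randomSplit call raises the AttributeError; it has nothing to do with missing imports.
Instead, assign the DataFrame first and call show() on it separately: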
final_df = output.select("features", "medv") # create df
final_df.show() # print it
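The same pitfall can be reproduced in plain Python, without Spark, which may make it clearer where the None comes from. This is only a minimal sketch: the `show` function and the `rows` list below are hypothetical stand-ins for DataFrame.show() and your data, not PySpark code.

```python
# Hypothetical stand-in for DataFrame.show(): it prints its input and,
# having no return statement, implicitly returns None.
def show(rows):
    for row in rows:
        print(row)

# Hypothetical sample data standing in for the selected columns.
rows = [("features_1", 24.0), ("features_2", 21.6)]

# Assigning the result of show() captures its return value (None),
# not the data that was printed.
result = show(rows)
print(result is None)

# Calling any method on that None fails the same way as in the question.
try:
    result.randomSplit([0.7, 0.3])
except AttributeError as exc:
    message = str(exc)
    print(message)
```

Any method that exists only for its side effect (printing, writing, sorting in place) tends to behave this way, so keep the assignment and the side-effect call on separate lines.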