python · apache-spark · pyspark

AttributeError: 'NoneType' object has no attribute 'randomSplit'


I keep getting an error when trying to call randomSplit in PySpark.

I've added these dependencies:

#Step 1: Install Dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
!pip install -q findspark

#Step 2: Add environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"

#Step 3: Initialize Pyspark
import findspark
findspark.init()

Created the pySpark environment:

#creating spark context
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('lr_example').getOrCreate()

and added these:

# Import VectorAssembler and Vectors
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

However, every time I run this:

final_df = output.select("features", "medv").show()
train_data, test_data = final_df.randomSplit([0.7, 0.3])

I get this:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-76-e27b8ca71b51> in <cell line: 1>()
----> 1 train_data, test_data = final_df.randomSplit([0.7, 0.3])

AttributeError: 'NoneType' object has no attribute 'randomSplit'

Any ideas? I searched around for what needs to be imported, and it seems I have everything, but it won't load. Link to GitHub doc


Solution

  • You overlooked the one important line:

    final_df = output.select("features", "medv").show()

    show() prints the DataFrame but returns None, so you are assigning None to final_df. Calling randomSplit on it then fails with exactly the AttributeError you see.

    Instead, separate the two operations:

    final_df = output.select("features", "medv")  # create the DataFrame
    final_df.show()                               # print it
    train_data, test_data = final_df.randomSplit([0.7, 0.3])
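
    The same pitfall shows up with any Python method that is called for its side effect. As a plain-Python illustration (no Spark needed), list.sort() also mutates in place and returns None:

    ```python
    nums = [3, 1, 2]

    # sort() mutates the list in place and returns None,
    # just like DataFrame.show() prints and returns None
    result = nums.sort()

    print(result)  # None
    print(nums)    # [1, 2, 3]
    ```

    Whenever an assignment unexpectedly yields None, check whether the last call in the chain is a side-effect method (show(), sort(), append(), ...) rather than one that returns a new object.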