Tags: pyspark, python-import

Having to import PySpark classes/methods in a very piecemeal fashion


I am spinning up on Python and PySpark, installed using Anaconda on Windows 10. For now, I'm working through sparkbyexamples.com pages, e.g., here, here, here.

I'm surprised by how many classes and methods need to be imported piecemeal, e.g., SparkSession, StructType, StructField, StringType, IntegerType, Row, col, Column, etc. Not all the imports are specified in the tutorial material, so one has to recursively search the *.py files under the %SPARK_HOME% tree to find them, e.g., using find, sed, and/or vimgrep. This is not efficient.

I would have expected that, for efficient analytics, many of the classes and methods used in a particular application domain would be accessible through a single import, or at most a few. How do Python users avoid having to hunt for the right classes/methods and import them piecemeal?


Solution

  • Rather than importing many individual symbols, you can import a module (namespace) directly and access the required symbols through it. This keeps the import list short and makes it obvious where each symbol comes from.

    # Alias the types and functions modules once, then qualify each symbol
    # through T.* / F.* instead of importing every class individually.
    from pyspark.sql import types as T, functions as F

    schema = T.StructType([
        T.StructField('firstname', T.StringType(), True),
        T.StructField('middlename', T.StringType(), True),
        T.StructField('lastname', T.StringType(), True)
    ])
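
The same namespaced style covers the DataFrame helpers mentioned in the question (col, Column, etc.). Below is a minimal sketch assuming a local Spark installation; the app name and sample rows are illustrative only:

    from pyspark.sql import SparkSession, functions as F

    # Illustrative only: any SparkSession works; the app name is arbitrary.
    spark = SparkSession.builder.appName('namespace-imports').getOrCreate()

    # Made-up sample rows matching the schema defined above.
    data = [('James', '', 'Smith'), ('Anna', 'Maria', 'Jones')]
    df = spark.createDataFrame(data, schema)

    # Column helpers such as col() and concat_ws() are reached through F.*
    df.select(
        F.concat_ws(' ', F.col('firstname'), F.col('lastname')).alias('fullname')
    ).show()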