Tags: apache-spark, pyspark, transformer-model

Spark The Definitive Guide: Chapter 25 - Preprocessing and Feature Engineering


I do not understand when to use both 'fit' and 'transform' versus when to use 'transform' only.

The following transformers use both fit and transform:

  • RFormula
  • QuantileDiscretizer
  • StandardScaler
  • MinMaxScaler
  • MaxAbsScaler
  • StringIndexer
  • VectorIndexer
  • CountVectorizer
  • PCA
  • ChiSqSelector

The following transformers only use transform:

  • SQLTransformer
  • VectorAssembler
  • Bucketizer
  • ElementwiseProduct
  • Normalizer
  • IndexToString
  • OneHotEncoder
  • Tokenizer
  • RegexTokenizer
  • StopWordsRemover
  • NGram

What is the intuition for which transformers need both fit and transform, and which need transform only?

Kindly explain. Thanks.


Solution

  • Ultimately, all of these components exist to 'transform' data: to index, scale, bucketize, and so on. Some of them do not need to know anything about the data to do their work. For example, StopWordsRemover simply removes the words in a fixed stop-word list, regardless of what the data contains.

    Other components do need to learn something from the data before they can transform it correctly. For example, MinMaxScaler must first scan the data to find its minimum and maximum before it can rescale values.

    So, all of these expose a transform() method, but the second group are estimators: they must first be fit() on the data. In Spark ML, fit() returns a fitted model (e.g., a MinMaxScalerModel) whose transform() then applies the learned statistics.
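The distinction can be sketched in plain Python (this is not the actual Spark API, just a minimal illustration of the pattern; class and method names loosely mirror Spark ML):

```python
class StopWordsRemover:
    """Transform-only: its parameters (the stop-word list) are fixed up
    front, so no pass over the data is needed before transforming."""
    def __init__(self, stop_words):
        self.stop_words = set(stop_words)

    def transform(self, tokens):
        # Drop every token that appears in the stop-word list.
        return [t for t in tokens if t not in self.stop_words]


class MinMaxScaler:
    """Estimator: must first see the data to learn its min and max.
    fit() returns a fitted model; only the model can transform()."""
    def fit(self, values):
        return MinMaxScalerModel(min(values), max(values))


class MinMaxScalerModel:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def transform(self, values):
        # Rescale into [0, 1] using the statistics learned in fit().
        span = (self.hi - self.lo) or 1.0
        return [(v - self.lo) / span for v in values]


# Transform-only: works immediately, no fitting step.
remover = StopWordsRemover(["the", "a"])
print(remover.transform(["the", "quick", "fox"]))  # ['quick', 'fox']

# Estimator: fit first, then transform with the returned model.
model = MinMaxScaler().fit([2.0, 4.0, 6.0])
print(model.transform([2.0, 4.0, 6.0]))            # [0.0, 0.5, 1.0]
```

In real PySpark the shape is the same: `StopWordsRemover(...).transform(df)` works directly, while `MinMaxScaler(...).fit(df).transform(df)` needs the intermediate fitted model.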