python, pyspark, feature-extraction, feature-selection

Keep Feature Definitions in a Dictionary and Return the Feature to the Client


I have the dictionary below, which keeps feature definitions as strings.

    features = {
        "journey_email_been_sent_flag": "F.when(F.col('email_14days') > 0, F.lit(1)).otherwise(F.lit(0))",
        "journey_opened_flag": "F.when(F.col('opened_14days') > 0, F.lit(1)).otherwise(F.lit(0))",
    }
    retrieved_features = {}
    non_retrieved_features = {}

Alternatively, I keep the definitions themselves (Column expressions) instead of strings.

    from pyspark.sql import functions as F

    features = {
        "journey_email_been_sent_flag": F.when(F.col('email_14days') > 0, F.lit(1)).otherwise(F.lit(0)),
        "journey_opened_flag": F.when(F.col('opened_14days') > 0, F.lit(1)).otherwise(F.lit(0)),
    }

Then I use the code below to retrieve the feature definitions.

    def feature_extract(*featurenames):
        # Collect the requested definitions into the module-level dictionaries.
        for featurename in featurenames:
            if featurename in features:
                print(f"{featurename} : {features[featurename]}")
                retrieved_features[featurename] = features[featurename]
            else:
                print('failure')
                non_retrieved_features[featurename] = "Not found in the feature definition"
        # Note: this returns the whole dictionary, not a single Column.
        return retrieved_features
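
Because the function above writes into module-level dictionaries, results from earlier calls stay around and accumulate. A minimal sketch of a variant that builds fresh containers on every call is shown here; the name `lookup_features` and the explicit `registry` parameter are hypothetical, not part of the original code.

```python
def lookup_features(registry, *featurenames):
    """Split the requested names into found definitions and missing names.

    `registry` is any mapping of feature name -> definition (hypothetical
    parameter; in the question this would be the `features` dictionary).
    """
    found = {}
    missing = []
    for name in featurenames:
        if name in registry:
            found[name] = registry[name]
        else:
            missing.append(name)
    # Fresh objects are returned each time, so calls do not affect each other.
    return found, missing
```

The caller can then decide what to do with the missing names, for example raise an error or log them, instead of inspecting a global dictionary afterwards.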

And this is how I call the function to retrieve the features:

    feature_extract('journey_email_been_sent_flag', 'journey_opened_flag')

However, it is not working when I try to retrieve the feature. I receive the result below when keeping the definitions in the dictionary:

    Out[19]: {'journey_email_been_sent_flag': Column<b'CASE WHEN (email_14days > 0) THEN 1 ELSE 0 END'>}

When I use the retrieved feature in the DataFrame as below:

    .withColumn('journey_email_been_sent_flag', feature_extract('journey_email_been_sent_flag'))

I get the error below:

    AssertionError: col should be Column

Solution

  • I was able to fix it this way.

    I keep the feature definitions as Column expressions rather than strings:

        features = {
            "journey_email_been_sent_flag": F.when(F.col('email_14days') > 0, F.lit(1)).otherwise(F.lit(0)),
            "journey_opened_flag": F.when(F.col('opened_14days') > 0, F.lit(1)).otherwise(F.lit(0)),
        }

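    Then I call the feature_extract function, take the Column out of the returned dictionary with .get, and wrap it in F.lit. feature_extract returns a dictionary, so .get is what pulls out the Column that withColumn expects:

        F.lit(feature_extract('journey_email_been_sent_flag').get('journey_email_been_sent_flag'))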
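
Since `withColumn` needs a Column and `feature_extract` returns a dictionary, an alternative is a lookup that returns the stored definition itself. The sketch below assumes the registry holds Column expressions, as in the second `features` dictionary; `get_feature` and `add_features` are hypothetical helper names. As far as I know, `F.lit` passes an existing Column through unchanged, so the wrapper should not be needed once the lookup returns the Column directly.

```python
def get_feature(registry, name):
    """Return the single definition stored under `name`."""
    try:
        return registry[name]
    except KeyError:
        # Fail loudly instead of handing a placeholder to withColumn.
        raise KeyError(f"{name} not found in the feature definitions") from None


def add_features(df, registry, *names):
    """Add one column per requested feature by calling df.withColumn."""
    for name in names:
        df = df.withColumn(name, get_feature(registry, name))
    return df


# With PySpark, assuming `features` maps names to Column expressions:
#   df = add_features(df, features,
#                     "journey_email_been_sent_flag", "journey_opened_flag")
```

Raising on an unknown name surfaces a typo at lookup time rather than as an assertion error inside `withColumn`.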