
How to create a filter on an AWS Glue DynamicFrame that filters out a set of (literal) values


In a Glue script (running in a Zeppelin notebook connected to a Glue dev endpoint), I've created a DynamicFrame from a Glue table. I would like to filter it so that only records whose field "name" is not in a static list of values, e.g. ("a", "b", "c"), remain.

Filtering on non-equality works just fine like this:

def unknownNameFilter(rec: DynamicRecord): Boolean = { 
   rec.getField("name").exists(_ != "a")
}

I have tried several things, such as:

!rec.getField("name").exists(_ isin ("a","b","c"))

but this gives an error (value isin is not a member of Any). The only examples I can find on the web are PySpark ones, or ones that first convert the DynamicFrame to a DataFrame, which I want to avoid if possible.

Help much appreciated, thanks.


Solution

  • Okay, found my answer. I'll post it for anyone else looking for this. It is done with a plain Scala collection check:

    !(knownevents.contains(eventname))
    

    Like this in a filter function:

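    def unknownEventFilter(rec: DynamicRecord): Boolean = {
      // Event names that are already known and should be filtered out
      val knownevents = List("evt_a", "evt_b")

      rec.getField("name") match {
        // Keep the record only if its name is not in the known list
        case Some(eventname: String) => !(knownevents.contains(eventname))
        // Missing or non-string field: fail loudly
        case _ => throw new IllegalArgumentException("Unable to extract field name")
      }
    }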
    
    val dfUnknownEvents =  df.filter(unknownEventFilter)
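
For anyone who wants to try the matching logic without a Glue dev endpoint, here is a minimal sketch in plain Scala. It assumes that getField returns an Option[Any], which is what the "value isin is not a member of Any" error suggests. The names KnownNameFilterSketch, knownNames and isUnknownName are made up for this example, and treating a missing or non-string field as "unknown" instead of throwing is an assumption that differs from the function above.

```scala
// Glue-free sketch: an Option[Any] stands in for the result of rec.getField("name").
object KnownNameFilterSketch {
  // Names to filter out; a Set keeps the membership check cheap.
  val knownNames: Set[String] = Set("a", "b", "c")

  // True when the field should be kept, i.e. it is not one of the known names.
  // Assumption: missing or non-string fields are kept rather than raising an error.
  def isUnknownName(field: Option[Any]): Boolean = field match {
    case Some(name: String) => !knownNames.contains(name)
    case _                  => true
  }

  def main(args: Array[String]): Unit = {
    val samples: Seq[Option[Any]] = Seq(Some("a"), Some("z"), Some(42), None)
    // Print each sample next to the filter decision.
    samples.foreach(s => println(s"$s -> ${isUnknownName(s)}"))
  }
}
```

In a Glue script this would presumably be wired in as df.filter(rec => KnownNameFilterSketch.isUnknownName(rec.getField("name"))), but I have only checked the matching logic itself, not that call against a real DynamicFrame.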