
How to use the LIKE operator as a JOIN condition in PySpark with a column as the pattern


I would like to do the following in PySpark (for AWS Glue jobs):

JOIN a and b ON a.name = b.name AND a.number = b.number AND a.city LIKE b.city

So for example:

Table a:

Number  Name  City
1000    Bob   %
2000    Joe   London

Table b:

Number  Name  City
1000    Bob   Boston
1000    Bob   Berlin
2000    Joe   Paris

Results:

Number  Name  City
1000    Bob   Boston
1000    Bob   Berlin
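
For reference, the sample tables can be reproduced with something like this (a minimal sketch assuming a SparkSession named spark; in a Glue job it would typically come from glueContext.spark_session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Table a: the city column holds a LIKE pattern ('%' matches anything)
a = spark.createDataFrame(
    [(1000, 'Bob', '%'), (2000, 'Joe', 'London')],
    ['number', 'name', 'city'],
)

# Table b: the city column holds concrete values
b = spark.createDataFrame(
    [(1000, 'Bob', 'Boston'), (1000, 'Bob', 'Berlin'), (2000, 'Joe', 'Paris')],
    ['number', 'name', 'city'],
)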

The part I don't know how to do is implement the "%" wildcard with the LIKE operator. I know you can use .like() with a string literal, for example:

df.where(col('col1').like("%string%")).show()

But it expects a string, whereas in my case the pattern comes from a column. I'd like to do something like the following:

result = a.join(
    b,
    (a.name == b.name) &
    (a.number == b.number) &
    (a.city.like(b.city))  # <-- this doesn't work, since .like() expects a string, not a Column
)

Any help with this would be much appreciated!


Solution

  • Try using an expression:

    import pyspark.sql.functions as F
    
    result = a.alias('a').join(
        b.alias('b'),
        (a.name == b.name) &
        (a.number == b.number) &
        F.expr("b.city like a.city")
    )
    

    I think you meant to do b like a rather than a like b, because the % pattern is in table a.
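
    As a sanity check, running this join on the sample data from the question (this assumes the a, b, and spark objects from the sketch above) and selecting b's columns produces the expected rows:

    result.select('b.number', 'b.name', 'b.city').show()

    # +------+----+------+
    # |number|name|  city|
    # +------+----+------+
    # |  1000| Bob|Boston|
    # |  1000| Bob|Berlin|
    # +------+----+------+
    # (row order may vary)

    If you prefer, the same join can also be written in plain Spark SQL through temp views, which keeps the whole condition in SQL (the view names here are illustrative):

    a.createOrReplaceTempView('a')
    b.createOrReplaceTempView('b')
    result = spark.sql("""
        SELECT b.*
        FROM a
        JOIN b
          ON a.name = b.name
         AND a.number = b.number
         AND b.city LIKE a.city
    """)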