Search code examples
pythonpython-3.xdataframeapache-sparkpyspark

compare columns with list of string values in two different df and return ID which has the highest match between the lists


I have a pyspark dataframe 'jobs' like this:

    jobs=
id    position  keywords
5663123 A   ["Engineer","Quality"]
5662986 B   ['Java']
5663237 C   ['Art', 'Paint', 'Director']
5663066 D   ["Junior","Motion","Designer"]
5663039 E   ['Junior', 'Designer']
5663153 F   ["Client","Specialist"]
5663266 G   ['Pyhton']

And I have another dataframe named 'people' as:

people=

pid           skills
5662321 ["Engineer","L2"]
5663383 ["Quality","Engineer","L2"]
5662556 ["Art","Director"]
5662850 ["Junior","Motion","Designer"]
5662824 ['Designer', 'Craft', 'Junior']
5652496 ["Client","Support","Specialist"]
5662949 ["Community","Manager"]

I want to do is match the list values of people['skills'] with jobs['keywords']

If the match is more than 2 tokens, i.e. len(list(set(A)-set(B))) >=2 then return the ID of that particular job from jobs's table jobs['id'] in a new column in people['match'] in a list because there could be more than one matches, None otherwise.

The final people df should look like:

people=
pid           skills                         match
5662321 ["Engineer","L2"]                    None
5663383 ["Quality","Engineer","L2"]         [5663123]
5662556 ["Art","Director"]                  [5663237]
5662850 ["Junior","Motion","Designer"]  [5663066,5663039]
5662824 ['Designer', 'Craft', 'Junior'] [5663066,5663039]
5652496 ["Client","Support","Specialist"]   [5663153]
5662949 ["Community","Manager"]              None

I currently have a solution in place which is not efficient at all. Rightnow I iterate over spark dataframes row-wise which is taking a lot time for a large df.

I am open to pandas solutions as well.


Solution

  • This would work:

    people.join(jobs, F.expr("size(array_intersect(skills, keywords))>=2"), "left")\
    .groupBy("pid")\
    .agg(F.first("skills").alias("skills"), F.collect_list("id").alias("match"))\
    .withColumn("match", F.when(F.size("match")==0, None).otherwise(F.col("match")))\
    .show(truncate=False)
    

    The withColumn on line 4 is to just change empty lists to None as stated in the requirement.

    Let me know if you face any issue.

    Input:

    Jobs DF:

    +-------+--------+--------------------------+
    |id     |position|keywords                  |
    +-------+--------+--------------------------+
    |5663123|A       |[Engineer, Quality]       |
    |5662986|B       |[Java]                    |
    |5663237|C       |[Art, Paint, Director]    |
    |5663066|D       |[Junior, Motion, Designer]|
    |5663039|E       |[Junior, Designer]        |
    +-------+--------+--------------------------+
    

    People DF:

    +-------+--------------------------+
    |pid    |skills                    |
    +-------+--------------------------+
    |5662321|[Engineer, L2]            |
    |5663383|[Quality, Engineer, L2]   |
    |5662556|[Art, Director]           |
    |5662850|[Junior, Motion, Designer]|
    +-------+--------------------------+
    

    Output:

    +-------+--------------------------+------------------+
    |pid    |skills                    |match             |
    +-------+--------------------------+------------------+
    |5662321|[Engineer, L2]            |null              |
    |5663383|[Quality, Engineer, L2]   |[5663123]         |
    |5662556|[Art, Director]           |[5663237]         |
    |5662850|[Junior, Motion, Designer]|[5663066, 5663039]|
    +-------+--------------------------+------------------+