I have a pyspark dataframe 'jobs' like this:
jobs=
id position keywords
5663123 A ["Engineer","Quality"]
5662986 B ['Java']
5663237 C ['Art', 'Paint', 'Director']
5663066 D ["Junior","Motion","Designer"]
5663039 E ['Junior', 'Designer']
5663153 F ["Client","Specialist"]
5663266 G ['Pyhton']
And I have another dataframe named 'people' as:
people=
pid skills
5662321 ["Engineer","L2"]
5663383 ["Quality","Engineer","L2"]
5662556 ["Art","Director"]
5662850 ["Junior","Motion","Designer"]
5662824 ['Designer', 'Craft', 'Junior']
5652496 ["Client","Support","Specialist"]
5662949 ["Community","Manager"]
I want to do is match the list values of people['skills'] with jobs['keywords']
If the match is more than 2 tokens, i.e. len(list(set(A)-set(B))) >=2 then return the ID of that particular job from jobs's table jobs['id'] in a new column in people['match'] in a list because there could be more than one matches, None otherwise.
The final people df should look like:
people=
pid skills match
5662321 ["Engineer","L2"] None
5663383 ["Quality","Engineer","L2"] [5663123]
5662556 ["Art","Director"] [5663237]
5662850 ["Junior","Motion","Designer"] [5663066,5663039]
5662824 ['Designer', 'Craft', 'Junior'] [5663066,5663039]
5652496 ["Client","Support","Specialist"] [5663153]
5662949 ["Community","Manager"] None
I currently have a solution in place which is not efficient at all. Rightnow I iterate over spark dataframes row-wise which is taking a lot time for a large df.
I am open to pandas solutions as well.
This would work:
people.join(jobs, F.expr("size(array_intersect(skills, keywords))>=2"), "left")\
.groupBy("pid")\
.agg(F.first("skills").alias("skills"), F.collect_list("id").alias("match"))\
.withColumn("match", F.when(F.size("match")==0, None).otherwise(F.col("match")))\
.show(truncate=False)
The withColumn on line 4 is to just change empty lists to None as stated in the requirement.
Let me know if you face any issue.
Input:
Jobs DF:
+-------+--------+--------------------------+
|id |position|keywords |
+-------+--------+--------------------------+
|5663123|A |[Engineer, Quality] |
|5662986|B |[Java] |
|5663237|C |[Art, Paint, Director] |
|5663066|D |[Junior, Motion, Designer]|
|5663039|E |[Junior, Designer] |
+-------+--------+--------------------------+
People DF:
+-------+--------------------------+
|pid |skills |
+-------+--------------------------+
|5662321|[Engineer, L2] |
|5663383|[Quality, Engineer, L2] |
|5662556|[Art, Director] |
|5662850|[Junior, Motion, Designer]|
+-------+--------------------------+
Output:
+-------+--------------------------+------------------+
|pid |skills |match |
+-------+--------------------------+------------------+
|5662321|[Engineer, L2] |null |
|5663383|[Quality, Engineer, L2] |[5663123] |
|5662556|[Art, Director] |[5663237] |
|5662850|[Junior, Motion, Designer]|[5663066, 5663039]|
+-------+--------------------------+------------------+