I am working in Databricks and fairly new to PySpark. I have a table we pulled in from SQL with several columns whose values we've concatenated into a single column called edu_hist
which we then want to take and in a new column called edu_level
label it with the corresponding education level.
I am attempting to use .isin
to achieve this, but my code is just labeling everything UNKNOWN
. Where am I making a misstep?
original table (df
)
| id | edu_hist |
|-----|--------------|
| 123 | , , T, 3, 6 |
| 456 | , 7, X, 4, , |
| 789 | , 3, S, 2, , |
.isin
code to label
df = df.withColumn('edu_level',
F.when((F.col('edu_hist').isin('7', '8', '9', 'Z')), 'GRADUATE DEGREE')
.when((F.col('edu_hist').isin('5', '6', 'W', 'X', 'Y')), 'COLLEGE DEGREE')
.when((F.col('edu_hist').isin('3', '4', 'V')), 'SOME COLLEGE')
.when((F.col('edu_hist').isin('S', 'T', 'U')), 'HIGH SCHOOL OR EQUIVALENT')
.when((F.col('edu_hist').isin('0', '1', '2', 'N', 'O', 'P', 'Q', 'R')), 'OTHER').otherwise('UNKNOWN')
)
expected output
| id | edu_hist | edu_level |
|-----|--------------|-----------------|
| 123 | , , T, 3, 6 | COLLEGE DEGREE |
| 456 | , 7, X, 4, , | GRADUATE DEGREE |
| 789 | , 3, S, 2, , | SOME COLLEGE |
actual output
| id | edu_hist | edu_level |
|-----|--------------|-----------|
| 123 | , , T, 3, 6 | UNKNOWN |
| 456 | , 7, X, 4, , | UNKNOWN |
| 789 | , 3, S, 2, , | UNKNOWN |
It can be several factors, but the mains are:
1 - You're trying to check if a string is in a list o chars, Spark will check if the entire string is in the list, if only part of the string is on the list, will result in the otherwise().
Example:
Imagine you have this string my_str = "A B C D"
and you do something like that .when(F.col("my_str").isin("A", "B", "C", "D"), "True").otherwise("False")
. It is gonna result in "False", because the entire string is not contained on the list inside "isin", just parts. But if it was: .when(F.col("my_str").isin("A B C D"), "True").otherwise("False")
the result would be "True" because the entire string is in the list inside "isin".
Solution: You could regular expression to extract and check your data, code:
df = df_string.withColumn('edu_level',
F.when(F.col('edu_hist').rlike('(?i).*\\b(7|8|9|Z)\\b.*'), 'GRADUATE DEGREE')
.when(F.col('edu_hist').rlike('(?i).*\\b(5|6|W|X|Y)\\b.*'), 'COLLEGE DEGREE')
.when(F.col('edu_hist').rlike('(?i).*\\b(3|4|V)\\b.*'), 'SOME COLLEGE')
.when(F.col('edu_hist').rlike('(?i).*\\b(S|T|U)\\b.*'), 'HIGH SCHOOL OR EQUIVALENT')
.when(F.col('edu_hist').rlike('(?i).*\\b(0|1|2|N|O|P|Q|R)\\b.*'), 'OTHER')
.otherwise('UNKNOWN')
)
2 - Maybe your column is not StringType, maybe it is a struct or array of strings, you need to do something different.
Solution: 2.1 - If it is an array of strings, just use array_contains function, code:
map_results: dict = {
"GRADUATE DEGREE": (
F.array_contains(F.col('edu_hist'), '7') |
F.array_contains(F.col('edu_hist'), '8') |
F.array_contains(F.col('edu_hist'), '9') |
F.array_contains(F.col('edu_hist'), 'Z')
),
"COLLEGE DEGREE": (
F.array_contains(F.col('edu_hist'), '5') |
F.array_contains(F.col('edu_hist'), '6') |
F.array_contains(F.col('edu_hist'), 'W') |
F.array_contains(F.col('edu_hist'), 'X') |
F.array_contains(F.col('edu_hist'), 'Y')
),
"SOME COLLEGE": (
F.array_contains(F.col('edu_hist'), '3') |
F.array_contains(F.col('edu_hist'), '4') |
F.array_contains(F.col('edu_hist'), 'V')
),
"HIGH SCHOOL OR EQUIVALENT": (
F.array_contains(F.col('edu_hist'), 'S') |
F.array_contains(F.col('edu_hist'), 'T') |
F.array_contains(F.col('edu_hist'), 'U')
),
"OTHER": (
F.array_contains(F.col('edu_hist'), '0') |
F.array_contains(F.col('edu_hist'), '1') |
F.array_contains(F.col('edu_hist'), '2') |
F.array_contains(F.col('edu_hist'), 'N') |
F.array_contains(F.col('edu_hist'), 'O') |
F.array_contains(F.col('edu_hist'), 'P') |
F.array_contains(F.col('edu_hist'), 'Q') |
F.array_contains(F.col('edu_hist'), 'R')
)
}
df = df_array \
.withColumn(
'edu_level',
F.when(map_results.get("GRADUATE DEGREE"), "GRADUATE DEGREE")
.when(map_results.get("COLLEGE DEGREE"), "COLLEGE DEGREE")
.when(map_results.get("SOME COLLEGE"), "SOME COLLEGE")
.when(map_results.get("HIGH SCHOOL OR EQUIVALENT"), "HIGH SCHOOL OR EQUIVALENT")
.when(map_results.get("OTHER"), "OTHER")
.otherwise('UNKNOWN')
)
2.2 - If it is a struct, you just need to concatenate the struct field into one string and do the same thing as in the first solution, code:
df = df_struct.withColumn('edu_hist_concat',
F.concat_ws(',', F.col('edu_hist.part1'), F.col('edu_hist.part2'),
F.col('edu_hist.part3'), F.col('edu_hist.part4'), F.col('edu_hist.part5')))
df = df.withColumn('edu_level',
F.when(F.col('edu_hist_concat').rlike('(?i).*\\b(7|8|9|Z)\\b.*'), 'GRADUATE DEGREE')
.when(F.col('edu_hist_concat').rlike('(?i).*\\b(5|6|W|X|Y)\\b.*'), 'COLLEGE DEGREE')
.when(F.col('edu_hist_concat').rlike('(?i).*\\b(3|4|V)\\b.*'), 'SOME COLLEGE')
.when(F.col('edu_hist_concat').rlike('(?i).*\\b(S|T|U)\\b.*'), 'HIGH SCHOOL OR EQUIVALENT')
.when(F.col('edu_hist_concat').rlike('(?i).*\\b(0|1|2|N|O|P|Q|R)\\b.*'), 'OTHER')
.otherwise('UNKNOWN')
)
I checked all the code and it's working fine, try and tell me if it's running or not. If you like this answer, let me know too!
I used resources mainly: https://spark.apache.org/docs/latest/sql-getting-started.html and https://regex101.com/