Search code examples
pysparkapache-spark-sqldatabricks

Using .isin on columns to categorize data


I am working in Databricks and fairly new to PySpark. I have a table we pulled in from SQL with several columns whose values we've concatenated into a single column called edu_hist which we then want to take and in a new column called edu_level label it with the corresponding education level.

I am attempting to use .isin to achieve this, but my code is just labeling everything UNKNOWN. Where am I making a misstep?

original table (df)

| id  | edu_hist     |
|-----|--------------|
| 123 | , , T, 3, 6  |
| 456 | , 7, X, 4, , |
| 789 | , 3, S, 2, , |

.isin code to label

df = df.withColumn('edu_level', 
  F.when((F.col('edu_hist').isin('7', '8', '9', 'Z')), 'GRADUATE DEGREE') 
   .when((F.col('edu_hist').isin('5', '6', 'W', 'X', 'Y')), 'COLLEGE DEGREE') 
   .when((F.col('edu_hist').isin('3', '4', 'V')), 'SOME COLLEGE') 
   .when((F.col('edu_hist').isin('S', 'T', 'U')), 'HIGH SCHOOL OR EQUIVALENT') 
   .when((F.col('edu_hist').isin('0', '1', '2', 'N', 'O', 'P', 'Q', 'R')), 'OTHER').otherwise('UNKNOWN')
  )

expected output

| id  | edu_hist     | edu_level       |
|-----|--------------|-----------------|
| 123 | , , T, 3, 6  | COLLEGE DEGREE  |
| 456 | , 7, X, 4, , | GRADUATE DEGREE |
| 789 | , 3, S, 2, , | SOME COLLEGE    |

actual output

| id  | edu_hist     | edu_level |
|-----|--------------|-----------|
| 123 | , , T, 3, 6  | UNKNOWN   |
| 456 | , 7, X, 4, , | UNKNOWN   |
| 789 | , 3, S, 2, , | UNKNOWN   |

Solution

  • It can be several factors, but the mains are:

    1 - You're trying to check if a string is in a list o chars, Spark will check if the entire string is in the list, if only part of the string is on the list, will result in the otherwise().

    Example: Imagine you have this string my_str = "A B C D" and you do something like that .when(F.col("my_str").isin("A", "B", "C", "D"), "True").otherwise("False"). It is gonna result in "False", because the entire string is not contained on the list inside "isin", just parts. But if it was: .when(F.col("my_str").isin("A B C D"), "True").otherwise("False") the result would be "True" because the entire string is in the list inside "isin".

    Solution: You could regular expression to extract and check your data, code:

    df = df_string.withColumn('edu_level', 
        F.when(F.col('edu_hist').rlike('(?i).*\\b(7|8|9|Z)\\b.*'), 'GRADUATE DEGREE')
        .when(F.col('edu_hist').rlike('(?i).*\\b(5|6|W|X|Y)\\b.*'), 'COLLEGE DEGREE')
        .when(F.col('edu_hist').rlike('(?i).*\\b(3|4|V)\\b.*'), 'SOME COLLEGE')
        .when(F.col('edu_hist').rlike('(?i).*\\b(S|T|U)\\b.*'), 'HIGH SCHOOL OR EQUIVALENT')
        .when(F.col('edu_hist').rlike('(?i).*\\b(0|1|2|N|O|P|Q|R)\\b.*'), 'OTHER')
        .otherwise('UNKNOWN')
    )
    

    2 - Maybe your column is not StringType, maybe it is a struct or array of strings, you need to do something different.

    Solution: 2.1 - If it is an array of strings, just use array_contains function, code:

    map_results: dict = {
        "GRADUATE DEGREE": (
                F.array_contains(F.col('edu_hist'), '7') | 
                F.array_contains(F.col('edu_hist'), '8') | 
                F.array_contains(F.col('edu_hist'), '9') | 
                F.array_contains(F.col('edu_hist'), 'Z')
            ),
        "COLLEGE DEGREE": (
                F.array_contains(F.col('edu_hist'), '5') | 
                F.array_contains(F.col('edu_hist'), '6') | 
                F.array_contains(F.col('edu_hist'), 'W') | 
                F.array_contains(F.col('edu_hist'), 'X') | 
                F.array_contains(F.col('edu_hist'), 'Y')
            ),
        "SOME COLLEGE": (
                F.array_contains(F.col('edu_hist'), '3') | 
                F.array_contains(F.col('edu_hist'), '4') | 
                F.array_contains(F.col('edu_hist'), 'V')
            ),
        "HIGH SCHOOL OR EQUIVALENT": (
                F.array_contains(F.col('edu_hist'), 'S') | 
                F.array_contains(F.col('edu_hist'), 'T') | 
                F.array_contains(F.col('edu_hist'), 'U')
            ),
        "OTHER": (
                F.array_contains(F.col('edu_hist'), '0') | 
                F.array_contains(F.col('edu_hist'), '1') | 
                F.array_contains(F.col('edu_hist'), '2') | 
                F.array_contains(F.col('edu_hist'), 'N') | 
                F.array_contains(F.col('edu_hist'), 'O') | 
                F.array_contains(F.col('edu_hist'), 'P') | 
                F.array_contains(F.col('edu_hist'), 'Q') | 
                F.array_contains(F.col('edu_hist'), 'R')
            )
    }
    df = df_array \
        .withColumn(
            'edu_level', 
            F.when(map_results.get("GRADUATE DEGREE"), "GRADUATE DEGREE")
            .when(map_results.get("COLLEGE DEGREE"), "COLLEGE DEGREE")
            .when(map_results.get("SOME COLLEGE"), "SOME COLLEGE")
            .when(map_results.get("HIGH SCHOOL OR EQUIVALENT"), "HIGH SCHOOL OR EQUIVALENT")
            .when(map_results.get("OTHER"), "OTHER")
            .otherwise('UNKNOWN')
    )
    

    2.2 - If it is a struct, you just need to concatenate the struct field into one string and do the same thing as in the first solution, code:

    df = df_struct.withColumn('edu_hist_concat', 
                       F.concat_ws(',', F.col('edu_hist.part1'), F.col('edu_hist.part2'),
                                   F.col('edu_hist.part3'), F.col('edu_hist.part4'), F.col('edu_hist.part5')))
    
    df = df.withColumn('edu_level', 
                       F.when(F.col('edu_hist_concat').rlike('(?i).*\\b(7|8|9|Z)\\b.*'), 'GRADUATE DEGREE')
                        .when(F.col('edu_hist_concat').rlike('(?i).*\\b(5|6|W|X|Y)\\b.*'), 'COLLEGE DEGREE')
                        .when(F.col('edu_hist_concat').rlike('(?i).*\\b(3|4|V)\\b.*'), 'SOME COLLEGE')
                        .when(F.col('edu_hist_concat').rlike('(?i).*\\b(S|T|U)\\b.*'), 'HIGH SCHOOL OR EQUIVALENT')
                        .when(F.col('edu_hist_concat').rlike('(?i).*\\b(0|1|2|N|O|P|Q|R)\\b.*'), 'OTHER')
                        .otherwise('UNKNOWN')
    )
    

    I checked all the code and it's working fine, try and tell me if it's running or not. If you like this answer, let me know too!

    I used resources mainly: https://spark.apache.org/docs/latest/sql-getting-started.html and https://regex101.com/