Tags: java, apache-spark, apache-spark-sql, document-database, bigdata

Spark SQL: unable to query multiple possible values in an array


I have the data schema of a LinkedIn account as shown below. I need to query the skills field, which is an array, where the array may contain any of JAVA, java, Java, JAVA developer, or Java developer.

[LinkedIn data schema image]

Dataset<Row> sqlDF = spark.sql("SELECT * FROM people"
            + " WHERE ARRAY_CONTAINS(skills,'Java') "
            + " OR ARRAY_CONTAINS(skills,'JAVA')"
            + " OR ARRAY_CONTAINS(skills,'Java developer') "
            + "AND ARRAY_CONTAINS(experience['description'],'Java developer')"  );

The above query is what I have tried; please suggest a better way. Also, how can I make the query case-insensitive?


Solution

  • df.printSchema()
    
    root
     |-- skills: array (nullable = true)
     |    |-- element: string (containsNull = true)
    
    
    df.show()
    
    +--------------------+
    |              skills|
    +--------------------+
    |        [Java, java]|
    |[Java Developer, ...|
    |               [dev]|
    +--------------------+
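
    For reference, a hypothetical way to build an equivalent sample DataFrame; the second array is truncated in the listing above, so the "java dev" element here is an assumption:

    >>> # "java dev" below is assumed; the original listing is truncated
    >>> df = sqlContext.createDataFrame(
    ...     [(["Java", "java"],), (["Java Developer", "java dev"],), (["dev"],)], ["skills"])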
    

    Now let's register it as a temp table:

    >>> df.registerTempTable("t")
    
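    Note that on Spark 2.0 and later registerTempTable is deprecated; createOrReplaceTempView does the same thing:

    >>> df.createOrReplaceTempView("t")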

    Now we will explode the array, convert each element to lower case, and query it using the LIKE operator:

    >>> res = sqlContext.sql("select skills, lower(skill) as skill from (select skills, explode(skills) skill from t) a where lower(skill) like '%java%'")
    >>> res.show()
    +--------------------+--------------+
    |              skills|         skill|
    +--------------------+--------------+
    |        [Java, java]|          java|
    |        [Java, java]|          java|
    |[Java Developer, ...|java developer|
    |[Java Developer, ...|      java dev|
    +--------------------+--------------+
    

    Now you can do a distinct on the skills field, as sketched below.
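
    For example, a minimal sketch of that last step, reusing the res DataFrame from above:

    >>> res.select("skills").distinct().show()

    This keeps one row per distinct skills array, so a profile whose skills matched several variants of "java" shows up only once.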