Tags: java, apache-spark, apache-spark-sql, apache-spark-mllib

Applying a list of strings with the like operator


Problem statement: I need to pass a list of strings and apply the like operator using the filter function in Spark Java.

Reason: the like operator accepts only a single string, so the code has a serious performance problem when applied to a larger dataset. To overcome this, I need to pass the whole list of strings and apply the like operator in a single filter call.

So kindly let me know how to apply a list with like, as I need to find similar patterns of IDs.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    import static org.apache.spark.sql.functions.col;

    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
    SQLContext sqlContext = new SQLContext(sc);
    SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();

    List<Row> data = Arrays.asList(
        RowFactory.create("J40504", "CRC Industries"),
        RowFactory.create("K630-0746777", "Dixon value"),
        RowFactory.create("K444-4444", "3M INdustries"),
        RowFactory.create("4333444", "3M INdustries"),
        RowFactory.create("566-655", "3M INdustries"),
        RowFactory.create("4444888", "3M INdustries"),
        RowFactory.create("P477-7444", "3M INdustries"),
        RowFactory.create("566655", "Dixon coupling valve"));
    // In real time we have a large dataset

    StructType schema = new StructType(new StructField[] {
        new StructField("label1", DataTypes.StringType, false, Metadata.empty()),
        new StructField("sentence1", DataTypes.StringType, false, Metadata.empty()) });

    Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

    List<String> listStrings = new ArrayList<String>();
    listStrings.add("40504");
    listStrings.add("630-0746");
    listStrings.add("477-7444");
    listStrings.add("444-4444");
    // In real time we have a large list of strings to be compared with

    sentenceDataFrame.show();
    System.out.println("Array list :" + listStrings);

    // Filters the dataset once per pattern; each filter() call narrows the previous result
    for (int i = 0; i < listStrings.size(); i++) {
        sentenceDataFrame = sentenceDataFrame.filter(col("label1").like("%" + listStrings.get(i) + "%"));
    }
    sentenceDataFrame.show();

Solution

  • First Solution

    Instead of iterating over the dataset N times (where N is the size of listStrings), you can build a single filter expression and filter the dataset only once:

        StringBuilder expressionBuilder = new StringBuilder();
        String separator = "";

        // Builds one expression of the form:
        // " label1 LIKE '%40504%' OR  label1 LIKE '%630-0746%' OR ..."
        for (String s : listStrings) {
            expressionBuilder.append(separator + " label1 LIKE '%" + s + "%'");
            separator = " OR ";
        }

        String expression = expressionBuilder.toString();

        sentenceDataFrame = sentenceDataFrame.filter(expression);
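    The same OR-chain can also be built with the Column API instead of a SQL string, which avoids quoting problems if a pattern ever contains a single quote. This is only a sketch, reusing listStrings and sentenceDataFrame from above and assuming lit and col are statically imported from org.apache.spark.sql.functions and Column is org.apache.spark.sql.Column:

        // Sketch: start from a false predicate and OR in one LIKE per pattern,
        // so the dataset is still scanned and filtered only once.
        Column predicate = lit(false);
        for (String s : listStrings) {
            predicate = predicate.or(col("label1").like("%" + s + "%"));
        }
        sentenceDataFrame = sentenceDataFrame.filter(predicate);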
    

  • Second Solution

    We can load the listStrings into a Dataset:

        StructType schemaList = new StructType(new StructField[] {
            new StructField("labelToFind", DataTypes.StringType, false, Metadata.empty()) });

        // Named listRows here so it does not clash with the existing List<String> listStrings
        List<Row> listRows = Arrays.asList(
            RowFactory.create("40504"),
            RowFactory.create("630-0746"),
            RowFactory.create("477-7444"),
            RowFactory.create("444-4444"));

        Dataset<Row> listDataset = sqlContext.createDataFrame(listRows, schemaList);

    and then we can join the two datasets in order to filter the rows:

        sentenceDataFrame = sentenceDataFrame
            .join(listDataset, sentenceDataFrame.col("label1").contains(listDataset.col("labelToFind")))
            .select("label1", "sentence1");
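    Since the join condition is contains rather than an equality, Spark cannot use a regular hash join here. If the list of labels is small, explicitly broadcasting it is one way to give every executor a local copy so the large dataset does not have to be shuffled. This is just a sketch, assuming a static import of broadcast from org.apache.spark.sql.functions:

        // Sketch: broadcast the small lookup dataset; the non-equi join then runs
        // as a broadcast nested-loop join instead of shuffling sentenceDataFrame.
        sentenceDataFrame = sentenceDataFrame
            .join(broadcast(listDataset),
                  sentenceDataFrame.col("label1").contains(listDataset.col("labelToFind")))
            .select("label1", "sentence1");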