apache-spark, apache-spark-sql

How can you update values in a dataset?


So as far as I know, Apache Spark doesn't have functionality that imitates the SQL UPDATE command, i.e. changing a single value in a column given a certain condition. The only way around that is the following command I was instructed to use (here on Stack Overflow): withColumn(columnName, where('condition', value));

However, the condition has to be of Column type, meaning I have to use the built-in column filtering functions Spark provides (equalTo, isin, lt, gt, etc.). Is there a way I can use an SQL statement instead of those built-in functions?
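
For reference, the kind of condition I know how to write with those built-in functions looks like this (just a sketch, the dataset and column names are made up):

    import static org.apache.spark.sql.functions.*;

    // Assumes `dataset` is an existing Dataset<Row> that already has AGE and label columns
    // Condition built with the Column API: AGE != 50
    Dataset<Row> relabeled = dataset.withColumn(
            "label",
            when(col("AGE").notEqual(50), lit("updated")).otherwise(col("label")));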

The problem is I'm given a text file with SQL conditions, like WHERE ID > 5 or WHERE AGE != 50, and I have to label values based on those conditions. I thought of following the withColumn() approach, but I can't plug an SQL statement into that function. Any idea how I can work around this?


Solution

  • I found a way to work around this:

    You want to split your dataset into two sets: the values you want to update and the values you don't want to update.

    Dataset<Row> valuesToUpdate = dataset.filter("conditionToFilterValues");
    Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);
    
    valuesToUpdate = valuesToUpdate.withColumn("updatedColumn", lit("updateValue"));
    
    Dataset<Row> updatedDataset = valuesNotToUpdate.union(valuesToUpdate);
    
    

    This, however, doesn't keep the same order of records as the original dataset, so if order matters to you, this won't meet your needs (see the sketch at the end of this answer for one way to keep it).

    In PySpark you have to use .subtract() instead of .except().
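
    As a side note on the original question about plugging in SQL text: as far as I know, Dataset.filter() also accepts an SQL expression string, and functions.expr() turns such a string into a Column, so the conditions from the text file can be reused once the leading WHERE is stripped. A rough sketch (the file reading is omitted and the column names are made up):

    import static org.apache.spark.sql.functions.*;

    // e.g. "WHERE ID > 5" read from the text file, with the leading keyword removed
    String condition = "ID > 5";

    // filter() accepts the SQL expression string directly...
    Dataset<Row> valuesToUpdate = dataset.filter(condition);

    // ...and expr() turns the same string into a Column, usable inside when()
    Dataset<Row> labeled = dataset.withColumn(
            "label",
            when(expr(condition), lit("updateValue")).otherwise(col("label")));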
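
    If the original order does matter, one possible workaround (a sketch, not part of the original answer) is to tag each row with a synthetic index before splitting, then sort on it after the union and drop it:

    import static org.apache.spark.sql.functions.*;

    // Assumes the current ordering of `dataset` is the one you want to restore,
    // and that "updatedColumn" already exists so both halves keep the same schema
    Dataset<Row> indexed = dataset.withColumn("_row_id", monotonically_increasing_id());

    Dataset<Row> toUpdate = indexed.filter("conditionToFilterValues");
    Dataset<Row> untouched = indexed.except(toUpdate);

    toUpdate = toUpdate.withColumn("updatedColumn", lit("updateValue"));

    Dataset<Row> restored = untouched.union(toUpdate)
            .orderBy("_row_id")
            .drop("_row_id");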