apache-spark, hadoop, batch-processing, levenshtein-distance

Levenshtein distance algorithm on Spark


I'm getting started with the Hadoop ecosystem, and I have some questions I need help with.

I have two HDFS files and need to compute the Levenshtein distance between a group of columns from the first one and another group of columns from the second one.

This process will run every day on a considerable amount of data (150M rows in the first file vs. 11M rows in the second one).

I would appreciate some guidance (code examples, references, etc.) on how to read the two files from HDFS, compute the Levenshtein distance (using Spark?) as described, and save the results to a third HDFS file.

Thank you very much in advance.


Solution

  • I guess you have CSV files, so you can read them directly into a DataFrame:

    val df1 = spark.read.option("header","true").csv("hdfs:///pathtoyourfile_1")
    

    The org.apache.spark.sql.functions object contains a levenshtein(l: Column, r: Column): Column function, so you need to pass it DataFrame columns of String type. If you want to compare a group of columns, you can use the concat('col1, 'col2, ...) function to concatenate multiple columns and pass the result to levenshtein. If you have two or more DataFrames, you have to join them into one DataFrame first and then perform the distance calculation. Finally, you can save your results to CSV using df.write.csv("path"). A full sketch is shown below.
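
    Putting those steps together, here is a minimal end-to-end sketch. The paths, the "id" join key, and the column names (col1, col2, colA, colB) are placeholder assumptions; adjust them to your actual schema and join condition:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, concat, levenshtein}

    val spark = SparkSession.builder().appName("levenshtein-batch").getOrCreate()

    // Read both CSV files from HDFS (assuming headers are present).
    val df1 = spark.read.option("header", "true").csv("hdfs:///pathtoyourfile_1")
    val df2 = spark.read.option("header", "true").csv("hdfs:///pathtoyourfile_2")

    // Concatenate each side's group of columns into a single string column.
    // Column names here are hypothetical placeholders.
    val left  = df1.select(col("id"), concat(col("col1"), col("col2")).as("left_str"))
    val right = df2.select(col("id"), concat(col("colA"), col("colB")).as("right_str"))

    // Join the two DataFrames into one (here on an assumed shared "id" key),
    // then compute the Levenshtein distance between the concatenated columns.
    val result = left
      .join(right, Seq("id"))
      .withColumn("distance", levenshtein(col("left_str"), col("right_str")))

    // Save the results to a third HDFS location as CSV.
    result.write.option("header", "true").csv("hdfs:///path_to_result")

    Note that if there is no natural join key and every row of the first file has to be compared with every row of the second, the join becomes left.crossJoin(right), which at 150M x 11M rows produces an enormous number of pairs; in that case you would normally restrict the candidate pairs with some blocking key before computing the distance.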