Search code examples
javahadoopstring-matchingboyer-moore

How to implement string matching algorithm with Hadoop?


I want to implement a string matching(Boyer-Moore) algorithm using Hadoop. I just started using Hadoop so I have no idea how to write a Hadoop program in Java.

All the sample programs that I have seen so far are word counting examples and I couldn't find any sample programs for string matching.

I tried searching for some tutorials that teaches how to write Hadoop applications using Java but couldn't find any. Can you suggest me some tutorials where I can learn how to write Hadoop applications using Java.

Thanks in advance.


Solution

  • I haven't tested the below code, But this should get you started. I have used the BoyerMoore implementation available here

    What the below code is doing:

    The goal is to search for a pattern in an input document. The BoyerMoore class is initialized in the setup method using the pattern set in the configuration.

    The mapper receives each line at a time and it uses the BoyerMoore instance to find the pattern. If match is found, the we write it using context.

    There is no need of a reducer here. If the pattern is found multiple times in different mapper then the output will have multiple offsets(1 per mapper).

    package hadoop.boyermoore;
    
    import java.io.IOException;
    import java.util.StringTokenizer;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class BoyerMooreImpl {
    
    
          public static class TokenizerMapper
               extends Mapper<Object, Text, Text, IntWritable>{
            private BoyerMoore boyerMoore;
            private static IntWritable offset;
            private Text offsetFound = new Text("offset");
    
            public void map(Object key, Text value, Context context
                            ) throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  String line = itr.nextToken();
                  int offset1 = boyerMoore.search(line);
                  if (line.length() != offset1) {
                      offset = new IntWritable(offset1);
                      context.write(offsetFound,offset);
                  }
              }
            }
            @Override
            public final void setup(Context context) {
                if (boyerMoore == null)
                    boyerMoore = new BoyerMoore(context.getConfiguration().get("pattern"));
            }
          }
    
    
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("pattern","your_pattern_here");
            Job job = Job.getInstance(conf, "BoyerMoore");
            job.setJarByClass(BoyerMooreImpl.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
    }