
How to handle redundant cases in regex?


I have to parse the data in a file into good and bad records. Each record should have the format:

Patient_id::Patient_name (year of birth)::disease

The diseases are pipe separated and are selected from the following:

1.HIV
2.Cancer
3.Flu
4.Arthritis 
5.OCD

Example: 23::Alex.jr (1969)::HIV|Cancer|flu

The regular expression I have written is:

\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(HIV|Cancer|flu|Arthritis|OCD)(\|(HIV|Cancer|flu|Arthritis|OCD))*

But it also accepts records with duplicate entries, such as:

24::Robin (1980)::HIV|Cancer|Cancer|HIV

How can I handle these kinds of records, and how can I write a better expression if the list of diseases is very large?

Note: I am using a Hadoop map-only job for parsing, so please answer in the context of Java.


Solution

  • What you might do is capture the last part, with all the diseases, in one group (a named capturing group called disease), then use split to get the individual entries and make the list unique.

    ^\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$

    For example:

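    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String regex = "^\\d*::[a-zA-Z]+[^\\(]*\\(\\d{4}\\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$";
    String string = "24::Robin (1980)::HIV|Cancer|Cancer|HIV";

    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(string);

    if (matcher.find()) {
        // Split the captured disease list on the literal pipe character
        String[] parts = matcher.group("disease").split("\\|");
        // A HashSet drops duplicate entries (its iteration order is not guaranteed)
        Set<String> uniqueDiseases = new HashSet<>(Arrays.asList(parts));
        System.out.println(uniqueDiseases);
    }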
    

    Result:

    [HIV, Cancer]
    

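If the goal is to mark a record with repeated diseases as bad rather than to de-duplicate it, and the disease list is large, one possible variation is to let the regex check only the shape of the record and validate each name against a set. This is a minimal sketch, not a definitive implementation: the class name RecordValidator, the ALLOWED set and the isGoodRecord method are hypothetical names introduced for illustration, and the set is assumed to be small enough to hold in memory (in a real job it might be loaded from a file).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordValidator {
    // Hypothetical allow-list; assumed here to be hard-coded for the example.
    static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("HIV", "Cancer", "flu", "Arthritis", "OCD"));

    // Checks only the record's shape: id::name (yyyy)::pipe-separated non-empty tokens.
    static final Pattern RECORD = Pattern.compile(
            "^\\d*::[a-zA-Z]+[^\\(]*\\(\\d{4}\\)::(?<disease>[^|]+(?:\\|[^|]+)*)$");

    // Returns true only if the line has the expected shape, every disease
    // is in ALLOWED, and no disease appears more than once.
    static boolean isGoodRecord(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.matches()) {
            return false;
        }
        Set<String> seen = new HashSet<>();
        for (String disease : m.group("disease").split("\\|")) {
            // Set.add returns false when the element is already present.
            if (!ALLOWED.contains(disease) || !seen.add(disease)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isGoodRecord("23::Alex.jr (1969)::HIV|Cancer|flu"));      // true
        System.out.println(isGoodRecord("24::Robin (1980)::HIV|Cancer|Cancer|HIV")); // false
    }
}
```

With this split of responsibilities the pattern does not grow with the disease list, and a map-only job could call a check like isGoodRecord on each input line to decide where the record goes.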