I have to parse file data into good and bad records. The data should be in the following format:
Patient_id::Patient_name (year of birth)::disease
The diseases are pipe separated and are selected from the following:
1.HIV
2.Cancer
3.Flu
4.Arthritis
5.OCD
Example: 23::Alex.jr (1969)::HIV|Cancer|flu
The regex I have written is:
\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(HIV|Cancer|flu|Arthritis|OCD)(\|(HIV|Cancer|flu|Arthritis|OCD))*
But it also accepts records with duplicate entries, for example:
24::Robin (1980)::HIV|Cancer|Cancer|HIV
How can I handle these kinds of records, and how can I write a better expression if the list of diseases is very large?
Note: I am using a Hadoop map-only job for parsing, so please answer in the context of Java.
What you might do is capture the last part with all the diseases in one group (the named capturing group disease), then use split to get the individual ones and make the list unique.
^\d*::[a-zA-Z]+[^\(]*\(\d{4}\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$
For example:
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = "^\\d*::[a-zA-Z]+[^\\(]*\\(\\d{4}\\)::(?<disease>(?:HIV|Cancer|flu|Arthritis|OCD)(?:\\|(?:HIV|Cancer|flu|Arthritis|OCD))*)$";
String string = "24::Robin (1980)::HIV|Cancer|Cancer|HIV";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
    // Split the captured disease list on the literal pipe and drop duplicates.
    String[] parts = matcher.group("disease").split("\\|");
    Set<String> uniqueDiseases = new HashSet<String>(Arrays.asList(parts));
    System.out.println(uniqueDiseases);
}
Result:
[HIV, Cancer]
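If a record with repeated diseases should count as bad rather than be silently de-duplicated, and the disease list is too long to spell out inside the pattern, one option is to let the regex check only the record layout and validate the individual names against a Set. The following is a minimal sketch under those assumptions: `RecordValidator` and `isGoodRecord` are hypothetical names, and the hard-coded set stands in for wherever the real list would be loaded from.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class RecordValidator {

    // Allowed disease names, spelled as in the original regex.
    // Assumption: a very large list would be loaded from a file or
    // configuration instead of being hard-coded here.
    private static final Set<String> ALLOWED = new HashSet<>(
            Arrays.asList("HIV", "Cancer", "flu", "Arthritis", "OCD"));

    // Checks only the layout id::name (year)::a|b|c. The disease field is
    // captured as pipe-separated non-empty tokens without naming them.
    private static final Pattern LAYOUT = Pattern.compile(
            "^\\d*::[a-zA-Z]+[^(]*\\(\\d{4}\\)::(?<disease>[^|]+(?:\\|[^|]+)*)$");

    static boolean isGoodRecord(String line) {
        Matcher m = LAYOUT.matcher(line);
        if (!m.matches()) {
            return false; // layout is wrong
        }
        Set<String> seen = new HashSet<>();
        for (String disease : m.group("disease").split("\\|")) {
            // Set.add returns false if the value was already present,
            // so a repeated disease makes the record bad.
            if (!ALLOWED.contains(disease) || !seen.add(disease)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isGoodRecord("23::Alex.jr (1969)::HIV|Cancer|flu"));      // true
        System.out.println(isGoodRecord("24::Robin (1980)::HIV|Cancer|Cancer|HIV")); // false
    }
}
```

With this split, the pattern stays the same size however many diseases there are, and each name costs one hash lookup. In a map-only job, the mapper could call such a method once per input line and route the line to the good or bad output accordingly.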