I am using Pig Latin to process a large XML dump, and I am trying to get the value of an XML node. The file looks like this:
<username>Shujaat</username>
I want to get the value Shujaat. I tried the piggybank XMLLoader, but it returns the XML tags along with their values. The code is:
register piggybank.jar;
A = load 'username.xml' using org.apache.pig.piggybank.storage.XMLLoader('username')
as (x: chararray);
B = foreach A generate x;
This code gives me the username tags as well as the values, but I only need the values. Any idea how to do that? I found that regular expressions might help, but I don't know much about them. Thanks.
The example element you gave can be extracted with:
B = foreach A GENERATE REGEX_EXTRACT(x, '<username>(.*)</username>', 1)
AS name:chararray;
A nested element like this:
<user>
<id>456</id>
<username>Taylor</username>
</user>
can be extracted with something like this:
B = foreach A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
'<user>\\n\\s*<id>(.*)</id>\\n\\s*<username>(.*)</username>\\n\\s*</user>'))
as (id: int, name:chararray);
(456,Taylor)
You will almost certainly need a more sophisticated regex to suit all of your needs, e.g. one that handles empty elements, attributes, etc. Another option is to write a custom UDF that uses Java libraries to parse the XML content, so that you can avoid writing (over)complicated, error-prone regular expressions.
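To give an idea of the UDF route, here is a minimal sketch of the parsing logic such a UDF could wrap. It uses only the JDK's built-in DOM parser. The class name XmlValueExtractor and the method extract are hypothetical names chosen for illustration, and the Pig-specific wrapper (a class extending org.apache.pig.EvalFunc<String> whose exec method calls this helper) is left out so the example runs without the Pig jar.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: not part of Pig or piggybank.
final class XmlValueExtractor {

    // Returns the trimmed text of the first element named tagName inside
    // xmlFragment, or null if no such element exists.
    static String extract(String xmlFragment, String tagName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                    builder.parse(new InputSource(new StringReader(xmlFragment)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            if (nodes.getLength() == 0) {
                return null;
            }
            // getTextContent() ignores attributes and yields "" for empty elements.
            return nodes.item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed input is reported as an unchecked exception here; a real
            // UDF might prefer to return null and log a warning instead.
            throw new IllegalArgumentException("Could not parse XML fragment", e);
        }
    }

    public static void main(String[] args) {
        String record = "<user>\n  <id>456</id>\n  <username>Taylor</username>\n</user>";
        System.out.println(extract(record, "id"));
        System.out.println(extract(record, "username"));
    }
}
```

Each record produced by XMLLoader is a complete element, so it can be passed to extract as is. Because a real parser does the work, attributes, empty elements and varying whitespace do not need special handling in a regex.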