Possible to use HCatalog with XML? -- Doing ETL on Cloudera VM

I'm working on a project for a big data class, and I've installed the Cloudera Quickstart VM locally to run some basic tasks on my dataset and get familiar with some of the tools. I was following a tutorial which involved moving the dataset into HDFS, creating an HCatalog table based on the dataset file, then running Hive and/or Pig commands on the table. The problem is that my data is a large XML file, and the standard delimiter options in HCatalog do not apply.

Is there a way to import XML into HCatalog? If not, what is the best way to use Hive or Pig on my XML dataset?

EDIT: My file is from the public StackOverflow dataset. I am using the posts.xml file. It's quite large (25GB) and I'm having trouble opening it on my machine, but below is the structure according to the Readme file:

- **posts**.xml
   - Id
   - PostTypeId
      - 1: Question
      - 2: Answer
   - ParentID (only present if PostTypeId is 2)
   - AcceptedAnswerId (only present if PostTypeId is 1)
   - CreationDate
   - Score
   - ViewCount
   - Body
   - OwnerUserId
   - LastEditorUserId
   - LastEditorDisplayName="Jeff Atwood"
   - LastEditDate="2009-03-05T22:28:34.823"
   - LastActivityDate="2009-03-11T12:51:01.480"
   - CommunityOwnedDate="2009-03-11T12:51:01.480"
   - ClosedDate="2009-03-11T12:51:01.480"
   - Title=
   - Tags=
   - AnswerCount
   - CommentCount
   - FavoriteCount

Will the sheer size of this file be a problem in the VM? In the end we will be repeating some of these ETL tasks in AWS, but for now I am trying to avoid racking up a large bill without knowing how to properly use some of the tools.

Solution

XML uses a fairly standardized structure, so I would be interested in seeing your data format and what delimiter isn't working.

Without knowing more about the data/structure, etc ... this is probably what I would do:

Decide on my schema and create the HCatalog manually (or scripted, whichever is easiest).
Load the data via pig, using the piggybank XMLLoader.
Parse the data using regex into the schema that I decided upon for the HCat
Store it using the HCatStore method.

--Example code

REGISTER piggybank.jar

items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS  (item:chararray);

data = FOREACH items GENERATE 
REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS  link:chararray, 
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS  title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>',  1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS  pubdate:chararray;

STORE data into 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();


validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
dump validate;

-- Results

(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)

-- rss.txt data file

<rss version="2.0">
   <channel>
      <title>News</title>
      <link>http://www.hannonhill.com</link>
      <description>Hannon Hill News</description>
      <language>en-us</language>
      <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
      <generator>Cascade Server</generator>
      <webMaster>webmaster@hannonhill.com</webMaster>
      <item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>
      <item>
         <title>News Item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>Description of news item 2 here.</description>
         <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item2.html</guid>
      </item>
      <item>
         <title>News Item 3</title>
         <link>http://www.hannonhill.com/news/item3.html</link>
         <description>Description of news item 3 here.</description>
         <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item3.html</guid>
      </item>
   </channel>
</rss>