I'm working on a project for a big data class, and I've installed the Cloudera Quickstart VM locally to run some basic tasks on my dataset and get familiar with some of the tools. I was following a tutorial which involved moving the dataset into HDFS, creating an HCatalog table based on the dataset file, then running Hive and/or Pig commands on the table. The problem is that my data is a large XML file, and the standard delimiter options in HCatalog do not apply.
Is there a way to import XML into HCatalog? If not, what is the best way to use Hive or Pig on my XML dataset?
EDIT: My file is from the public StackOverflow dataset. I am using the posts.xml
file. It's quite large (25GB) and I'm having trouble opening it on my machine, but below is the structure according to the Readme file:
- **posts**.xml
- Id
- PostTypeId
- 1: Question
- 2: Answer
- ParentID (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- CreationDate
- Score
- ViewCount
- Body
- OwnerUserId
- LastEditorUserId
- LastEditorDisplayName="Jeff Atwood"
- LastEditDate="2009-03-05T22:28:34.823"
- LastActivityDate="2009-03-11T12:51:01.480"
- CommunityOwnedDate="2009-03-11T12:51:01.480"
- ClosedDate="2009-03-11T12:51:01.480"
- Title=
- Tags=
- AnswerCount
- CommentCount
- FavoriteCount
Will the sheer size of this file be a problem in the VM? In the end we will be repeating some of these ETL tasks in AWS, but for now I am trying to avoid racking up a large bill without knowing how to properly use some of the tools.
XML uses a fairly standardized structure, so I would be interested in seeing your data format and what delimiter isn't working.
Without knowing more about the data/structure, etc ... this is probably what I would do:
--Example code
REGISTER piggybank.jar
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
data = FOREACH items GENERATE
REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;
STORE data into 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();
validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
dump validate;
-- Results
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
-- rss.txt data file
<rss version="2.0">
<channel>
<title>News</title>
<link>http://www.hannonhill.com</link>
<description>Hannon Hill News</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<generator>Cascade Server</generator>
<webMaster>webmaster@hannonhill.com</webMaster>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>
<item>
<title>News Item 3</title>
<link>http://www.hannonhill.com/news/item3.html</link>
<description>Description of news item 3 here.</description>
<pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item3.html</guid>
</item>
</channel>
</rss>