Removing everything but HTML tags from a corpus

I'm using the tm package. I have a corpus full of HTML documents and I would like to remove everything but the HTML tags. I've been trying to do that for a few days, but I haven't been able to find a good solution.

For example, let's say I have a document like this:

<html>
<body>

<h1>hello</h1>

</body>
</html>

I would want the document to become like this:

<html> <body> <h1>

(Or with the closing tags, I don't really mind.)

My goal is to count how many times each tag is used in a document.

Solution

I'm not familiar with tm, but here's how you could do it using Regular Expressions.

(Assumption: your string starts and ends with an HTML tag; text before the first tag or after the last one would not be removed.)

html <- "<html><body><p>test<p>test2</body></html>"
# Replace the text between tags with a single space, leaving only the tags (opening and closing).
html <- gsub(">[^<>]+<", "> <", html)
# Remove all closing tags.
html <- gsub("</[^<>]+>", "", html)

For the example above, that leaves you with "<html><body><p> <p> ", i.e. only the opening tags.
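Since the question is about a tm corpus, here is a minimal sketch of how the two substitutions might be applied to every document. It assumes the corpus holds plain-text documents and relies on tm's `tm_map` and `content_transformer`; the helper name `keep_tags` and the one-document corpus are made up for illustration, and the tm part only runs if the package is installed.

```r
# Hypothetical helper: the two substitutions wrapped in a function.
keep_tags <- function(x) {
  # A document may be stored as several lines; join them into one string.
  x <- paste(x, collapse = "\n")
  # Replace the text between tags with a single space.
  x <- gsub(">[^<>]+<", "> <", x)
  # Drop the closing tags.
  gsub("</[^<>]+>", "", x)
}

# Apply the helper to each document of a (toy) corpus, if tm is available.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource("<html><body><h1>hello</h1></body></html>"))
  corpus <- tm::tm_map(corpus, tm::content_transformer(keep_tags))
  print(as.character(corpus[[1]]))
}
```

Wrapping the function in `content_transformer` should keep each document's metadata while only its text is rewritten.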

If you're new to RegEx, check out this site for additional info on getting started. Basically, the first gsub replaces every run of non-bracket characters sitting between a > and the next < (i.e. all non-tag text) with a single space. The second gsub replaces everything that starts with </ and ends with > with nothing, removing the closing tags from the string.
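Since the stated goal is to count how often each tag is used, it may be simpler to extract the opening tags directly and tabulate them. The sketch below uses only base R (`gregexpr`, `regmatches`, `table`); the function name `count_tags` is hypothetical. It assumes tag names start with a letter, so comments and `<!DOCTYPE ...>` are skipped, and it is not a real HTML parser.

```r
# Hypothetical helper: count how often each opening tag occurs in a document.
count_tags <- function(html) {
  # Join multi-line input into a single string.
  html <- paste(html, collapse = "\n")
  # Extract every opening tag, with or without attributes, e.g. "<h1>".
  tags <- regmatches(html, gregexpr("<[A-Za-z][^<>]*>", html))[[1]]
  # Keep only the tag name, dropping the brackets and any attributes.
  tag_names <- tolower(sub("^<([A-Za-z][A-Za-z0-9]*)[^>]*>$", "\\1", tags))
  table(tag_names)
}

counts <- count_tags("<html><body><p>test<p>test2</body></html>")
print(counts)
```

Here `counts` should report `p` twice and `html` and `body` once each. For messy real-world pages, a proper HTML parser would be more robust than regular expressions.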