How to extract text tagged by Stanford NER into csv?


I don't have a background in NLP or much programming, but have drifted in as I am doing research on the history of newspaper publishing. I'm wrestling with 10k+ pages of plain text that I'm having difficulty sculpting into structured data to perform more complex analyses.

I've been able to run Stanford NER on a large text file and successfully tag many of the entities I want to examine. So here is my naive question: how can I extract or parse the tagged text file into a CSV file, or at least into separate lists for each category?

For example, I'm looking at something like this:

The <ORGANIZATION>Committee on Education</ORGANIZATION> and the <ORGANIZATION>Philadelphia Assocation of Teachers</ORGANIZATION> offer a plan for the organization of the school in the town of <LOCATION>Erie</LOCATION>, <LOCATION>Pennsylvania</LOCATION> as it will be run by the honorable <PERSON>Williamson</PERSON> and <PERSON>Thompson</PERSON>

After looking through the vaguely similar answers to other questions on this site, I've tried using some kind of regular expression or even sed, like below, but without success.

sed -e '/^location/,/^/location/p' nertagged.txt

I've considered other options like BeautifulSoup or an XML parser (since the Stanford NER implementation can output XML), but I wonder if that isn't overkill since I'm dealing with a very limited number of tags -- basically just Person, Location, Organization. Are those my best options? What, in my ignorance, am I missing?

Many thanks.


Solution

  • Agree. This isn't actually as easy to do as would be desirable, and I'll add an option to make it easier for the next version :). But, if you use -outputFormat inlineXML as in your example, then the following Perl one-liner will do the trick, run on the output file, which I've called inlineXML.out.

    perl -ne 'while (s/<([^>]+)>([^<]*)<[^>]+>//) { print "$2\t$1\n"; }' inlineXML.out
    

    Note that this puts a tab between the columns, not a comma. Most spreadsheets will read that fine. If you really want a comma, you can replace the \t above with a comma (,), but you may well run into problems if some of the entities themselves contain commas, such as perhaps University of California , Davis.
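If Python is more convenient than Perl, the same extraction can be sketched roughly as below. This is only an illustration, not part of Stanford NER: the function names (extract_entities, group_by_category, write_csv) and the column headers are made up for the example, and it assumes the inlineXML output contains only simple, non-nested tags like the ones in the question. The csv module quotes any field that contains a comma, which sidesteps the problem described above, and the grouping helper gives the separate per-category lists the question asks about.

```python
import csv
import io
import re
from collections import defaultdict

# Matches <TAG>text</TAG>. The backreference \1 requires the closing tag
# to carry the same name as the opening tag. Assumes tags are not nested.
ENTITY_RE = re.compile(r"<([A-Za-z_]+)>([^<]*)</\1>")


def extract_entities(text):
    """Return (entity, category) pairs in the order they appear in text."""
    return [(m.group(2).strip(), m.group(1)) for m in ENTITY_RE.finditer(text)]


def group_by_category(pairs):
    """Collect entities into one list per category, e.g. PERSON, LOCATION."""
    groups = defaultdict(list)
    for entity, category in pairs:
        groups[category].append(entity)
    return dict(groups)


def write_csv(pairs, fileobj):
    """Write the pairs as CSV rows; fields containing commas get quoted."""
    writer = csv.writer(fileobj)
    writer.writerow(["entity", "category"])  # hypothetical header names
    writer.writerows(pairs)


# Small in-memory demonstration based on the sentence in the question.
sample = (
    "The <ORGANIZATION>Committee on Education</ORGANIZATION> offers a plan "
    "for the school in <LOCATION>Erie</LOCATION>, "
    "<LOCATION>Pennsylvania</LOCATION> as run by <PERSON>Williamson</PERSON>"
)
pairs = extract_entities(sample)
print(group_by_category(pairs))

buf = io.StringIO()
write_csv(pairs, buf)
print(buf.getvalue())
```

To run this on the real output, read the file's contents into extract_entities and pass write_csv a file opened with open("entities.csv", "w", newline="", encoding="utf-8"); the output file name here is just a placeholder.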