Search code examples
regextextnotepad++text-miningexcel-2013

Notepad++ moving tagged text strings to excel


So for a side hobby I'm doing some basic meta data gathering using text mining on the Project Gutenberg version of Herodotus but I'm stuck at the point of transferring the tagged text strings into excel. Essentially what I'm trying to do is create is a master list of all People, Places and Groups/Organizations mentioned in Herodotus and how many times each is mentioned in the text. I want to then use this list to populate some data visualizations in Tableau and/or Powerview, I have both.

I've already run the text through the Stanford NER which did a good job of at least identifying nearly all Persons, Organizations and Locations. I then manually checked over the document in notepadd++ to fix the numerous errors the NER made when analyzing ancient Greek names and places. I also removed the footnotes from the text because I don't care about them, only the original text. If you download the attached .txt you'll see that each proper noun is marked /PERSON, /LOCATION or /ORGANIZATION.

Now where I'm stuck is trying to get the tagged text strings into excel so I can use the data. A simple ctr+f reveals that in just book1 there are like 880 /PERSON tagged words. Essentially what I'm trying to do is grab each and every string that precedes one of the /PERSON, /LOCATION, or /ORGANIZATION and copy them into excel.

I looked into Regex expressions for notepad++ to see if I could select all text strings where the string ends in /PERSON but I cannot seem to figure it out. I can get the regex to select all "/PERSON" but I don't understand regex well enough to get it to select all "name/PERSON" or "place/LOCATION" strings in their entirety if that makes sense.

EDIT: I forgot to ask about using SQL or Python to help me solve this problem. From my work I'm familiar with using SQL queries on databases. So this is a stupid question but can you even use SQL to directly query a .txt file? If so then I could pretty easily write a SQL statement to extract the tagged text strings.

I'm less familiar with Python but is it possible to extract the info I'm looking for via some python scripting?

Finally the question I should have asked in the original question. Am I going about this all wrong? I think using Notepad++ to correct the Stanford NER tags was necessary but maybe going straight from the tagged .txt to excel is the wrong approach.

https://www.dropbox.com/s/k5m8yag6tpae05w/HerodotusB1NER.txt

2ND EDIT: So I finally got around to playing with the regex expressions both of you provided and they are almost working perfectly. However, I think its trimming off some of the result set actually.

A perfect example is the character "Deïokes" who is being trimmed into just "okes/PERSON" after I run the regex search. I think the a-z part of the regex is ignoring special letters like the umlaut over the i in Deïokes.

How would I tweak the regex search to tolerate those sorts of special characters? If the regex cannot accomodate those special characters then I think it wouldn't be too manually intensive to go in and fix the special characters where they show up here and there.


Solution

  • I gave this another try and found an awfully more easy solution to just copy the stuff to Excel. I don't have Notepad++, but I do use PSPad occasionally if my IDE is not around. It offers pretty much the same features as Notepad++. Some things it does better and others it doesn't. The regex search is pretty good, and the search dialogue has a button that says Copy.

    Find dialogue

    I copied your file and used my regex from the other answer without the capture groups. We don't need them as it will copy the complete match. Remember the \b is a word boundary and not a real character that will be copied.

    Copied search results

    And voila, here we go. A list of names with their classification that should be easy enough to copy to Excel and split into columns there.