I have a text file that contains the following sample UTF-8 text:
ኣእምሮኣዊ/ADJ ጥዕና/N ።/PUN
ቅድሚ/PRE ብዙሕ/ADJ ዓመታት/N “/PUN ኣእምሮኣዊ/ADJ ስንክልና/N ብጋኔን/N ወይ/CON እከይ/ADJ መናፍስቲ/N ኢዩ/V_AUX ዝመጽእ/V_REL “/PUN ዝብል/V_REL ግጉይ/ADJ ኣመለኻኽታ/N ነይሩ/V_GER ።/PUN
ከም/CON ውጺኢቱ/N ድማ/CON ኣእምሮኣዊ/ADJ ስንክልና/N ዘጋጠሞም/ADJ ኣባላት/N ናይ/PRE ሓደ/NUM ሕብረተ-ሰብ/N ብኣሰቃቕን/ADJ ኢሰብኣውን/ADJ ኣገባብ/N ይተሓዙ/V_IMF ነይሮም/V_AUX ።/PUN
I am using LingPipe's implementation of an HMM POS tagger for the Brown corpus. The BrownPosCorpus class reads the zipped POS corpus as follows:
public class BrownPosCorpus implements PosCorpus {
    private final File mBrownZipFile;
    public BrownPosCorpus(File brownZipFile) {
        mBrownZipFile = brownZipFile;
    }
    public Parser<ObjectHandler<Tagging<String>>> parser() {
        return new BrownPosParser();
    }
    public Iterator<InputSource> sourceIterator() throws IOException {
        return new BrownSourceIterator(mBrownZipFile);
    }
    static class BrownSourceIterator extends Iterators.Buffered<InputSource> {
        private ZipInputStream mZipIn = null;
        public BrownSourceIterator(File brownZipFile) throws IOException {
            FileInputStream fileIn = new FileInputStream(brownZipFile);
            mZipIn = new ZipInputStream(fileIn);
        }
        public InputSource bufferNext() {
            ZipEntry entry = null;
            try {
                while ((entry = mZipIn.getNextEntry()) != null) {
                    if (entry.isDirectory()) continue;
                    String name = entry.getName();
                    if (name.equals("brown/CONTENTS")
                        || name.equals("brown/README")) continue;
                    return new InputSource(mZipIn);
                }
            } catch (IOException e) {
                // ignore and close and return null
            }
            Streams.closeQuietly(mZipIn);
            return null;
        }
    }
}
The BrownPosParser class parses the zipped Brown POS corpus as follows:
public class BrownPosParser
    extends StringParser<ObjectHandler<Tagging<String>>> {
    @Override
    public void parseString(char[] cs, int start, int end) {
        String in = new String(cs,start,end-start);
        String[] sentences = in.split("\n");
        for (int i = 0; i < sentences.length; ++i)
            if (!Strings.allWhitespace(sentences[i]))
                processSentence(sentences[i]);
    }
    public String normalizeTag(String rawTag) {
        String tag = rawTag;
        String startTag = tag;
        // remove plus, default to first
        int splitIndex = tag.indexOf('+');
        if (splitIndex >= 0)
            tag = tag.substring(0,splitIndex);
        int lastHyphen = tag.lastIndexOf('-');
        if (lastHyphen >= 0) {
            String first = tag.substring(0,lastHyphen);
            String suffix = tag.substring(lastHyphen+1);
            if (suffix.equalsIgnoreCase("HL")
                || suffix.equalsIgnoreCase("TL")
                || suffix.equalsIgnoreCase("NC")) {
                tag = first;
            }
        }
        int firstHyphen = tag.indexOf('-');
        if (firstHyphen > 0) {
            String prefix = tag.substring(0,firstHyphen);
            String rest = tag.substring(firstHyphen+1);
            if (prefix.equalsIgnoreCase("FW")
                || prefix.equalsIgnoreCase("NC")
                || prefix.equalsIgnoreCase("NP"))
                tag = rest;
        }
        // neg last, and only if not whole thing
        int negIndex = tag.indexOf('*');
        if (negIndex > 0) {
            if (negIndex == tag.length()-1)
                tag = tag.substring(0,negIndex);
            else
                tag = tag.substring(0,negIndex)
                    + tag.substring(negIndex+1);
        }
        // multiple runs to normalize
        return tag.equals(startTag) ? tag : normalizeTag(tag);
    }
    private void processSentence(String sentence) {
        String[] tagTokenPairs = sentence.split(" ");
        List<String> tokenList = new ArrayList<String>(tagTokenPairs.length);
        List<String> tagList = new ArrayList<String>(tagTokenPairs.length);
        for (String pair : tagTokenPairs) {
            int j = pair.lastIndexOf('/');
            String token = pair.substring(0,j);
            String tag = normalizeTag(pair.substring(j+1));
            tokenList.add(token);
            tagList.add(tag);
        }
        Tagging<String> tagging
            = new Tagging<String>(tokenList,tagList);
        getHandler().handle(tagging);
    }
}
The problem is that the following exception occurred while parsing the UTF-8 corpus. It is thrown from BrownPosParser.java:
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
[java] at java.lang.String.substring(String.java:1967)
[java] at BrownPosParser.processSentence(BrownPosParser.java:72)
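For what it's worth, the failing call can probably be reproduced outside LingPipe. In `processSentence`, `pair.lastIndexOf('/')` returns -1 for any piece that contains no slash, and `substring(0, -1)` then throws this exception. Splitting on a single space produces such pieces whenever a line has two consecutive spaces (an empty string) or a lone `\r` left over from Windows line endings. Whether that is what happens in this corpus is an assumption, not something the trace proves. The sketch below uses a hypothetical `SplitPairDemo` class (not part of the LingPipe demo) to show the failure and one defensive way to split the pairs:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LingPipe: splits "token/TAG" pairs like
// processSentence does, but skips pieces that have no usable '/'.
final class SplitPairDemo {

    /** Returns {token, tag}, or null if the pair lacks a token, a '/', or a tag. */
    static String[] splitPair(String pair) {
        String trimmed = pair.trim();            // drops a stray '\r' or spaces
        int j = trimmed.lastIndexOf('/');        // -1 when there is no slash
        if (j <= 0 || j == trimmed.length() - 1) return null;
        return new String[] { trimmed.substring(0, j), trimmed.substring(j + 1) };
    }

    /** Splits on runs of whitespace and keeps only well-formed pairs. */
    static List<String[]> splitSentence(String sentence) {
        List<String[]> pairs = new ArrayList<>();
        for (String pair : sentence.trim().split("\\s+")) {
            String[] tokenAndTag = splitPair(pair);
            if (tokenAndTag != null) pairs.add(tokenAndTag);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // What the original loop does with an empty piece from a double space.
        String empty = "";
        try {
            empty.substring(0, empty.lastIndexOf('/'));
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("reproduced: " + e.getClass().getSimpleName());
        }
        // Same sample text, with a double space and a trailing '\r' added.
        for (String[] p : splitSentence("ኣእምሮኣዊ/ADJ  ጥዕና/N ።/PUN\r"))
            System.out.println(p[0] + " -> " + p[1]);
    }
}
```

Silently skipping malformed pieces is a design choice; logging them instead would make it easier to find the offending line in the corpus.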
The full stack trace is given below:
C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags>ant eval-brown
Buildfile: C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags\build.xml
compile:
[javac] Compiling 11 source files to C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags\build\classes
eval-brown:
[java] COMMAND PARAMETERS:
[java] Sent eval rate=5
[java] Toks before eval=1000000
[java] Max n-best eval=32
[java] Max n-gram=8
[java] Num chars=128
[java] Lambda factor=8.0
[java] Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
[java] at java.lang.String.substring(String.java:1967)
[java] at BrownPosParser.processSentence(BrownPosParser.java:72)
[java] at BrownPosParser.parseString(BrownPosParser.java:20)
[java] at com.aliasi.corpus.StringParser.parse(StringParser.java:71)
[java] at EvaluatePos.parseCorpus(EvaluatePos.java:123)
[java] at EvaluatePos.run(EvaluatePos.java:75)
[java] at EvaluatePos.main(EvaluatePos.java:183)
[java] Java Result: 1
Which part of the code should I modify to properly parse the UTF-8 POS corpus?
Any help is much appreciated.
Not sure if it solves your issue, but to set the charset, change this line:
mZipIn = new ZipInputStream(fileIn);
to
mZipIn = new ZipInputStream(new BufferedInputStream(fileIn), Charset.forName("UTF-8"));
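One caveat: in the JDK, the `Charset` argument of `ZipInputStream(InputStream, Charset)` is used to decode ZIP entry names, not the bytes inside each entry. The content is decoded wherever bytes become characters, so it may also be worth declaring the encoding on the `InputSource` with `setEncoding("UTF-8")`. Whether LingPipe's `StringParser` honours that setting is an assumption to verify against its documentation. The sketch below uses only JDK classes and a hypothetical `ZipUtf8Demo` class; it round-trips UTF-8 text through an in-memory zip and decodes it explicitly:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of LingPipe.
final class ZipUtf8Demo {

    /** Builds a one-entry zip in memory (a stand-in for the corpus zip). */
    static byte[] zipOf(String entryName, String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zipOut = new ZipOutputStream(bytes)) {
            zipOut.putNextEntry(new ZipEntry(entryName));
            zipOut.write(text.getBytes(StandardCharsets.UTF_8));
            zipOut.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Wraps the current zip entry in an InputSource that declares UTF-8. */
    static InputSource utf8Source(ZipInputStream zipIn) {
        InputSource source = new InputSource(zipIn);
        source.setEncoding("UTF-8");
        return source;
    }

    /** Reads the first entry, decoding its bytes with the declared encoding. */
    static String readFirstEntry(byte[] zipBytes) {
        try (ZipInputStream zipIn =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            if (zipIn.getNextEntry() == null) return null;
            InputSource source = utf8Source(zipIn);
            Reader reader = new InputStreamReader(
                source.getByteStream(), source.getEncoding());
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int n; (n = reader.read(buf)) != -1; ) sb.append(buf, 0, n);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "ኣእምሮኣዊ/ADJ ጥዕና/N ።/PUN";
        String back = readFirstEntry(zipOf("corpus.txt", text));
        System.out.println(text.equals(back) ? "round trip ok" : "mismatch");
    }
}
```

In the question's code, the corresponding change would be to call `setEncoding("UTF-8")` on the `InputSource` created in `bufferNext()` before returning it.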