java, apache-tika, opennlp

How to remove lines with non-word characters using Java?


Hi, I have crawled some HTML files using Apache Tika and written the text content to text files. The output contains some lines made up only of whitespace and stray symbols. When I parse these files line by line with the OpenNLP chunking parser, I get an error at ParserTool.parseLine in the code below for the lines that contain no words.

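    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Scanner;

    import opennlp.tools.cmdline.parser.ParserTool;
    import opennlp.tools.parser.Parse;
    import opennlp.tools.parser.ParserFactory;
    import opennlp.tools.parser.ParserModel;

    // Load the chunking parser model
    InputStream is = new FileInputStream("en-parser-chunking.bin");
    ParserModel model = new ParserModel(is);
    opennlp.tools.parser.Parser parser = ParserFactory.create(model);

    File dir = new File("C://htmlmetadata");
    File[] listDir = dir.listFiles();
    System.out.println("no of files:" + listDir.length);
    for (int i = 0; i < listDir.length; i++) {
        String path = listDir[i].getAbsolutePath();
        System.out.println("file name" + listDir[i].getName());
        Scanner scanner = new Scanner(new FileInputStream(path), "UTF-8");
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            if (line != null) {
                // The error occurs here for lines that contain no words
                Parse[] topParses = ParserTool.parseLine(line, parser, 1);
                for (Parse p : topParses) {
                    p.show();
                }
                System.out.println("line in if" + line);
                System.out.println("line length in if" + line.length());
            }
        }
        scanner.close();
    }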

I have tried checking line.length() > 0, but that does not work either, because the length of such a line is greater than 0 even though it contains only special characters. Please suggest how to keep only the lines that have words in them.

Thanks


Solution

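  • Iterate through each character of the line and check whether it is an ASCII letter (A-Z is 65-90, a-z is 97-122). If a character falls outside those ranges, skip that line:

    boolean onlyLetters = true;
    for (char c : line.toCharArray()) {
        if ((c >= 65 && c <= 90) || (c >= 97 && c <= 122)) {
            continue; // ASCII letter, keep checking
        } else {
            onlyLetters = false; // non-letter found
            break;
        }
    }
    if (!onlyLetters) {
        // skip that line
    }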
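
A character-by-character check that rejects every non-letter will also reject spaces and punctuation, so ordinary sentences would be skipped too. A looser variant, sketched below, keeps any line that contains at least one letter and strips control characters from it before parsing. The class and method names (`WordLineFilter`, `hasWord`, `clean`, `keepWordLines`) are hypothetical and only illustrate the idea. Whether "at least one letter" is the right threshold for the Tika output is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or OpenNLP.
final class WordLineFilter {

    // Matches any single Unicode letter.
    private static final Pattern HAS_LETTER = Pattern.compile("\\p{L}");

    // Matches ASCII control characters (tab, NUL, and so on).
    private static final Pattern CONTROL_CHARS = Pattern.compile("\\p{Cntrl}");

    // True if the line contains at least one letter.
    static boolean hasWord(String line) {
        return line != null && HAS_LETTER.matcher(line).find();
    }

    // Replaces control characters with spaces and trims the ends.
    static String clean(String line) {
        return CONTROL_CHARS.matcher(line).replaceAll(" ").trim();
    }

    // Keeps only lines that contain a letter, cleaned up.
    static List<String> keepWordLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (hasWord(line)) {
                kept.add(clean(line));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Hello world", "   ", "\u00a0\u2022 ***", "Tika 1.4 output");
        for (String line : keepWordLines(sample)) {
            System.out.println(line);
        }
    }
}
```

In the loop from the question, the call to ParserTool.parseLine could then be guarded with `if (WordLineFilter.hasWord(line))` and given `WordLineFilter.clean(line)` instead of the raw line.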