How to tokenize text with NLTK in Python


I have a text like this:

Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid() 
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet'
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist

I tokenize this text with word_tokenize in Python, and the output is:

Exception
org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid
cause
'org.hibernate.exception.SQLGrammarException
could
extract
ResultSet'
Caused
java.sql.SQLSyntaxErrorException
ORA-00942
table
view
exist

But as you can see, the second token is several words joined by dots. How can I separate these into individual words?

I use this Python code:

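>>> from nltk.tokenize import word_tokenize
>>> # stopwords: the collection of words to drop, defined elsewhere
>>> f = open('001.txt')
>>> text = [w for w in word_tokenize(f.read()) if w not in stopwords]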

In fact, I want all the words to be separated, like this:

Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
'org
hibernate
exception
SQLGrammarException
could
extract
ResultSet'
Caused
java
sql
SQLSyntaxErrorException
ORA-00942
table
view
exist

Solution

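  • I found a simple way: use RegexpTokenizer from nltk.tokenize with the pattern \w+, which matches runs of letters, digits, and underscores, so dots and quotes act as separators: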

    >>> from nltk.tokenize import RegexpTokenizer
    >>> tokenizer = RegexpTokenizer(r'\w+')
    >>> text = [w for w in tokenizer.tokenize(open('001.txt').read()) if w not in stopwords]
    

    After removing stopwords, the output is as follows:

    Exception
    org
    baharan
    dominant
    dao
    core
    nonPlanAllocation
    INonPlanAllocationRepository
    getAllGrid
    cause
    org
    hibernate
    exception
    SQLGrammarException
    could
    extract
    ResultSet
    Caused
    java
    sql
    SQLSyntaxErrorException
    ORA
    00942
    table
    view
    exist
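
One thing to watch with the plain \w+ pattern is that a hyphen is also a separator, so ORA-00942 comes out as ORA and 00942. If the error code should stay in one piece, a slightly wider pattern may help. Below is a minimal sketch, not the original poster's code. It uses the standard library's re.findall, which, as far as I know, is what RegexpTokenizer does with the pattern in its default mode, so the same pattern string should work when passed to RegexpTokenizer. The STOPWORDS set and the tokenize helper are made up for illustration, since the post does not show its stopword list.

```python
import re

# Sample text from the question.
TEXT = (
    "Exception in org.baharan.dominant.dao.core.nonPlanAllocation."
    "INonPlanAllocationRepository.getAllGrid() \n"
    "with cause = 'org.hibernate.exception.SQLGrammarException: "
    "could not extract ResultSet'\n"
    "Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: "
    "table or view does not exist"
)

# Hypothetical stand-in for the stopword list, which the post does not show.
STOPWORDS = {"in", "with", "by", "not", "or", "does"}

# Word characters joined by hyphens stay together; dots, quotes,
# and other punctuation act as separators.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*")


def tokenize(text, stopwords=STOPWORDS):
    """Return the tokens of text that are not in stopwords."""
    return [tok for tok in TOKEN_PATTERN.findall(text) if tok not in stopwords]


tokens = tokenize(TEXT)
print("\n".join(tokens))
```

With this pattern, the dotted package names are still split into separate words, while ORA-00942 is kept as a single token. The non-capturing group (?:...) matters here: with a capturing group, findall would return the group contents instead of the whole match.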