Tags: java, split, tokenize, stringtokenizer, stringreader

Java Tokenization: Treat Anything Separated by an Underscore as One Word


I have a very simple tokenizer built on StreamTokenizer that splits mathematical expressions into their individual components (code below). The problem is that if the expression contains a variable called T_1, it is split into [T, _, 1], whereas I would like it returned as [T_1].

I have tried using a variable to check whether the last character was an underscore and, if so, appending the underscore to the element at list.size()-1, but that seems like a very clunky and inefficient solution. Is there a better way to do this? Thanks!

        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
        tokenizer.ordinaryChar('-'); // Don't parse minus as part of numbers.
        tokenizer.ordinaryChar('/'); // Treat slash as an operator; by default it starts a comment.
        List<String> tokBuf = new ArrayList<String>();
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) //While not the end of file 
        {
            switch (tokenizer.ttype) //Switch based on the type of token
            {
            case StreamTokenizer.TT_NUMBER: //Number
                tokBuf.add(String.valueOf(tokenizer.nval));
                break;
            case StreamTokenizer.TT_WORD: //Word
                tokBuf.add(tokenizer.sval);
                break;
            case '_': // Attempted workaround; sval is null for an ordinary character, so this does not rebuild T_1.
                tokBuf.add(tokBuf.size()-1, tokenizer.sval);
                break;
            default: //Operator
                tokBuf.add(String.valueOf((char) tokenizer.ttype));
            }
        }

        return tokBuf;

Solution

  • Register the underscore as a word character:

    tokenizer.wordChars('_', '_');
    

    This makes the tokenizer treat _ as part of a word, so T_1 is returned as a single TT_WORD token instead of three separate tokens.

    Addenda:

    This builds and runs:

    public static void main(String args[]) throws Exception {
        String s = "abc_xyz abc 123 1 + 1";
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
        tokenizer.ordinaryChar('-'); // Don't parse minus as part of numbers.
        tokenizer.ordinaryChar('/'); // Treat slash as an operator; by default it starts a comment.
        tokenizer.wordChars('_', '_'); // Treat underscore as part of a word.
    
    
        List<String> tokBuf = new ArrayList<String>();
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) //While not the end of file 
        {
            switch (tokenizer.ttype) //Switch based on the type of token
            {
            case StreamTokenizer.TT_NUMBER: //Number
                tokBuf.add(String.valueOf(tokenizer.nval));
                break;
            case StreamTokenizer.TT_WORD: //Word
                tokBuf.add(tokenizer.sval);
                break;
            default: //Operator
                tokBuf.add(String.valueOf((char) tokenizer.ttype));
            }
        }
        System.out.println(tokBuf);
    }
    
    Output:
    [abc_xyz, abc, 123.0, 1.0, +, 1.0]
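
To check the T_1 case from the question directly, the same fix can be wrapped in a method that returns the token list. This is only a minimal sketch: the class name UnderscoreTokenizer and the method name tokenize are made up for illustration and are not part of the original code.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class UnderscoreTokenizer {

    // Hypothetical helper: splits an expression into tokens,
    // keeping names such as T_1 together as one word.
    public static List<String> tokenize(String s) throws IOException {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(s));
        tokenizer.ordinaryChar('-');   // Report minus as an operator, not as a sign.
        tokenizer.ordinaryChar('/');   // Report slash as an operator, not as a comment start.
        tokenizer.wordChars('_', '_'); // Underscore may start or continue a word.

        List<String> tokBuf = new ArrayList<>();
        while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
            switch (tokenizer.ttype) {
            case StreamTokenizer.TT_NUMBER: // Numbers are always parsed as double.
                tokBuf.add(String.valueOf(tokenizer.nval));
                break;
            case StreamTokenizer.TT_WORD:   // Words, now including underscores.
                tokBuf.add(tokenizer.sval);
                break;
            default:                        // Any other single character, e.g. an operator.
                tokBuf.add(String.valueOf((char) tokenizer.ttype));
            }
        }
        return tokBuf;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("T_1 + 2 * x")); // prints [T_1, +, 2.0, *, x]
    }
}
```

Note that the digit after the underscore stays in the word because StreamTokenizer lets a word continue with digits once it has started with a word character. Numbers on their own still come back through nval as doubles, which is why 2 appears as 2.0.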