
Java StreamTokenizer splits Email address at @ sign


I am trying to parse a document containing email addresses, but the StreamTokenizer splits the E-mail address into two separate parts.

I already set the @ sign as an ordinaryChar and space as the only whitespace:

StreamTokenizer tokenizer = new StreamTokenizer(freader);
tokenizer.ordinaryChar('@');
tokenizer.whitespaceChars(' ', ' ');

Still, all E-mail addresses are split up.

A line to parse looks like the following:

"Student 6 Name6 LastName6 del6@uni.at  Competition speech University of Innsbruck".

The tokenizer splits del6@uni.at into "del6" and "uni.at".

Is there a way to tell the tokenizer to not split at @ signs?


Solution

  • So here is why it worked like it did:

    StreamTokenizer regards its input much like a programming language tokenizer. That is, it breaks it up into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters etc.

    So in fact it does rather sophisticated tokenizing: it recognizes comments, quoted strings and numbers. Note that in a programming language, you can have a statement like a = a+b;. A simple tokenizer that merely breaks the text at whitespace would split this into a, = and a+b;. But StreamTokenizer would break it into a, =, a, +, b, and ;, and will also give you the "type" of each of these tokens, so your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
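To make the difference concrete, here is a minimal sketch (not code from the original post) that tokenizes a = a+b; both ways. The class and method names are made up for illustration; only StreamTokenizer, String.split() and the ttype, sval and nval fields are real API.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, for illustration only.
class ExpressionTokens {

    // Naive approach: cut the text at runs of whitespace.
    static List<String> byWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // StreamTokenizer with its default syntax table.
    static List<String> byStreamTokenizer(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(st.sval);                       // identifiers end up in sval
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add(String.valueOf(st.nval));       // numbers end up in nval
                } else {
                    tokens.add(String.valueOf((char) st.ttype)); // anything else: ttype holds the character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(byWhitespace("a = a+b;"));      // [a, =, a+b;]
        System.out.println(byStreamTokenizer("a = a+b;")); // [a, =, a, +, b, ;]
    }
}
```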

    It wasn't recognizing the @ as whitespace. In fact, it was parsing it and returning it as a token. But its value was in the ttype field, and you were probably just looking at the sval.

    A StreamTokenizer would recognize your line as:

    The word Student
    The number 6.0
    The word Name6
    The word LastName6
    The word del6
    The character @
    The word uni.at
    The word Competition
    The word speech
    The word University
    The word of
    The word Innsbruck
    

    (This is the actual output of a little demo I wrote tokenizing your example line and printing by type).
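The demo itself isn't shown, so the following is only a guess at what such a program might look like. TokenTypeDemo and describeTokens are hypothetical names; the tokenizer is used with its default settings, reading from a StringReader instead of a file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of a "print tokens by type" demo.
class TokenTypeDemo {

    // Tokenizes one line and describes each token according to its ttype.
    static List<String> describeTokens(String line) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(line));
        List<String> result = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_WORD:
                        result.add("The word " + tokenizer.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        result.add("The number " + tokenizer.nval);
                        break;
                    default:
                        // Ordinary characters such as '@' land here:
                        // the character itself is the value of ttype.
                        result.add("The character " + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Student 6 Name6 LastName6 del6@uni.at  Competition speech University of Innsbruck";
        describeTokens(line).forEach(System.out::println);
    }
}
```

Code that only looks at sval never sees the @ token, because for an ordinary character the information is carried in ttype alone.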

    In fact, by telling it that @ was an "ordinary character", you were telling it to take the @ as its own token (which it does anyway by default). The ordinaryChar() documentation tells you that this method:

    Specifies that the character argument is "ordinary" in this tokenizer. It removes any special significance the character has as a comment character, word component, string delimiter, white space, or number character. When such a character is encountered by the parser, the parser treats it as a single-character token and sets ttype field to the character value.

    (My emphasis).

    In fact, if you had instead passed it to wordChars(), as in tokenizer.wordChars('@', '@'), it would have kept the whole e-mail address together as a single word. My little demo with that added gives:

    The word Student
    The number 6.0
    The word Name6
    The word LastName6
    The word del6@uni.at
    The word Competition
    The word speech
    The word University
    The word of
    The word Innsbruck
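A minimal sketch of the wordChars() variant, assuming the input is read from a string; WordCharsDemo and words are hypothetical names, and only word tokens are collected here to keep it short.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: treat '@' as a word constituent.
class WordCharsDemo {

    // Returns only the word tokens (ttype == TT_WORD) found in the line.
    static List<String> words(String line) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(line));
        // Mark the single-character range '@'..'@' as word characters,
        // so an '@' inside a word no longer ends it.
        tokenizer.wordChars('@', '@');
        List<String> result = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    result.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [LastName6, del6@uni.at, Competition]
        System.out.println(words("LastName6 del6@uni.at  Competition"));
    }
}
```

One caveat with the default settings: a token that starts with a digit is still read as a number first, so an address beginning with a digit would need further adjustment of the syntax table.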
    

    If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise, your options depend on the shape of your data. If it is line-based (each line is a separate record, possibly with a different number of tokens on each line), you would typically read the lines one by one from a reader and split each of them with String.split(). If it is just a whitespace-delimited chain of tokens, Scanner might suit you better.
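For those two alternatives, something along these lines might do. It is a sketch with hypothetical names (WhitespaceSplitDemo, splitLine, scanTokens), and it assumes no field needs to keep embedded spaces: "University of Innsbruck" from the sample line comes back as three separate tokens either way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical helpers showing the two whitespace-based alternatives.
class WhitespaceSplitDemo {

    // Line-based data: split one already-read line at runs of whitespace.
    static List<String> splitLine(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    // Token-stream data: Scanner.next() returns whitespace-delimited tokens.
    static List<String> scanTokens(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "Name6 LastName6 del6@uni.at  Competition";
        System.out.println(splitLine(line));  // [Name6, LastName6, del6@uni.at, Competition]
        System.out.println(scanTokens(line)); // [Name6, LastName6, del6@uni.at, Competition]
    }
}
```

Neither approach gives the @ any special meaning, so the address stays in one piece without further configuration.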