I am trying to parse a document containing e-mail addresses, but the StreamTokenizer splits each e-mail address into two separate parts.
I already set the @ sign as an ordinaryChar and space as the only whitespace:
StreamTokenizer tokenizer = new StreamTokenizer(freader);
tokenizer.ordinaryChar('@');
tokenizer.whitespaceChars(' ', ' ');
Still, all e-mail addresses are split up.
A line to parse looks like the following:
"Student 6 Name6 LastName6 del6@uni.at Competition speech University of Innsbruck".
The tokenizer splits del6@uni.at into "del6" and "uni.at".
Is there a way to tell the tokenizer not to split at @ signs?
So here is why it worked like it did:
StreamTokenizer regards its input much like a programming language tokenizer. That is, it breaks it up into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters etc.
So in fact it does rather sophisticated tokenizing - recognizing comments, quoted strings, numbers. Note that in a programming language, you can have a string like a = a+b;. A simple tokenizer that merely breaks the text by whitespace would break this into a, = and a+b;. But StreamTokenizer would break this into a, =, a, +, b, and ;, and will also give you the "type" for each of these tokens, so your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
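To make the a = a+b; case concrete, here is a minimal sketch that runs a StreamTokenizer with its default syntax table and labels each token by type. The class name TokenTypesDemo and the method describe are made up for illustration, and the input comes from a StringReader rather than a file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original question or answer.
class TokenTypesDemo {

    // Tokenizes the input with the default syntax table and returns one
    // description per token, based on the ttype field.
    static List<String> describe(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("word " + st.sval);      // text is in sval
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number " + st.nval);    // value is in nval
                        break;
                    default:
                        // For an ordinary character, sval is null and the
                        // character itself is stored in ttype.
                        out.add("char " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        describe("a = a+b;").forEach(System.out::println);
    }
}
```

For a = a+b; this yields three words and three single-character tokens. Code that only reads sval never sees the =, + and ; tokens, because they are reported through ttype alone.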
It wasn't recognizing the @ as whitespace. In fact, it was parsing it and returning it as a token. But its value was in the ttype field, and you were probably just looking at the sval.
A StreamTokenizer would recognize your line as:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6
The character @
The word uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
(This is the actual output of a little demo I wrote tokenizing your example line and printing by type).
In fact, by telling it that @ was an "ordinary character", you were telling it to take the @ as its own token (which it does anyway by default). The ordinaryChar() documentation tells you that this method:
Specifies that the character argument is "ordinary" in this tokenizer. It removes any special significance the character has as a comment character, word component, string delimiter, white space, or number character. When such a character is encountered by the parser, the parser treats it as a single-character token and sets ttype field to the character value.
(My emphasis).
If you had instead passed it to wordChars(), as in tokenizer.wordChars('@','@'), it would have kept the whole e-mail together. My little demo with that added gives:
The word Student
The number 6.0
The word Name6
The word LastName6
The word del6@uni.at
The word Competition
The word speech
The word University
The word of
The word Innsbruck
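The difference between the two settings can be reproduced with a sketch along these lines. It is not the demo mentioned above, only an approximation: EmailTokenDemo and tokens are hypothetical names, and the address del6@uni.at is pieced together from the two parts quoted in the question.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class comparing ordinaryChar('@') with wordChars('@', '@').
class EmailTokenDemo {

    // Returns the tokens of one line as strings. When atIsWordChar is true,
    // '@' is registered as a word character; otherwise as an ordinary one.
    static List<String> tokens(String line, boolean atIsWordChar) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        if (atIsWordChar) {
            st.wordChars('@', '@');   // '@' may now appear inside a word
        } else {
            st.ordinaryChar('@');     // '@' is always a token of its own
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(String.valueOf(st.nval));
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "Student 6 Name6 LastName6 del6@uni.at Competition speech";
        System.out.println(tokens(line, false)); // address split around '@'
        System.out.println(tokens(line, true));  // address kept as one word
    }
}
```

One caveat with the default settings: digits, '.' and '-' are still treated as number characters, so a token that begins with a digit is read as a number first. If the data can contain such addresses, calling resetSyntax() and defining the whole character table explicitly may be the safer route.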
If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise, your options depend on the shape of your data. If it is line-based (each line is a separate record, possibly with a different number of tokens on each line), you would typically read the lines one by one from a reader and split each of them using String.split(). If it is just a whitespace-delimited chain of tokens, Scanner might suit you better.