Just curious how people usually deal with punctuation in machine translation.
For example, from language A to B we might have:
A: a b c d e f g
B: x y z, u v w
I am wondering how we should deal with the comma in language B. Say we're using a seq2seq model: should we simply remove it, or should we also learn an embedding for it and treat the comma the same way we treat other words?
I don't think any paper explicitly discusses this, unless I missed something.
A good application for Seq2Seq is machine translation.
In the case of English->German, we will see German sentences that require an additional comma, e.g.
EN: I shot him because the colonel had told me so.
DE: Ich habe auf ihn geschossen, weil es der Oberst mir befohlen hatte.
A good model will automatically learn that the first clause, before weil
(because), often has to end with a comma to be grammatical.
There shouldn't be a need to do extra pre-processing beforehand.
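
To make "treat the comma like any other word" concrete, here is a minimal sketch of a tokenizer that keeps punctuation as separate tokens and gives each one its own vocabulary id, so a seq2seq model can learn an embedding for it. The function names (tokenize, build_vocab), the regex, and the special tokens <pad> and <unk> are assumptions made for illustration; a real system would more likely rely on an existing tokenizer or a subword model.

```python
import re

def tokenize(sentence):
    # Runs of word characters become word tokens; every other
    # non-space character (comma, full stop, ...) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

def build_vocab(sentences):
    # Hypothetical special tokens; ids are assigned in order of first appearance.
    vocab = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

de = "Ich habe auf ihn geschossen, weil es der Oberst mir befohlen hatte."
tokens = tokenize(de)
vocab = build_vocab([de])

# The comma is an ordinary target-side token with its own id,
# so it can be embedded and predicted like any word.
ids = [vocab[token] for token in tokens]
print(tokens)
print(ids)
```

With this setup the comma before weil is just one more symbol in the target sequence, and nothing has to be removed from the training data.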