I want to remove all characters such as commas, periods, quotation marks, etc. so that a line like this:
The infant Hans Patrick received his mammarial balm in the usual way, and not through the instrumentality of a patent bottle. One of his caprices, when yet a child, was to scream with all the force of his little lungs, when he was severely chastised by his parents. This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity.
...will be transformed into the following:
The infant Hans Patrick received his mammarial balm in the usual way and not through the instrumentality of a patent bottle One of his caprices when yet a child was to scream with all the force of his little lungs when he was severely chastised by his parents This singular habit was but a foreshadowing of that genius which has rendered him so eminent in his maturity
In this way I can split the individual words at the spaces and have no punctuation appendages at the end of the words.
I'm trying to do that with this code:
Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9 '-]");
. . .
doc1StrArray = File.ReadAllLines(sDoc1Path, Encoding.UTF8);
. . .
foreach (string line in doc1StrArray)
{
    trimmedLine = line;
    trimmedLine = trimmedLine.Replace("—", " ");
    trimmedLine = onlyAlphanumericSpaceApostropheAndHyphen.Replace(trimmedLine, "");
    string[] subWords = trimmedLine.Split();
...but it does not work in every case, and I can't see why. It usually works, but sometimes it strips out what look like space characters and runs two words together, so that the line ends up like this after stepping through the second line of code above:
The infant Hans Patrick received his mammarial balm in theusual way and not through the instrumentality of a patentbottle One of his caprices when yet a child was to screamwith all the force of his little lungs when he was severelychastised by his parents This singular habit was but aforeshadowing of that genius which has rendered him soeminent in his maturity
So, some of the words run together into a single word (no space between them):
theusual
patentbottle
screamwith
severelychastised
aforeshadowing
soeminent
Why is this happening, and how can I prevent it from continuing to happen?
It seems the spaces between those words are not space characters. Given what the text looks like in a fixed width font, broken at the first issue (the usual):
The infant Hans Patrick received his mammarial balm in the
usual way, and not through the instrumentality of a patent
bottle. One of his caprices, when yet a child, was to scream
with all the force of his little lungs, when he was severely
chastised by his parents. This singular habit was but a
foreshadowing of that genius which has rendered him so
eminent in his maturity.
which shows all the problems occurring at a line break, it would appear they are newlines. You can work around this by changing the space in your regex to \s to retain all forms of whitespace (noting that the \ must be escaped in a C# regex):
Regex onlyAlphanumericSpaceApostropheAndHyphen = new Regex("[^a-zA-Z0-9\\s'-]");
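To check this on your own data, here is a minimal sketch of the \s approach together with a small diagnostic. The names WordSplitter, SplitWords and DescribeOddWhitespace are made up for illustration, and the sample string uses U+2028 (LINE SEPARATOR) purely as a stand-in, since I can't tell from the question which character is actually in your file. Note that File.ReadAllLines already breaks lines on \r and \n, so the culprit may well be some other whitespace character; the diagnostic prints its code point so you can see which one it is.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Keeps letters, digits, any whitespace (\s), apostrophes and hyphens;
    // everything else is removed.
    private static readonly Regex Disallowed = new Regex(@"[^a-zA-Z0-9\s'-]");

    // Hypothetical helper: cleans one line and splits it into words.
    public static string[] SplitWords(string line)
    {
        // \u2014 is the em dash that the question's code replaces with a space.
        string cleaned = line.Replace("\u2014", " ");
        cleaned = Disallowed.Replace(cleaned, "");

        // A null separator array splits on every whitespace character;
        // RemoveEmptyEntries drops the empty strings produced by runs of whitespace.
        return cleaned.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    // Hypothetical diagnostic: lists the code point of every whitespace
    // character in the line that is not an ordinary space (U+0020).
    public static string[] DescribeOddWhitespace(string line)
    {
        return line.Where(c => char.IsWhiteSpace(c) && c != ' ')
                   .Select(c => $"U+{(int)c:X4}")
                   .ToArray();
    }

    public static void Main()
    {
        // Assumption: U+2028 stands in for whatever non-space character the file contains.
        string sample = "in the\u2028usual way, and not through a patent\u2028bottle.";

        Console.WriteLine(string.Join("|", SplitWords(sample)));
        // in|the|usual|way|and|not|through|a|patent|bottle

        Console.WriteLine(string.Join(",", DescribeOddWhitespace(sample)));
        // U+2028,U+2028
    }
}
```

Because the odd whitespace now survives the regex, Split still sees a separator between the two words. An alternative, if you would rather normalise the text itself, is to replace every run of whitespace with a single space before stripping the punctuation.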