
How can I prevent apostrophes from being stripped out only from the midst of strings?


I need to preserve words containing only alphanumeric characters, hyphens, and apostrophes. I've got everything except apostrophes at present. The apostrophes in words like hadn't, didn't, and ain't are being stripped out by this code:

Regex onlyAlphanumericAndDash = new Regex("[^a-zA-Z0-9 -]");
. . .
foreach (string line in doc1StrArray) // doc1StrArray populated in FindAndStorePhrasesFoundInBothDocs()
{
    trimmedLine = line;
    // first replace the "long dash" with a space (otherwise the dashed words run together:
    // "consecrated—we" becomes "consecratedwe"
    trimmedLine = trimmedLine.Replace("—", " ");
    trimmedLine = onlyAlphanumericAndDash.Replace(trimmedLine, "");
    string[] subLines = trimmedLine.Split();
    foreach (string whirred in subLines)
    {
        if (String.IsNullOrEmpty(whirred)) continue;
        _whirred = whirred.Trim();
        iWordsInDoc1++;
        slAllDoc1Words.Add(_whirred);
        if (IgnoreWord(_whirred)) continue;
        InsertIntoWordStatsTable(_whirred, 1, 0);
    }
}

I need to preserve apostrophes, but only when they are within a word. Stated a little differently, apostrophes at the end of a word should be trimmed off, and also those at the beginning (where they act as single quotes); but apostrophes within a word -- in other words, those that indicate contractions, such as "hadn't" -- should be preserved.

What do I need to add to the Regex or how do I need to modify it to accomplish this?


Solution

  • I'm a bit confused by your variable name subLines: it implies lines of text, but it is created by Split(), and a parameterless Split() splits on whitespace. So does subLines contain words or lines? I think, despite the name, it contains words, so you can modify your regex to:

    [^a-zA-Z0-9 '-]
    

    This will leave all apostrophes alone. Note that I put the apostrophe before the - rather than after it. Placed after, the - would define a range (as it does in A-Z) from space to apostrophe, which is something to bear in mind if you have already tried adding it. When you want - in a character class to be a literal character rather than a range operator, put it first (right after the ^, if the class is negated) or last in the class.

    You can then remove apostrophes from the ends of each word with whirred.Trim('\''). There isn't any point in calling whirred.Trim() to remove whitespace: the string was already split on whitespace, so none is left in it. Both Trim() and Split() treat as whitespace any char for which Char.IsWhiteSpace(c) returns true.
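
Putting the two suggestions together, a minimal sketch might look like the following. The names WordExtractor and ExtractWords and the sample sentence are made up for illustration and are not from the question's code, and the database and counting calls are left out. The empty-string check is moved to after the trim, on the assumption that a token made up only of apostrophes should be skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
public static class WordExtractor
{
    // Matches anything that is NOT a letter, digit, space, apostrophe, or hyphen.
    // The hyphen comes last in the class so it is a literal, not a range operator.
    private static readonly Regex DisallowedChars = new Regex("[^a-zA-Z0-9 '-]");

    public static List<string> ExtractWords(string line)
    {
        var words = new List<string>();

        // Replace the long dash (U+2014) with a space so the words do not run together.
        string cleaned = line.Replace("\u2014", " ");
        cleaned = DisallowedChars.Replace(cleaned, "");

        // Parameterless Split() splits on whitespace.
        foreach (string token in cleaned.Split())
        {
            // Strip apostrophes from both ends; those inside the word are untouched.
            string word = token.Trim('\'');

            // Check for empty AFTER trimming (assumption: a lone "'" is not a word).
            if (word.Length == 0) continue;

            words.Add(word);
        }
        return words;
    }

    public static void Main()
    {
        string sample = "'We hadn't left,' she said\u2014didn't you?";
        foreach (string word in ExtractWords(sample))
        {
            Console.WriteLine(word);
        }
    }
}
```

For the sample sentence, the quoting apostrophes around 'We and left' are trimmed, while those in hadn't and didn't survive. A word that legitimately starts or ends with an apostrophe (such as 'twas) would lose it too, which matches the requirement as stated in the question.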