c# android regex string regex-group

How to fix this regex for mentions and hashtags?


I used a regex tool to build a regex for mentions and hashtags. It matches what I want in the inserted text, but I still need to solve the following matching problems:

  • Only match substrings that are delimited by whitespace on both sides. A valid hashtag or mention at the very beginning or end of the string should also be matched.

  • Each match should contain only the part without spaces: the spaces are part of the rule, not part of the matched substring.

The regex that I have used is the following: (([@]{1}|[#]{1})[A-Za-z0-9]+)

Some examples of valid and invalid matches:

"@hello friend" - @hello must be matched as a mention.
"@ hello friend" - here there should be no matches.
"hey@hello @hello" - here only the last @hello must be matched as a mention.
"@hello! hi @hello #hi ##hello" - here only the second @hello and #hi must be matched as a mention and hashtag respectively.

Another example, shown in an image, where only "@word" should be a valid mention:

[image]

Update 16:35 (GMT-4) 3/15/18

I found a way to solve the problem, using the tool in PCRE mode (server) and using negative lookbehind and negative lookahead:

(?<![^\s])(([@]{1}|[#]{1})[A-Za-z0-9]+)(?![^\s])

Here are the matches:

[image]

But now a question arises: does this regular expression work in C#, with both the negative lookahead and the negative lookbehind? In JavaScript, for example, it did not work: as seen in the tool, the pattern is marked with a red line.
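For reference, here is a minimal sketch that runs both patterns against the four examples above. It is written in Python rather than C#, purely so it can be run quickly; the names `ORIGINAL`, `LOOKAROUND` and `tags` are made up for this illustration. .NET's `Regex` class also supports negative lookahead and negative lookbehind, so the same pattern string should behave the same way there, but that is not what this snippet tests.

```python
import re

# The first attempt, and the lookaround version from the update above.
ORIGINAL = re.compile(r"(([@]{1}|[#]{1})[A-Za-z0-9]+)")
LOOKAROUND = re.compile(r"(?<![^\s])(([@]{1}|[#]{1})[A-Za-z0-9]+)(?![^\s])")


def tags(pattern, text):
    """Return the full text of every match (group 0), ignoring the inner groups."""
    return [m.group(0) for m in pattern.finditer(text)]


examples = [
    "@hello friend",
    "@ hello friend",
    "hey@hello @hello",
    "@hello! hi @hello #hi ##hello",
]

for text in examples:
    # Print both results side by side to see where the first attempt over-matches.
    print(repr(text), tags(ORIGINAL, text), tags(LOOKAROUND, text))
```

The double negative `(?<![^\s])` reads as "not preceded by a non-space character", which is also true at the start of the string. That is why no separate `^` alternative is needed, and likewise no `$` for `(?![^\s])` at the end.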


Solution

  • Try this pattern:

    (?:^|\s+)(?:(?<mention>@)|(?<hash>#))(?<item>\w+)(?=\s+)
    

    Here it is broken down:

    • (?: creates a non-capturing group
    • ^|\s+ matches the beginning of the String or one or more whitespace characters
    • (?: creates a second non-capturing group
    • (?<mention>@)|(?<hash>#) matches @ or # and captures it in the named group mention or hash respectively
    • (?<item>\w+) matches one or more word characters (letters, digits, or underscore) and captures them in the named group item for easy usage
    • (?=\s+) creates a positive lookahead that requires whitespace after the item without consuming it. As written, an item at the very end of the String is therefore not matched; (?=\s|$) would also accept the end of the String.

    Fiddle: Live Demo

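    Because the leading \s+ is consumed as part of the match, you would then need to use the underlying language to trim the returned match and remove the leading whitespace. The lookahead does not consume the trailing whitespace, so there is none to trim at the end.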
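As an alternative to trimming, the named groups can be read directly. The following is a rough sketch in Python, not C#: Python spells named groups `(?P<name>...)` where .NET uses `(?<name>...)`, and in C# the equivalent would be reading `match.Groups["item"].Value`. `extract_tags` and `END_AWARE_PATTERN` are hypothetical names introduced here, and the end-of-string variant is an assumption about what the question needs, not part of the pattern above.

```python
import re

# The pattern from above, with Python's (?P<name>...) spelling for named groups.
ANSWER_PATTERN = re.compile(
    r"(?:^|\s+)(?:(?P<mention>@)|(?P<hash>#))(?P<item>\w+)(?=\s+)"
)

# Hypothetical variant (an assumption, not from the answer): the lookahead
# also accepts the end of the string.
END_AWARE_PATTERN = re.compile(
    r"(?:^|\s+)(?:(?P<mention>@)|(?P<hash>#))(?P<item>\w+)(?=\s|$)"
)


def extract_tags(pattern, text):
    """Return (kind, item) pairs by reading the named groups, so no trimming is needed."""
    results = []
    for m in pattern.finditer(text):
        # Exactly one of the two marker groups takes part in any given match.
        kind = "mention" if m.group("mention") else "hash"
        results.append((kind, m.group("item")))
    return results


print(extract_tags(ANSWER_PATTERN, "@hello! hi @hello #hi ##hello"))
print(extract_tags(ANSWER_PATTERN, "hey@hello @hello"))
print(extract_tags(END_AWARE_PATTERN, "hey@hello @hello"))
```

The second call finds nothing because the last `@hello` has no whitespace after it. The variant with `(?=\s|$)` picks it up.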

    Update: Since you mentioned that you are using C#, here is a .NET solution that does not require RegEx. I have not tested the results, but I would guess that it is also faster than using RegEx.

    Personally, my flavor of .NET is Visual Basic, so this is a VB.NET solution, but you can just as easily run it through a converter, since it uses nothing that is unavailable in C#:

    Private Function FindTags(ByVal lead As Char, ByVal source As String) As String()
        Dim matches As List(Of String) = New List(Of String)
        Dim current_index As Integer = 0
    
        'Loop through all but the last character in the source
        For index As Integer = 0 To source.Length - 2
            'Reset the current index
            current_index = index
    
            'Check that the current character is the lead character ("@" or "#"), that it is at the start of the String or preceded by whitespace, and that the next character is a letter or digit
            If source(index) = lead AndAlso (index = 0 OrElse Char.IsWhiteSpace(source, index - 1)) AndAlso Char.IsLetterOrDigit(source, index + 1) Then
                'Loop until the next character is no longer a letter or digit
                Do
                    current_index += 1
                Loop While current_index + 1 < source.Length AndAlso Char.IsLetterOrDigit(source, current_index + 1)
    
                'Check if we're at the end of the line or the next character is whitespace
                If current_index = source.Length - 1 OrElse Char.IsWhiteSpace(source, current_index + 1) Then
                    'Add the match to the collection
                    matches.Add(source.Substring(index, current_index + 1 - index))
                End If
            End If
        Next
    
        Return matches.ToArray()
    End Function
    

    Fiddle: Live Demo
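To make the scanning steps easier to follow, here is a rough Python sketch of the same idea. It is an illustrative port, not a tested conversion of the VB.NET function. It assumes the character right after the lead must always be a letter or digit, and it uses `str.isalnum()` and `str.isspace()`, which may classify a few Unicode characters differently from `Char.IsLetterOrDigit` and `Char.IsWhiteSpace`.

```python
def find_tags(lead, source):
    """Return every lead-prefixed tag delimited by whitespace or the string edges."""
    matches = []
    # Stop one short of the end: a lead character with nothing after it is not a tag.
    for index in range(len(source) - 1):
        starts_token = index == 0 or source[index - 1].isspace()
        if source[index] != lead or not starts_token or not source[index + 1].isalnum():
            continue
        # Advance to one past the last letter or digit of the tag body.
        end = index + 1
        while end < len(source) and source[end].isalnum():
            end += 1
        # Accept only if the tag runs to the end of the string or is followed by whitespace.
        if end == len(source) or source[end].isspace():
            matches.append(source[index:end])
    return matches


sample = "@hello! hi @hello #hi ##hello"
print(find_tags("@", sample))
print(find_tags("#", sample))
```

One call handles one lead character, as in the function above, so mentions and hashtags are collected in two separate passes.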