
How to search a string for values and convert values


I have an API that accepts a string that needs to be properly formatted before going into the server.

The format for going into the server is the following:

"{Country ABR} {Day/Hour} {State ABR} {Title} {hrs.} ({Month Year}.)"

Several possibilities the client may send in:

"US Construction 7/70 hrs."

"IA Private hrs US."

"OIL US 8/70 hrs (Dec 2014)."

Several valid examples after converting user input are:

"US 7/70 MI Construction hrs."

"US IA Private hrs."

"US OIL 8/70 hrs. (Dec 2014)" 

The converter arranges the input into the correct order. "hrs" always ends with a period, and ({Month Year}) is moved outside the sentence, as shown.

So far I have:

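    // Requires: using System.Text.RegularExpressions;
    //           using Microsoft.VisualStudio.TestTools.UnitTesting;

    [TestMethod]
    public void TestMethod1()
    {
        var toConvert = "USA Construction 70/700 (Dec 2014) hrs";
        var converted = ConvertHOSRules(toConvert);

        // Assert.AreEqual takes the expected value first, then the actual one.
        Assert.AreEqual("USA 70/700 Construction hrs.(Dec 2014)", converted);
    }

    private string ConvertHOSRules(string input)
    {
        //todo refactor
        // Each piece carries its own trailing space (the + " " below).
        string country = Regex.Match(input, @"\b(USA|CAN|MEX)\b").Value + " ";
        string dateHours = Regex.Match(input, @"\d{1,2}\/\d{1,3}").Value + " ";
        string hrs = Regex.Match(input, @"\b(hrs)\b").Value;
        var date = Regex.Match(input, @"\(([a-zA-Z]+\s{1}[0-9]{4})\)").Value + " ";

        // Whatever is left after removing the recognised pieces is the title.
        string title = input.Replace(country, "").Replace(date, "").Replace(dateHours, "").Replace(hrs, "");

        // The pieces already end in a space, so no extra separators are added; Trim drops the last one.
        return $"{country}{dateHours}{title}{hrs}.{date}".Trim();
    }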

This is passing, but I need to refactor it. The + " " is like a null guard by a lazy programmer.


Solution

  • This question is quite interesting, especially from an algorithm-design point of view, because my guess is that regular expressions are not strictly necessary for it.
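
To illustrate the guess that regular expressions may not be needed, here is a minimal regex-free sketch. Everything in it is an assumption made for illustration: the class and method names (HosRuleConverter, Reorder), the country and state lookup tables, the rule that any token containing a slash is the hours value, and the output order, which follows the template from the question (country, hours, state, title, hrs., date).

```csharp
using System;
using System.Collections.Generic;

public static class HosRuleConverter
{
    // Hypothetical lookup tables; real lists would come from configuration.
    private static readonly HashSet<string> Countries = new HashSet<string> { "US", "UK", "FR" };
    private static readonly HashSet<string> States = new HashSet<string> { "CA", "NY", "IA", "TX", "MI" };

    public static string Reorder(string input)
    {
        // Cut out a parenthesised "(Month Year)" first so it is not split into two tokens.
        string date = null;
        int open = input.IndexOf('(');
        int close = input.IndexOf(')');
        if (open >= 0 && close > open)
        {
            date = input.Substring(open, close - open + 1);
            input = input.Remove(open, close - open + 1);
        }

        string country = null, state = null, hours = null;
        var title = new List<string>();

        foreach (var raw in input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string token = raw.TrimEnd('.');
            if (token.Length == 0 || token == "hrs") continue; // "hrs." is re-added below
            if (Countries.Contains(token)) country = token;
            else if (States.Contains(token)) state = token;
            else if (token.Contains("/")) hours = token;       // assumed: only hours contain '/'
            else title.Add(token);
        }

        // Assemble in the template order, skipping anything that was not supplied.
        var parts = new List<string>();
        foreach (var part in new[] { country, hours, state, string.Join(" ", title) })
        {
            if (!string.IsNullOrEmpty(part)) parts.Add(part);
        }
        parts.Add("hrs.");
        if (date != null) parts.Add(date);
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        Console.WriteLine(Reorder("IA Private hrs US."));          // US IA Private hrs.
        Console.WriteLine(Reorder("OIL US 8/70 hrs (Dec 2014).")); // US 8/70 OIL hrs. (Dec 2014)
    }
}
```

Because the sketch follows the template order, OIL ends up after 8/70, which differs from the third sample output in the question. It also always emits "hrs." even when the input omits it.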


    If we wish to do that with expressions, I would start with a simple expression such as listing possible countries and states in two capturing groups:

    (US|UK|FR)
    (CA|NY|IA|TX|MI)
    

    then our hours are well-structured:

    (\d+\/\d+)
    

    so are the month (.+?) and year ([0-9]+):

    \(((.+?)\s+([0-9]+))\)
    

    This is where we run into a problem with other keywords such as Construction and OIL. We could require a minimum of three characters so that they do not conflict with the two-letter states and countries:

    ([A-Z][a-z]{2,}|[A-Z]{3,})
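
As a quick check of the three-character idea, the sketch below anchors the same expression with ^ and $ so that a whole word must match, then tries it on a few words. The class name TitlePatternCheck and the helper IsTitle are hypothetical, and the anchoring is an addition for the test only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitlePatternCheck
{
    // Anchored copy of the title expression so that the whole word must match.
    private static readonly Regex Title = new Regex(@"^(?:[A-Z][a-z]{2,}|[A-Z]{3,})$");

    public static bool IsTitle(string word) => Title.IsMatch(word);

    public static void Main()
    {
        foreach (var word in new[] { "US", "MI", "OIL", "Construction", "hrs", "USA" })
        {
            Console.WriteLine($"{word}: {IsTitle(word)}");
        }
    }
}
```

The two-letter codes and the lowercase hrs are rejected, while OIL and Construction are accepted. Note that USA is accepted too, so with three-letter country codes like those in the question, the country alternative has to be tried before the title alternative.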
    

    Last, we would clean up the string by collecting all remaining spaces and other characters, such as hrs., which just repeats and which we might not want to match or capture:

    (.*?)
    

    Finally, we would combine using alternation:

    (US|UK|FR)|(CA|NY|IA|TX|MI)|(\d+\/\d+)|\(((.+?)\s+([0-9]+))\)|([A-Z][a-z]{2,}|[A-Z]{3,})|(.*?)
    

    DEMO

    Test

    using System;
    using System.Text.RegularExpressions;
    
    public class Example
    {
        public static void Main()
        {
            string pattern = @"(US|UK|FR)|(CA|NY|IA|TX|MI)|(\d+\/\d+)|\(((.+?)\s+([0-9]+))\)|([A-Z][a-z]{2,}|[A-Z]{3,})|(.*?)";
            string input = @"US 7/70 MI Construction hrs.
    US IA Private hrs.
    US OIL 8/70 hrs. (Dec 2014)
    UK 7/70 MI Construction hrs.
    UK IA Private hrs.
    UK OIL 8/70 hrs. (Dec 2014)
    FR 7/70 MI Construction hrs.
    FR IA Private hrs.
    FR OIL 8/70 hrs. (Dec 2014)";
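            // Multiline only changes how ^ and $ behave; the pattern uses neither, so it has no effect here.
            RegexOptions options = RegexOptions.Multiline;
    
            foreach (Match m in Regex.Matches(input, pattern, options))
            {
                // The trailing (.*?) alternative also yields empty matches, which are printed as ''.
                Console.WriteLine("'{0}' found at index {1}.", m.Value, m.Index);
            }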
        }
    }
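
The test program prints each match without saying which alternative produced it. One possible variation, sketched below, names the groups so each match can be labelled. The group names, the class name LabelledMatches, the added \b word boundaries, and the removal of the catch-all (.*?) alternative (to avoid empty matches) are all assumptions of this sketch, not part of the expression above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LabelledMatches
{
    // Same alternatives as the combined expression, but named, with word boundaries on the
    // two-letter codes and without the catch-all alternative.
    private static readonly Regex Token = new Regex(
        @"\b(?<country>US|UK|FR)\b" +
        @"|\b(?<state>CA|NY|IA|TX|MI)\b" +
        @"|(?<hours>\d+/\d+)" +
        @"|\((?<date>.+?\s+[0-9]+)\)" +
        @"|(?<title>[A-Z][a-z]{2,}|[A-Z]{3,})");

    private static readonly string[] Names = { "country", "state", "hours", "date", "title" };

    public static List<string> Label(string input)
    {
        var result = new List<string>();
        foreach (Match m in Token.Matches(input))
        {
            // Exactly one named group succeeds per match; report that one.
            foreach (var name in Names)
            {
                if (m.Groups[name].Success)
                {
                    result.Add(name + "=" + m.Groups[name].Value);
                    break;
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        // Prints: country=US, title=OIL, hours=8/70, date=Dec 2014
        Console.WriteLine(string.Join(", ", Label("US OIL 8/70 hrs. (Dec 2014)")));
    }
}
```

With labelled values in hand, rebuilding the string in the server's order becomes a matter of joining the non-empty parts.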
    


    RegEx

    If this expression is not what you need, it can be modified on regex101.com.

    RegEx Circuit

    jex.im visualizes regular expressions.