c# regex asp.net-mvc

Create a regex to get the text on the line below each match


My Input string

[Name]
Jhon

[Age]
45

[MobileNumber]
1020304050

Billing address                     Delivery address 
India                               India

I need to extract the value of each field from the string above.

Code

static void Main(string[] args)
{
    string strContent =
        @"[Name]
          Jhon
          
          [Age]
          45
          
          [MobileNumber]
          1020304050
          
          Billing address                     Delivery address 
          GJ-India                                  MH-India"
    ;

   
    string value = string.Empty;

    var match = Regex.Match(strContent, @"\[Name\]\s*(.*)", RegexOptions.Multiline);

    if (match.Success)
    {
        value = match.Groups[1].Value;
    }
    Console.WriteLine(value); //Jhon

    match = Regex.Match(strContent, @"\[Age\]\s*(.*)", RegexOptions.Multiline);
    if (match.Success)
    {
        value = match.Groups[1].Value;
    }
    Console.WriteLine(value); //45

    match = Regex.Match(strContent, @"\[MobileNumber\]\s*(.*)", RegexOptions.Multiline);
    if (match.Success)
    {
        value = match.Groups[1].Value;
    }
    Console.WriteLine(value); //1020304050

    match = Regex.Match(strContent, "Billing address (.*)", RegexOptions.Multiline);
    Console.WriteLine(value); //India

    match = Regex.Match(strContent, "Delivery address (.*)", RegexOptions.Multiline);
    Console.WriteLine(value); //India

    Console.ReadLine();
}

Output

Expected output

If I pass [Name], the result should be "Jhon".

Similarly, for Delivery address the expected result is India.

I've added the expected result in the comment.

Actual output

But currently I'm getting India for every field in the result.


Solution

    • [Age] doesn't mean "the string Age" but "any one of the three characters A, g, e". To match a literal [ or ], put a backslash before it. Inside a regular "..." string you need two backslashes, because the string literal itself needs a backslash to escape the one you want to pass to the Regex (a verbatim @"..." string needs only one).
    • \[Age\] (.*) would mean "the [Age] string, followed by a space, followed by the data line". You don't have a space after "[Age]" (the end of the line comes directly instead), so it won't match. Replace the space with a newline.
    • You don't need Multiline, as this will only change the meaning of ^ and $ that you don't use.
    • You put the match into the match variable, but then you WriteLine value instead (which still has the value of the match on [Name]). Use match.Groups[1].Value.
    • Your last match is not as simple as the others, because you switch from a "[Field1] Field1Value [Field2] Field2Value" format to a tabular one, which reads "Field1 Field2 Field1Value Field2Value". Your Field2Value therefore does not follow Field2: you have to detect the header line, find the position of your field name in it, and look at the same column position in the next line.
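To make the first point concrete, here is a minimal sketch showing how the same input behaves with an unescaped and an escaped pattern. The class and method names (BracketDemo, FirstMatch) are made up for this illustration and are not part of the question's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BracketDemo
{
    // Hypothetical helper: returns the matched text, or an empty string
    // when the pattern does not match (a failed Match has an empty Value).
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        const string input = "[Age]\n45";

        // Unescaped: a character class, so it matches one character among A, g, e.
        Console.WriteLine(FirstMatch(input, "[Age]"));      // A

        // Escaped inside a regular string: each backslash is doubled.
        Console.WriteLine(FirstMatch(input, "\\[Age\\]"));  // [Age]

        // Escaped inside a verbatim string: a single backslash is enough.
        Console.WriteLine(FirstMatch(input, @"\[Age\]"));   // [Age]
    }
}
```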

    The INI-style part

    So for the "[Field] Value" part, each one of your blocks will become:

    match = Regex.Match(strContent, "\\[Age\\]\n(.*)");
    Console.WriteLine(match.Groups[1].Value);
    

    You can see the full solution (including the tabular part) in a fiddle.
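If the input comes from a verbatim string or a file, the value lines may be indented and the line endings may be CRLF, in which case a bare "\n(.*)" would either fail to match or capture stray whitespace. The following is only a sketch of one way to cope with that, under those assumptions; IniFields and GetField are hypothetical names, not something from the question or the fiddle. Unlike the snippet above, it does use $, so it needs RegexOptions.Multiline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IniFields
{
    // Hypothetical helper: returns the line that follows "[fieldName]",
    // or an empty string when the field is absent.
    public static string GetField(string content, string fieldName)
    {
        // Regex.Escape protects any special characters in the field name.
        // [ \t]* skips indentation, \r?\n accepts both LF and CRLF endings,
        // and Multiline lets $ match at the end of the value line.
        var pattern = @"\[" + Regex.Escape(fieldName) + @"\][ \t]*\r?\n[ \t]*(.*?)[ \t\r]*$";
        var match = Regex.Match(content, pattern, RegexOptions.Multiline);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        var content = "[Name]\r\n  Jhon\r\n\r\n[Age]\r\n  45\r\n\r\n[MobileNumber]\r\n  1020304050\r\n";
        foreach (var name in new[] { "Name", "Age", "MobileNumber" })
            Console.WriteLine(name + ": " + GetField(content, name));
    }
}
```

Because each lookup gets its own return value, this also sidesteps the stale "value" variable problem from the question.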

    The tabular part

    I kept it separate for three reasons: it was not part of the question, it is far more complex, and it is my first C# program ever, so it lacks polish, conciseness, best practices, and so on.

            // Needs: using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
            // Look up a known field, either at the start of a line or after a field separator of at least 2 spaces.
            var headerMatch = Regex.Match(strContent, "(?:^|.*  )Billing address(?:  .*|$)", RegexOptions.Multiline);
            // Split the line to get the individual fields.
            var fieldsMatches = Regex.Matches(headerMatch.Value+"  ", "([^ ](?:[^ ]+| [^ ]+)*)  +", RegexOptions.Multiline);
            // Cast<Match>() keeps this compiling on .NET Framework, where MatchCollection is not IEnumerable<Match>.
            var fieldNames = fieldsMatches.Cast<Match>().Select(m => m.Value.Trim()).ToArray();
            var fieldPos = fieldsMatches.Cast<Match>().Select(m => m.Index).ToArray();
            var fieldLengths = fieldsMatches.Cast<Match>().Select(m => m.Length).ToArray();
            // Get the lines following the header line, until an empty line or the end of the block.
            var dataLines = Regex.Match(strContent.Substring(headerMatch.Index + headerMatch.Length), "(?:\n.+)*");
            // For each line, loop to isolate individual fields.
            var fieldVals = new Dictionary<string, string>();
            foreach(var fieldName in fieldNames)
                fieldVals.Add(fieldName, "");
            foreach(Match line in Regex.Matches(dataLines.Value, ".+"))
            {
                var fieldNum = 0;
                var ls = line.Value;
                foreach(var fieldName in fieldNames)
                {
                    var pos = fieldPos[fieldNum];
                    var length = fieldLengths[fieldNum];
                    string fragment = ls.Length <= pos ? "" : ls.Substring(pos, pos + length > ls.Length ? ls.Length - pos : length);
                    fragment = fragment.TrimEnd();
                    // For multiline field values, separate each segment from the previous with a newline,
                    // except if it starts with a space in which case it is just the wrapped tail of the previous (à la LDIF).
                    var concatenator = fieldVals[fieldName].Length > 0 && fragment.Length > 0 && fragment.Substring(0, 1) != " " ? "\n" : "";
                    fieldVals[fieldName] += concatenator+fragment;
                    ++fieldNum;
                }
            }
            foreach(var field in fieldVals)
                Console.WriteLine(field.Key+": "+field.Value.Replace("\n", " <newline> "));
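
If you only need one column at a time and each value fits on a single line, the header-position idea can be reduced to a much shorter sketch. This is a simplification, not a replacement for the code above: TableFields and GetColumn are hypothetical names, and it assumes that columns are separated by at least two spaces, that a value never contains two consecutive spaces, and that each value starts at or to the right of its header.

```csharp
using System;

public static class TableFields
{
    // Hypothetical helper: finds the header line containing columnName and
    // returns the text found at the same column offset on the following line.
    public static string GetColumn(string content, string columnName)
    {
        var lines = content.Replace("\r\n", "\n").Split('\n');
        for (var i = 0; i + 1 < lines.Length; i++)
        {
            var start = lines[i].IndexOf(columnName, StringComparison.Ordinal);
            if (start < 0) continue;

            var data = lines[i + 1];
            if (data.Length <= start) return string.Empty;

            // TrimStart tolerates a value that begins a little to the right of its header.
            var rest = data.Substring(start).TrimStart();
            // Assumption: the value ends at the next run of two spaces.
            var end = rest.IndexOf("  ", StringComparison.Ordinal);
            return (end < 0 ? rest : rest.Substring(0, end)).Trim();
        }
        return string.Empty;
    }

    public static void Main()
    {
        // PadRight keeps both columns aligned at offset 36.
        var content =
            "Billing address".PadRight(36) + "Delivery address\n" +
            "GJ-India".PadRight(36) + "MH-India";
        Console.WriteLine(GetColumn(content, "Billing address"));   // GJ-India
        Console.WriteLine(GetColumn(content, "Delivery address"));  // MH-India
    }
}
```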