
Regular Expression in C#: data extraction


I have data like this:

<td><a href="/New_York_City" title="New York City">New York</a></td>

And I would like to get New York out of it.

I don't have any regex skills whatsoever, but I have tried this:

// Requires: using System; using System.IO; using System.Text.RegularExpressions;
string pattern = "<td>.*</td>";
Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
Regex r1 = new Regex("<a .*>.*</a>", RegexOptions.IgnoreCase);
// The using block closes and disposes the reader, even if an exception is thrown.
using (StreamReader sr = new StreamReader("c:\\USAcityfile2.txt"))
{
    string read;
    while ((read = sr.ReadLine()) != null)
    {
        foreach (Match m in r.Matches(read))
        {
            // m.Value is already a string, so no ToString() call is needed.
            foreach (Match m1 in r1.Matches(m.Value))
                Console.WriteLine(m1.Value);
        }
    }
}

This gave me <a href="/New_York_City" title="New York City">New York</a>.

How can I get at the data between <a .*> and </a>? Thanks.


Solution

  • If you insist on a regex for this particular case, then try this:

    String pattern = @"(?<=<a[^>]*>).*?(?=</a>)";
    

    (?<=<a[^>]*>) is a positive lookbehind assertion to ensure that there is <a[^>]*> before the wanted pattern.

    (?=</a>) is a positive lookahead assertion to ensure that there is </a> after the pattern.

    .*? uses a lazy quantifier, so it matches as little as possible, stopping at the first </a>.

    A good reference for regular expressions is regular-expressions.info, in particular their lookaround explanation.
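
As a minimal sketch of how the pattern above could be applied, here is a small helper. The class name LinkTextExtractor and its two methods are hypothetical and are not part of the original answer. The first method uses the lookaround pattern as given. The second is an alternative technique, a capture group, which gives the same result on the sample row; its added \b keeps tags such as <abbr> from being mistaken for <a>.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are assumptions.
public static class LinkTextExtractor
{
    // The lookaround pattern from the answer: the match itself is only the link text.
    private static readonly Regex Lookaround =
        new Regex(@"(?<=<a[^>]*>).*?(?=</a>)", RegexOptions.IgnoreCase);

    // Alternative: match the whole element and capture the text in group 1.
    private static readonly Regex Capture =
        new Regex(@"<a\b[^>]*>(.*?)</a>", RegexOptions.IgnoreCase);

    // Returns the text of the first link in the line, or null if there is none.
    public static string ExtractWithLookaround(string line)
    {
        Match m = Lookaround.Match(line);
        return m.Success ? m.Value : null;
    }

    // Same contract as above, using the capture group instead of lookarounds.
    public static string ExtractWithGroup(string line)
    {
        Match m = Capture.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Inside the question's ReadLine loop, Console.WriteLine(LinkTextExtractor.ExtractWithGroup(read)) would print New York for the sample row. For anything beyond simple, regular rows like this one, an HTML parser is usually more robust than a regex.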