Friday, May 6, 2011

Regular expressions C# - is it possible to extract matches while matching?

Say I have a string whose format I need to verify, e.g. RR1234566-001 (2 letters, 7 digits, a dash, then 1 or more digits). I use something like:

        Regex regex = new Regex(patternString);
        // IsMatch already returns a bool, so it can be returned directly
        return regex.IsMatch(stringToMatch);

This tells me whether stringToMatch follows the pattern defined by patternString. What I actually need, though, are 1234566 and 001, i.e. portions of stringToMatch, which I currently end up extracting in a separate step afterwards.

Please note that this is NOT a question about how to construct regular expressions. What I am asking is: "Is there a way to match and extract values simultaneously without having to use a split function later?"

From stackoverflow
  • You can use regex groups to accomplish that. For example, this regex:

    (\d\d\d)-(\d\d\d\d\d\d\d)
    

    Let's match a telephone number with this regex:

    var regex = new Regex(@"(\d\d\d)-(\d\d\d\d\d\d\d)");
    var match = regex.Match("123-4567890");
    if (match.Success)
        ....
    

    If it matches, you will find the first three digits in:

    match.Groups[1].Value
    

    And the following seven digits in:

    match.Groups[2].Value
    

    P.S. In C#, you can use a @"" style string to avoid escaping backslashes. For example, @"\hi\" equals "\\hi\\". Useful for regular expressions and paths.

    P.S.2. The first group is stored in Groups[1], not Groups[0] as you might expect. That's because Groups[0] contains the entire matched string.

    Neil Williams : +1 Very thorough! I'd add one thing though, the reason that you start on match.Groups[1] and not [0] is because [0] contains the entire matched string.
  • Use grouping and Match instead.

    I.e.:

    // NOTE: fragment; belongs inside a method that returns bool.
    Regex re = new Regex("(\\d+)-(\\d+)");
    Match m = re.Match(stringToMatch);
    if (m.Success) {
      String part1 = m.Groups[1].Value;  // digits before the dash
      String part2 = m.Groups[2].Value;  // digits after the dash
      return true;
    }
    else {
      return false;
    }
    

    You can also name the groups, like this:

    Regex re = new Regex("(?<Part1>\\d+)-(?<Part2>\\d+)");
    

    and access them like this:

      String part1 = m.Groups["Part1"].Value;
      String part2 = m.Groups["Part2"].Value;
    
    gnomixa : very useful tip!
    Rob Fonseca-Ensor : +1 for named groups
  • You can use parentheses to capture groups of characters:

    string test = "RR1234566-001";
    
    // capture 2 letters, then 7 digits, then a hyphen, then 1 or more digits
    string rx = @"^([A-Za-z]{2})(\d{7})(\-)(\d+)$";
    
    Match m = Regex.Match(test, rx, RegexOptions.IgnoreCase);
    
    if (m.Success)
    {
        Console.WriteLine(m.Groups[1].Value);    // RR
        Console.WriteLine(m.Groups[2].Value);    // 1234566
        Console.WriteLine(m.Groups[3].Value);    // -
        Console.WriteLine(m.Groups[4].Value);    // 001
        return true;
    }
    else
    {
        return false;
    }
    
    Andomar : +1 for the right regex... btw if you use IgnoreCase, you can use [a-z] instead of [A-Za-z].
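
Pulling the answers above together, here is a minimal sketch of validating and extracting in a single Match call, using named groups for the format from the question. The names PartNumber, TryParse, serial and suffix are made up for illustration and do not come from the original posts; the pattern assumes the two leading letters are ASCII.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PartNumber
{
    // Assumed pattern for the question's format:
    // 2 letters, 7 digits, a dash, then 1 or more digits.
    private static readonly Regex Pattern =
        new Regex(@"^[A-Za-z]{2}(?<serial>\d{7})-(?<suffix>\d+)$");

    // Validates the input and extracts both numeric parts from the same Match,
    // so no separate split step is needed afterwards.
    public static bool TryParse(string input, out string serial, out string suffix)
    {
        Match m = Pattern.Match(input);
        serial = m.Success ? m.Groups["serial"].Value : null;
        suffix = m.Success ? m.Groups["suffix"].Value : null;
        return m.Success;
    }

    public static void Main()
    {
        if (TryParse("RR1234566-001", out string serial, out string suffix))
        {
            Console.WriteLine(serial);  // 1234566
            Console.WriteLine(suffix);  // 001
        }
        Console.WriteLine(TryParse("RR12-001", out _, out _));  // False
    }
}
```

Only the parts that are needed later are captured here; the letters and the dash are matched but not put in groups, which keeps the group list short.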
