Tags: r, regex, data-cleaning, gsub

Regular expression to get product attribute from product name in R


I have a set of product names and would like to extract the product size from each one
(1237ml, 370ML, 850g, 2400g, 11.2kg, 11.2kg, 2g, 200g, 300g).

The product names are a bit messy: the size has no fixed position or format. For example,

strings <- c("product brand A 1237ml Bundle of 6" 
            , "product milk choc370ML" 
            , "brand milk Vanilla Flavor 850g" 
            , "One 2400g, For 0-6 Month-Old Infants" 
            , "a+...two...6-12months...11.2kg...milk" 
            , "a+...two...11.2kg 6-12months ..milk" 
            , "Product 200g (10x2g)"
            , "[200g] Product" 
            , "Product A brand(300g)"
)

I am very new to regular expressions and am trying to use them in R, so I am not sure how to write an expression that covers all the cases here.

Below is the code that I'm using. As mentioned, it works for only some of the cases. Could someone please guide me on what the proper expression for these cases would be?

extract1 <- trimws(gsub(".* ([a-zA-Z0-9]+).*", "\\1", strings))
extract2 <- trimws(gsub(".*(...[0-9][Mm][Ll]).*", "\\1", strings))
extract3 <- trimws(gsub(".*(..[0-9][Mm][Ll]).*", "\\1", strings))
extract4 <- trimws(gsub(".*(...[0-9][Gg]).*", "\\1", strings))
extract5 <- trimws(gsub(".*(..[0-9][Gg]).*", "\\1", strings))
extract6 <- trimws(gsub(".*(...[0-9].[Gg]).*", "\\1", strings))
extract7 <- trimws(gsub(".*(..[0-9].[Gg]).*", "\\1", strings))

Solution

  • Your requirements are quite complex, but if you plan to use a single regex to extract those values, you can use

    regmatches(strings, regexpr(".*(?:\\d(?:\\.\\d+)?\\s*x\\s*)?\\K(?<!\\d)(?<!\\d\\.)\\d+(?:\\.\\d+)?(?:k?g|m?l)\\b|(?<!\\d)\\d+(?:\\.\\d+)?(?:k?g|m?l)(?=\\s*x\\s*\\d)", strings, perl=TRUE, ignore.case=TRUE))

    The main idea is to match the rightmost number that is followed by one of the specified unit-of-measure (UOM) abbreviations, giving priority to the numbers around x.

    Details:

    • .* - any zero or more chars other than line break chars, as many as possible
    • (?:\d(?:\.\d+)?\s*x\s*)? - an optional group matching a digit, then an optional sequence of . and one or more digits, and then an x enclosed with zero or more whitespace chars
    • \K - match reset operator that discards the text matched so far
    • (?<!\d)(?<!\d\.) - neither a digit nor a digit followed by . is allowed immediately to the left; since .* is greedy, without the second check 11.2kg would be matched as 2kg
    • \d+(?:\.\d+)?(?:k?g|m?l)\b - one or more digits, then an optional sequence of . and one or more digits and then kg or g or ml or l, ending at a word boundary
    • | - or
    • (?<!\d) - no digit immediately to the left is allowed
    • \d+(?:\.\d+)?(?:k?g|m?l) - one or more digits, then an optional sequence of . and one or more digits, and then kg/g/ml or l
    • (?=\s*x\s*\d) - followed by an x enclosed with zero or more whitespace chars and then a digit.
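
One thing to keep in mind: regmatches() silently drops elements that have no match, so the result can be shorter than the input and no longer line up with the product names. Below is a minimal sketch of a wrapper that returns NA for those elements instead. The helper name extract_size and the last sample string are made up for illustration. The pattern is written out here with the (?<!\d\.) check so that a decimal such as 11.2kg is kept whole.

```r
# Product names from the question, plus one made-up name without a size
# to show the NA case.
strings <- c("product brand A 1237ml Bundle of 6",
             "product milk choc370ML",
             "brand milk Vanilla Flavor 850g",
             "One 2400g, For 0-6 Month-Old Infants",
             "a+...two...6-12months...11.2kg...milk",
             "a+...two...11.2kg 6-12months ..milk",
             "Product 200g (10x2g)",
             "[200g] Product",
             "Product A brand(300g)",
             "Product without a size")

# The pattern split over several lines for readability; paste0() joins
# the pieces back into a single regex.
size_pattern <- paste0(
  ".*(?:\\d(?:\\.\\d+)?\\s*x\\s*)?\\K",                    # skip to the rightmost candidate
  "(?<!\\d)(?<!\\d\\.)\\d+(?:\\.\\d+)?(?:k?g|m?l)\\b",     # number + unit as a whole word
  "|(?<!\\d)\\d+(?:\\.\\d+)?(?:k?g|m?l)(?=\\s*x\\s*\\d)"   # or number + unit followed by "x <digit>"
)

# extract_size() is a hypothetical helper, not a base R function.
# It returns one value per input element, NA where nothing matched.
extract_size <- function(x, pattern = size_pattern) {
  m <- regexpr(pattern, x, perl = TRUE, ignore.case = TRUE)
  hit <- !is.na(m) & m != -1            # regexpr() returns -1 for "no match"
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)          # regmatches() only returns the hits
  out
}

sizes <- extract_size(strings)
print(data.frame(name = strings, size = sizes))
```

Because the result has the same length as the input, it can be assigned straight into a data frame column next to the product names.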