regex, text-mining

Regular expressions extracting multiple product attributes from product description


I have a set of product descriptions from which I want to extract product attributes using regular expressions.

https://regex101.com/r/HTTfNR/1

Product Description

BL460c G6 X5550 6G 1P Svr  
BL460c G6 E5540 6G 1P Svr  
BL460c G6 E5540 6G 1P Svr  
BL460c G6 E5530 6G 1P Svr  
BL460c G6 L5520 6G 1P Svr  
BL460c G6 E5520 6G 1P Svr  
BL460c G6 E5506 6G 1P Svr  
BL460c G6 E5502 6G 1P Svr  
BL280c G6 L5520 2G LP 1P Svr  
BL280c G6 E5520 2G 1P Svr  
BL280c G6 E5540 2G 1P Svr  
BL280c G6 E5502 2G 1P Svr  
S-Buy BL460c G6 E5540 8G 2P Svr  
S-Buy BL460c G6 E5530 4G 1P Svr  
S-Buy BL460c G6 E5530 4G 1P Svr  
BL2x220c G6 E5540 24G 2P 250GB Svr  
BL2x220c G6 E5530 24G 2P 250GB Svr  
BL2x220c G6 L5530 24G 2P 250GB Svr  
BL2x220c G6 L5520 24G 2P  
BL2x220c G6 E5640 2x2P 24G Svr  
BL2x220c G6 E5630 2x2P 24G Svr  
BL2x220c G6 L5640 2x2P 24G Svr  
BL2x220c G6 Mod0 Svr  
BL280c G6 X5650 6G 1P Svr  
BL280c G6 E5630 4G 1P Svr  
BL280c G6 L5640 4G 1P Svr  
BL280c G6 E5506 2G 1P Svr  
BL620c G7 E7-2860 32G Svr  
BL620c G7 E7-2850 32G Svr  
BL620c G7 E7-2830 32G Svr  
BL680c G7 E7-4860 64G Svr  
BL680c G7 E7-4860 64G Svr  
BL680c G7 E7-4850 64G Svr  
BL680c G7 E7-4830 64G Svr
BL680c G7 E7 4830 64G Svr   

I want to solve this using regular expressions.

I have tried the following pattern, but I cannot get it to work for all cases of my first step:

\b(?!\d)([ELX0-9-])\w{1,}

As a first step, I want to extract values such as X5550, E5540, E7-2860, and E7 4830. Can someone help me with a pattern that extracts this text from the descriptions above?


Solution

  • If the match should start with E, X, or L, you can drop the negative lookahead (?!\d) and put only those letters in the character class, without the hyphen and the digits.

    Then optionally match a single digit followed by a space or a hyphen, and finish with one or more digits that are followed by whitespace or the end of the string.

    \b[EXL](?:\d[ -])?\d+(?!\S)
    

    In parts

    • \b[EXL] Word boundary, then match one of E, X, or L
    • (?:\d[ -])? Optionally match a single digit followed by a space or a hyphen
    • \d+ Match 1+ digits
    • (?!\S) Negative lookahead, assert that the next character is not a non-whitespace character, i.e. the match is followed by whitespace or the end of the string
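
    To try the pattern outside a regex tester, here is a minimal sketch in Python, assuming the descriptions are available as a list of strings. The names MODEL_PATTERN and extract_model are hypothetical and chosen only for illustration; the sample rows are taken from the question.

```python
import re

# Pattern from the answer: a leading E, X, or L, an optional
# "digit + space/hyphen" part, then digits followed by whitespace or end of string.
MODEL_PATTERN = re.compile(r"\b[EXL](?:\d[ -])?\d+(?!\S)")

# A few rows copied from the question's sample data.
descriptions = [
    "BL460c G6 X5550 6G 1P Svr",
    "BL280c G6 L5520 2G LP 1P Svr",
    "S-Buy BL460c G6 E5540 8G 2P Svr",
    "BL620c G7 E7-2860 32G Svr",
    "BL680c G7 E7 4830 64G Svr",
    "BL2x220c G6 Mod0 Svr",
]


def extract_model(description):
    """Return the first matching token in the description, or None if absent."""
    match = MODEL_PATTERN.search(description)
    return match.group(0) if match else None


models = [extract_model(d) for d in descriptions]
for description, model in zip(descriptions, models):
    print(f"{description!r} -> {model!r}")
```

    Rows without such a token, for example "BL2x220c G6 Mod0 Svr", yield None, so they can be filtered out or handled separately in a later step.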
