python, string, list, substring, list-comprehension

Retrieve a specific substring from each element in a list


I have been stuck on this for a few hours: I have a Series called size_col with 887 elements, and I want to retrieve the size (S, M, L, or XL) from each element. I have tried two different approaches, a list comprehension and a simple if/else loop, but neither works.

sizes = ['S', 'M', 'L', 'XL']

tshirt_sizes = []
[tshirt_sizes.append(i) for i in size_col if i in sizes]

Second attempt:

sizes = []
for i in size_col:
    if len(i) < 15:
        sizes.append(i.split(" / ", 1)[-1])
    else:
        sizes.append(i.split(" - ", 1)[-1])

I created two conditions because in some cases the size follows a ' - ' and in others it follows a '/'. I honestly don't know how to deal with that.

Example of the list:

T-Shirt Donna "Si dai. Ciao." - M
T-Shirt Donna "Honey" - L
T-Shirt Donna "Si dai. Ciao." - M
T-Shirt Donna "I do very bad things" - M
T-Shirt Donna "Si dai. Ciao." - M
T-Shirt Donna "Stai nel tuo (mind your business)" - White / S
T-Shirt Donna "Stay Stronz" - White / L
T-Shirt Donna "Stay Stronz" - White / M
T-Shirt Donna "Si dai. Ciao." - S
T-Shirt Donna "Je suis esaurit" - Black / S
T-Shirt Donna "Si dai. Ciao." - S
T-Shirt Donna "Teamo - Tequila" - S / T-Shirt

Solution

  • You'll need regular expressions here. Precompile a regex pattern and then use pattern.search inside a list comprehension.

    import re

    sizes = ['S', 'M', 'L', 'XL']
    # \b word boundaries stop the 'S' in "Shirt" or "Si" from matching
    p = re.compile(r'\b({})\b'.format('|'.join(sizes)))

    tshirt_sizes = [p.search(i).group(0) for i in size_col]
    

    print(tshirt_sizes)
    ['M', 'L', 'M', 'M', 'M', 'S', 'L', 'M', 'S', 'S', 'S', 'S']
    

    For added robustness, you may want a loop instead, since list comprehensions do not handle errors well. If a row contains no size, p.search returns None and calling .group on it raises AttributeError:

    tshirt_sizes = []
    for i in size_col:
        try:
            tshirt_sizes.append(p.search(i).group(0))
        except AttributeError:
            tshirt_sizes.append(None)
    

    Really, the only reason to use regex here is to handle the last row in your data correctly. In general, you should prefer string operations (namely, str.split) unless regex is unavoidable; they are much faster and more readable than regular-expression-based pattern matching and extraction.
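
If you want to try the str.split route mentioned above, here is a minimal sketch. It assumes the size always sits after the last ' - ' in the string, possibly next to a colour or product name separated by ' / ', as in your sample rows. The names SIZES and extract_size are only illustrative, and the three rows are copied from your example rather than from the full Series.

```python
SIZES = {"S", "M", "L", "XL"}


def extract_size(label):
    """Return the size token found after the last ' - ', or None if there is none."""
    # Assumption: everything after the last " - " holds the size, optionally
    # together with a colour or product name separated by " / ".
    tail = label.rsplit(" - ", 1)[-1]
    for token in tail.split(" / "):
        token = token.strip()
        if token in SIZES:
            return token
    return None


# A few rows taken from the example data in the question
size_col = [
    'T-Shirt Donna "Si dai. Ciao." - M',
    'T-Shirt Donna "Stay Stronz" - White / L',
    'T-Shirt Donna "Teamo - Tequila" - S / T-Shirt',
]

tshirt_sizes = [extract_size(label) for label in size_col]
print(tshirt_sizes)  # ['M', 'L', 'S']
```

Because the helper returns None when it finds no size, it can be used inside a list comprehension without a try/except. Splitting from the right with rsplit is what keeps the ' - ' inside "Teamo - Tequila" from being mistaken for the separator.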