Tags: python, regex, alphabet, datumbox

Select rows with alphabet characters only


My data are in the following format:

data = [['@datumbox', '#machinelearning'],
 ['@datumbox', '#textanalysis'],
 ['@things_internet', '#iot'],
 ['@things_internet', '#h...'],
 ['@custmrcom', '#analytics123'],
 ['@custmrcom', '#strategy...123'],
 ['@custmrcom', '#1knowledgetweet'],
 ['@tamaradull', '#@bigbrother']]

I would like to check whether each hashtag contains any non-alphabetic character. If it does, the corresponding row should be removed.

The desired output is:

data = [['@datumbox', '#machinelearning'],
 ['@datumbox', '#textanalysis'],
 ['@things_internet', '#iot']]

I think I need to use re.sub (e.g., with re.compile('[^a-zA-Z]')). This is what I have so far:

newdata = []

for item in data:
    regex = re.compile('[^a-zA-Z]')
    if regex.match(item[1]):
        newdata.append([item[0], item[1]])

Any suggestions?


Solution

  • Use a list comprehension with a condition that skips the leading '#' and tests the rest of the hashtag with str.isalpha():

    newdata = [x for x in data if x[1][1:].isalpha()]
    print(newdata)
    

    This gives the output:

    [['@datumbox', '#machinelearning'], ['@datumbox', '#textanalysis'], ['@things_internet', '#iot']]
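
For comparison, the attempt in the question could also be repaired with a regular expression. The original loop keeps every row because regex.match only tests the start of the string, and the leading '#' already matches '[^a-zA-Z]'. A minimal sketch, assuming Python 3.4+ (for re.fullmatch) and that only ASCII letters should count; the name hashtag_re is illustrative and not from the original post:

```python
import re

data = [['@datumbox', '#machinelearning'],
        ['@datumbox', '#textanalysis'],
        ['@things_internet', '#iot'],
        ['@things_internet', '#h...'],
        ['@custmrcom', '#analytics123'],
        ['@custmrcom', '#strategy...123'],
        ['@custmrcom', '#1knowledgetweet'],
        ['@tamaradull', '#@bigbrother']]

# Compile once, outside any loop: a '#' followed by one or more ASCII letters.
hashtag_re = re.compile(r'#[A-Za-z]+')

# fullmatch succeeds only if the entire string fits the pattern,
# so any digit, dot, or '@' after the '#' rejects the row.
newdata = [row for row in data if hashtag_re.fullmatch(row[1])]
print(newdata)
```

Unlike str.isalpha(), which in Python 3 also accepts non-ASCII letters, this pattern restricts hashtags to A-Z and a-z, so the choice between the two depends on whether accented or non-Latin letters should pass.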