
Python pandas check if the last element of a list in a cell contains specific string


My dataframe df:

index                        url
1           [{'url': 'http://bhandarkarscollegekdp.org/'}]
2             [{'url': 'http://cateringinyourhome.com/'}]
3                                                     NaN
4                  [{'url': 'http://muddyjunction.com/'}]
5                       [{'url': 'http://ecskouhou.jp/'}]
6                     [{'url': 'http://andersrice.com/'}]
7       [{'url': 'http://durager.cz/'}, {'url': 'http:andersrice.com'}]
8            [{'url': 'http://milenijum-osiguranje.rs/'}]
9       [{'url': 'http://form-kind.org/'}, {'url': 'https://osiguranje'},{'url': 'http://beseka.com.tr'}]

I would like to select the rows where the last item of the list in the url column contains 'https', while skipping missing values.

My current script

df[df['url'].str[-1].str.contains('https',na=False)]

returns False for every row, even though some of them actually contain https.

Can anybody help with it?


Solution

  • I think you can first replace NaN with a list holding an empty url and then use apply:

    import numpy as np
    import pandas as pd

    # Sample data: each cell is a list of {'url': ...} dicts, or NaN
    df = pd.DataFrame({'url':[[{'url': 'http://bhandarkarscollegekdp.org/'}],
                              np.nan,
                              [{'url': 'http://cateringinyourhome.com/'}],
                              [{'url': 'http://durager.cz/'}, {'url': 'https:andersrice.com'}]]},
                      index=[1,2,3,4])
    
    print (df)
                                                     url
    1     [{'url': 'http://bhandarkarscollegekdp.org/'}]
    2                                                NaN
    3        [{'url': 'http://cateringinyourhome.com/'}]
    4  [{'url': 'http://durager.cz/'}, {'url': 'https...
    

    df.loc[df.url.isnull(), 'url'] = [[{'url':''}]]
    print (df)
                                                     url
    1     [{'url': 'http://bhandarkarscollegekdp.org/'}]
    2                                      [{'url': ''}]
    3        [{'url': 'http://cateringinyourhome.com/'}]
    4  [{'url': 'http://durager.cz/'}, {'url': 'https...
    
    print (df.url.apply(lambda x: 'https' in x[-1]['url']))
    1    False
    2    False
    3    False
    4     True
    Name: url, dtype: bool
    

    Alternative solution, which keeps the NaN values and applies the check only to the non-null rows:

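    # Check only the non-null rows; the NaN rows get NaN in the new column 'a'
    df.loc[df.url.notnull(), 'a'] = (
        df.loc[df.url.notnull(), 'url'].apply(lambda x: 'https' in x[-1]['url']))
    
    # Fill the skipped rows with False and assign back instead of using inplace
    df['a'] = df['a'].fillna(False).astype(bool)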
    print (df)
                                                     url      a
    1     [{'url': 'http://bhandarkarscollegekdp.org/'}]  False
    2                                                NaN  False
    3        [{'url': 'http://cateringinyourhome.com/'}]  False
    4  [{'url': 'http://durager.cz/'}, {'url': 'https...   True
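
The original attempt most likely returns False everywhere because `.str[-1]` yields dictionaries rather than strings, and `.str.contains` produces missing values for non-string elements, which `na=False` then turns into False. The two approaches above can also be folded into one NaN-safe helper that builds a boolean mask and selects the rows without modifying the url column. This is only a sketch: the helper name `last_url_contains` and the variables `mask` and `selected` are made up for illustration, and it assumes every non-missing cell is a list of dicts with a 'url' key.

```python
import numpy as np
import pandas as pd


def last_url_contains(cell, needle="https"):
    """Return True if the last dict in `cell` has a 'url' containing `needle`."""
    # Missing values (NaN) and empty lists count as "no match".
    if not isinstance(cell, list) or not cell:
        return False
    return needle in cell[-1].get("url", "")


# Same sample data as in the answer above
df = pd.DataFrame(
    {"url": [
        [{"url": "http://bhandarkarscollegekdp.org/"}],
        np.nan,
        [{"url": "http://cateringinyourhome.com/"}],
        [{"url": "http://durager.cz/"}, {"url": "https:andersrice.com"}],
    ]},
    index=[1, 2, 3, 4],
)

# Boolean mask: one True/False per row
mask = df["url"].apply(last_url_contains)

# Keep only the rows whose last url contains 'https'
selected = df[mask]
print(selected)
```

With the sample data only the row with index 4 is kept, since it is the only one whose last url contains 'https'.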