pandas numpy data-cleaning

Using np.where to extract an item raises "list index out of range"


I want to extract an item from two columns using np.where. The DataFrame looks like this (100,000+ rows in total):

Added description: "eNBID" is not always the third part of "ID"; the data is crazy dirty.

       ID         eNBID
460-00-2354-9     2354
4600023549        2354
46001368511       6789
4600332783112     32783

The result I want is:

       ID         eNBID     CI
460-00-2354-9     2354       9
4600023549        2354       9
46001368511       6789       11
4600332783112     32783      112

My code is:

df['Ci'] = np.where(df['ID'].astype(str).str.contains(r'-',na=False,regex=True), \
           df['ID'].apply(lambda x:re.split('-',str(x))[-1], \
           df.apply(lambda x:re.findall('([\w]{5})'+'([\w]{%d}'%(len(str(x.eNBID)))+'(\w*)',str(x.ID))[0][-1], axis=1))

The error is:

IndexError:('list index out of range','occurred at index 0')

Here is my new code:

cond = df['ID'].astype(str).str.contains('-',na=False,regex=True)
df['CI'] = np.where(cond,df['ID'].apply(lambda x:re.split('-',str(x))[-1]), \
          df[~cond].apply(lambda x:re.findall('([\w]{5})'+'([\w]{%d}'%(len(str(x.eNBID)))+'(\w*)',str(x.ID))[0][-1], axis=1)) if len(str(x.eNBID))<(len(str(x.ID))-5) else "null", axis=1))

The error is:

ValueError:operands could not be broadcast together with shapes(100883,)(100883,)(78,)

Can anyone help me?


Solution

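  • Try this: strip the hyphens from ID, then skip the first 5 characters plus as many characters as eNBID has digits. Whatever is left is the CI.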

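    # Strip the hyphens into a temporary helper column (cast to str in case some IDs are numeric)
    df['s'] = df['ID'].astype(str).str.replace('-', '', regex=False)
    # Skip the 5-character prefix plus the eNBID digits; the remainder is the CI
    df['Ci'] = df.apply(lambda x: x['s'][5 + len(str(x.eNBID)):], axis=1)
    df.drop('s', axis=1, inplace=True)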
    

    Output

         ID            eNBID    Ci
    0   460-00-2354-9   2354    9
    1   4600023549      2354    9
    2   46001368511     6789    11
    3   4600332783112   32783   112
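
For anyone who wants to run this end to end, here is a minimal sketch of the same idea on the four sample rows from the question. The helper name `extract_ci` and the constant `PREFIX_LEN` are made up for illustration. It assumes, as the answer above does, that the prefix is always 5 characters once the hyphens are removed, and that eNBID is stored as an integer or string (a float column would stringify as "2354.0" and throw the length off).

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "ID": ["460-00-2354-9", "4600023549", "46001368511", "4600332783112"],
    "eNBID": [2354, 2354, 6789, 32783],
})

# Assumption: the leading prefix is always 5 characters after removing hyphens
PREFIX_LEN = 5


def extract_ci(raw_id, enbid):
    """Hypothetical helper: drop hyphens, then skip the prefix and the eNBID digits."""
    digits = str(raw_id).replace("-", "")
    # Only the length of eNBID is used, so rows where the digits
    # do not literally match eNBID (dirty data) are still sliced
    return digits[PREFIX_LEN + len(str(enbid)):]


# One value per row, so the result always has the same length as df
df["CI"] = [extract_ci(i, e) for i, e in zip(df["ID"], df["eNBID"])]
print(df)
```

Because every row goes through the same function, the result has one value per row. That sidesteps the shape mismatch in the second attempt, where np.where was handed two arrays of 100883 rows and a third of only 78 rows built from df[~cond].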