Search code examples
pythonpandastypeerror

Comparing two lists in pandas gives TypeError: 'float' object is not iterable


I want to compare two columns in the dataframe containing lists of strings. A part of my input data looks like that:

          CHROM      POS ID REF ALT QUAL FILTER INFO FORMAT P1-25 P1-93 P1-88 P1-6 P1-89 P1-26 P1-12 P1-92 P1-22 P1-90 P1-28 P1-95                                                  one_one                                         zero_zero                                       one_one_back                             zero_zero_next
0   NC_064017.1   153210  .   T   C    .      .    .     GT   0/0   0/0   1/1  0/0   1/1   1/1   1/1   0/0   0/0   0/1   0/0   0/0                             [P1-88, P1-89, P1-26, P1-12]  [P1-25, P1-93, P1-6, P1-92, P1-22, P1-28, P1-95]                                                NaN                      [P1-25, P1-12, P1-22]
1   NC_064017.1   965007  .   A   G    .      .    .     GT   0/0   1/1     .  0/1   1/1     .   0/0   1/1   0/0   0/1     .   0/1                                    [P1-93, P1-89, P1-92]                             [P1-25, P1-12, P1-22]                       [P1-88, P1-89, P1-26, P1-12]        [P1-25, P1-88, P1-12, P1-22, P1-28]
2   NC_064017.1   965038  .   C   T    .      .    .     GT   0/0   1/1   0/0  0/1   1/1     .   0/0   1/1   0/0   0/1   0/0   0/1                                    [P1-93, P1-89, P1-92]               [P1-25, P1-88, P1-12, P1-22, P1-28]                              [P1-93, P1-89, P1-92]                      [P1-93, P1-26, P1-92]
3   NC_064017.1  1084455  .   A   G    .      .    .     GT   1/1   0/0     .  1/1   1/1   0/0   1/1   0/0   0/1   1/1   0/1   1/1                [P1-25, P1-6, P1-89, P1-12, P1-90, P1-95]                             [P1-93, P1-26, P1-92]                              [P1-93, P1-89, P1-92]                       [P1-25, P1-6, P1-28]
4   NC_064017.1  1117756  .   A   C    .      .    .     GT   0/0   0/1   1/1  0/0     .   1/1   1/1   1/1   1/1   1/1   0/0   1/1        [P1-88, P1-26, P1-12, P1-92, P1-22, P1-90, P1-95]                              [P1-25, P1-6, P1-28]          [P1-25, P1-6, P1-89, P1-12, P1-90, P1-95]                      [P1-22, P1-90, P1-28]
5   NC_064017.1  1250643  .   T   C    .      .    .     GT   0/1   0/1   0/1  1/1   0/1   1/1   0/1   0/1   0/0   0/0   0/0   1/1                                     [P1-6, P1-26, P1-95]                             [P1-22, P1-90, P1-28]  [P1-88, P1-26, P1-12, P1-92, P1-22, P1-90, P1-95]                      [P1-22, P1-90, P1-28]
6   NC_064017.1  1250740  .   T   A    .      .    .     GT   0/1   1/1   0/1  1/1   0/1   1/1   0/1   0/1   0/0   0/0   0/0   0/1                                     [P1-93, P1-6, P1-26]                             [P1-22, P1-90, P1-28]                               [P1-6, P1-26, P1-95]  [P1-93, P1-6, P1-89, P1-12, P1-90, P1-95]
7   NC_064017.1  1372722  .   A   C    .      .    .     GT   1/1   0/0   1/1  0/0   0/0   1/1   0/0   1/1   1/1   0/0   1/1   0/0               [P1-25, P1-88, P1-26, P1-92, P1-22, P1-28]         [P1-93, P1-6, P1-89, P1-12, P1-90, P1-95]                               [P1-93, P1-6, P1-26]                      [P1-93, P1-26, P1-28]
8   NC_064017.1  1502890  .   G   T    .      .    .     GT     .   0/0   1/1  0/1   0/1   0/0   1/1   1/1   1/1   0/1   0/0   1/1                      [P1-88, P1-12, P1-92, P1-22, P1-95]                             [P1-93, P1-26, P1-28]         [P1-25, P1-88, P1-26, P1-92, P1-22, P1-28]                      [P1-89, P1-26, P1-95]

I've tried different methods using iteration but all raise the TypeError. Here is the first part of my code:

df['one_one'] = df.apply(lambda row: set(row['one_one']), axis=1)
df['one_one_back'] = df.apply(lambda row: set(row['one_one_back']), axis=1)
df['check'] = df.apply(lambda row: row['one_one'] in row['one_one_back'], axis=1)

and the error:

Traceback (most recent call last):
  File "/home/emoranska/Pulpit/burak/homozyg_regions/homozyg_bins.py", line 47, in <module>
    df['one_one_back'] = df.apply(lambda row: set(row['one_one_back']), axis=1)
  File "/home/emoranska/miniconda3/envs/burak/lib/python3.7/site-packages/pandas/core/frame.py", line 8740, in apply
    return op.apply()
  File "/home/emoranska/miniconda3/envs/burak/lib/python3.7/site-packages/pandas/core/apply.py", line 688, in apply
    return self.apply_standard()
  File "/home/emoranska/miniconda3/envs/burak/lib/python3.7/site-packages/pandas/core/apply.py", line 812, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/emoranska/miniconda3/envs/burak/lib/python3.7/site-packages/pandas/core/apply.py", line 828, in apply_series_generator
    results[i] = self.f(v)
  File "/home/emoranska/Pulpit/burak/homozyg_regions/homozyg_bins.py", line 47, in <lambda>
    df['one_one_back'] = df.apply(lambda row: set(row['one_one_back']), axis=1)
TypeError: 'float' object is not iterable

The second part of the code with itertuples function doesn't work too:

for x in df.itertuples():
    if x.one_one in x.one_one_back:
        df['start_end'] = 'start'
    else:
        df['start_end'] = 0

and raises the error:

Traceback (most recent call last):
  File "/home/emoranska/Pulpit/burak/homozyg_regions/homozyg_bins.py", line 55, in <module>
    if x.one_one in x.one_one_back:
TypeError: argument of type 'float' is not iterable

I've checked the data types:

print(df.loc[0, 'one_one'], type(df.loc[0, 'one_one']))
print(df.loc[1, 'one_one_back'], type(df.loc[1, 'one_one_back']))

for x in df.loc[1, 'one_one_back']:
    print(x, type(x))

and for sure there are lists of strings:

['P1-88', 'P1-89', 'P1-26', 'P1-12'] <class 'list'>
['P1-88', 'P1-89', 'P1-26', 'P1-12'] <class 'list'>
P1-88 <class 'str'>
P1-89 <class 'str'>
P1-26 <class 'str'>
P1-12 <class 'str'>

So where is the float and why do I see the TypeError? Please, help because I'm totally confused...


Solution

  • You can use a list comprehension to loop over your pairs, and boolean indexing to only consider the rows with non-NaN values:

    # which rows have non-NA values in both columns?
    m = df[['one_one', 'one_one_back']].notna().all(axis=1)
    
    # for those, check if the first list is a subset of the second one
    df.loc[m, 'check'] = [set(a)<=set(b) for a, b in
                          zip(df.loc[m, 'one_one'], df.loc[m, 'one_one_back'])]
    

    Output:

             CHROM      POS ID REF ALT QUAL FILTER INFO FORMAT P1-25  ... P1-92 P1-22 P1-90 P1-28 P1-95                                            one_one  \
    0  NC_064017.1   153210  .   T   C    .      .    .     GT   0/0  ...   0/0   0/0   0/1   0/0   0/0                       [P1-88, P1-89, P1-26, P1-12]   
    1  NC_064017.1   965007  .   A   G    .      .    .     GT   0/0  ...   1/1   0/0   0/1     .   0/1                              [P1-93, P1-89, P1-92]   
    2  NC_064017.1   965038  .   C   T    .      .    .     GT   0/0  ...   1/1   0/0   0/1   0/0   0/1                              [P1-93, P1-89, P1-92]   
    3  NC_064017.1  1084455  .   A   G    .      .    .     GT   1/1  ...   0/0   0/1   1/1   0/1   1/1          [P1-25, P1-6, P1-89, P1-12, P1-90, P1-95]   
    4  NC_064017.1  1117756  .   A   C    .      .    .     GT   0/0  ...   1/1   1/1   1/1   0/0   1/1  [P1-88, P1-26, P1-12, P1-92, P1-22, P1-90, P1-95]   
    5  NC_064017.1  1250643  .   T   C    .      .    .     GT   0/1  ...   0/1   0/0   0/0   0/0   1/1                               [P1-6, P1-26, P1-95]   
    6  NC_064017.1  1250740  .   T   A    .      .    .     GT   0/1  ...   0/1   0/0   0/0   0/0   0/1                               [P1-93, P1-6, P1-26]   
    7  NC_064017.1  1372722  .   A   C    .      .    .     GT   1/1  ...   1/1   1/1   0/0   1/1   0/0         [P1-25, P1-88, P1-26, P1-92, P1-22, P1-28]   
    8  NC_064017.1  1502890  .   G   T    .      .    .     GT     .  ...   1/1   1/1   0/1   0/0   1/1                [P1-88, P1-12, P1-92, P1-22, P1-95]   
    
                                              zero_zero                                       one_one_back                             zero_zero_next  check  
    0  [P1-25, P1-93, P1-6, P1-92, P1-22, P1-28, P1-95]                                                NaN                      [P1-25, P1-12, P1-22]    NaN  
    1                             [P1-25, P1-12, P1-22]                       [P1-88, P1-89, P1-26, P1-12]        [P1-25, P1-88, P1-12, P1-22, P1-28]  False  
    2               [P1-25, P1-88, P1-12, P1-22, P1-28]                              [P1-93, P1-89, P1-92]                      [P1-93, P1-26, P1-92]   True  
    3                             [P1-93, P1-26, P1-92]                              [P1-93, P1-89, P1-92]                       [P1-25, P1-6, P1-28]  False  
    4                              [P1-25, P1-6, P1-28]          [P1-25, P1-6, P1-89, P1-12, P1-90, P1-95]                      [P1-22, P1-90, P1-28]  False  
    5                             [P1-22, P1-90, P1-28]  [P1-88, P1-26, P1-12, P1-92, P1-22, P1-90, P1-95]                      [P1-22, P1-90, P1-28]  False  
    6                             [P1-22, P1-90, P1-28]                               [P1-6, P1-26, P1-95]  [P1-93, P1-6, P1-89, P1-12, P1-90, P1-95]  False  
    7         [P1-93, P1-6, P1-89, P1-12, P1-90, P1-95]                               [P1-93, P1-6, P1-26]                      [P1-93, P1-26, P1-28]  False  
    8                             [P1-93, P1-26, P1-28]         [P1-25, P1-88, P1-26, P1-92, P1-22, P1-28]                      [P1-89, P1-26, P1-95]  False  
    
    [9 rows x 26 columns]