I want to compare two columns in the dataframe containing lists of strings. A part of my input data looks like that:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P1-25 P1-93 P1-88 P1-6 P1-89 P1-26 P1-12 P1-92 P1-22 P1-90 P1-28 P1-95 one_one zero_zero one_one_back zero_zero_next
0 NC_064017.1 153210 . T C . . . GT 0/0 0/0 1/1 0/0 1/1 1/1 1/1 0/0 0/0 0/1 0/0 0/0 [P1-88, P1-89, P1-26, P1-12] [P1-25, P1-93, P1-6, P1-92, P1-22, P1-28, P1-95] NaN [P1-25, P1-12, P1-22]
1 NC_064017.1 965007 . A G . . . GT 0/0 1/1 . 0/1 1/1 . 0/0 1/1 0/0 0/1 . 0/1 [P1-93, P1-89, P1-92] [P1-25, P1-12, P1-22] [P1-88, P1-89, P1-26, P1-12] [P1-25, P1-88, P1-12, P1-22, P1-28]
2 NC_064017.1 965038 . C T . . . GT 0/0 1/1 0/0 0/1 1/1 . 0/0 1/1 0/0 0/1 0/0 0/1 [P1-93, P1-89, P1-92] [P1-25, P1-88, P1-12, P1-22, P1-28] [P1-93, P1-89, P1-92] [P1-93, P1-26, P1-92]
3 NC_064017.1 1084455 . A G . . . GT 1/1 0/0 . 1/1 1/1 0/0 1/1 0/0 0/1 1/1 0/1 1/1 [P1-25, P1-6, P1-89, P1-12, P1-90, P1-95] [P1-93, P1-26, P1-92] [P1-93, P1-89, P1-92] [P1-25, P1-6, P1-28]
4 NC_064017.1 1117756 . A C . . . GT 0/0 0/1 1/1 0/0 . 1/1 1/1 1/1 1/1 1/1 0/0 1/1 [P1-88, P1-26, P1-12, P1-92, P1-22, P1-90, P1-95] [P1-25, P1-6, P1-28] [P1-25, P1-6, P1-89, P1-12, P1-90, P1-95] [P1-22, P1-90, P1-28]
5 NC_064017.1 1250643 . T C . . . GT 0/1 0/1 0/1 1/1 0/1 1/1 0/1 0/1 0/0 0/0 0/0 1/1 [P1-6, P1-26, P1-95] [P1-22, P1-90, P1-28] [P1-88, P1-26, P1-12, P1-92, P1-22, P1-90, P1-95] [P1-22, P1-90, P1-28]
6 NC_064017.1 1250740 . T A . . . GT 0/1 1/1 0/1 1/1 0/1 1/1 0/1 0/1 0/0 0/0 0/0 0/1 [P1-93, P1-6, P1-26] [P1-22, P1-90, P1-28] [P1-6, P1-26, P1-95] [P1-93, P1-6, P1-89, P1-12, P1-90, P1-95]
7 NC_064017.1 1372722 . A C . . . GT 1/1 0/0 1/1 0/0 0/0 1/1 0/0 1/1 1/1 0/0 1/1 0/0 [P1-25, P1-88, P1-26, P1-92, P1-22, P1-28] [P1-93, P1-6, P1-89, P1-12, P1-90, P1-95] [P1-93, P1-6, P1-26] [P1-93, P1-26, P1-28]
8 NC_064017.1 1502890 . G T . . . GT . 0/0 1/1 0/1 0/1 0/0 1/1 1/1 1/1 0/1 0/0 1/1 [P1-88, P1-12, P1-92, P1-22, P1-95] [P1-93, P1-26, P1-28] [P1-25, P1-88, P1-26, P1-92, P1-22, P1-28] [P1-89, P1-26, P1-95]
I've tried different methods using iteration but all raise the TypeError. Here is the first part of my code:
df['one_one'] = df.apply(lambda row: set(row['one_one']), axis=1)
df['one_one_back'] = df.apply(lambda row: set(row['one_one_back']), axis=1)
df['check'] = df.apply(lambda row: row['one_one'] in row['one_one_back'], axis=1)
and the error:
Traceback (most recent call last):
File "/home/emoranska/Pulpit/burak/homozyg_regions/homozyg_bins.py", line 47, in <module>
df['one_one_back'] = df.apply(lambda row: set(row['one_one_back']), axis=1)
File "/home/emoranska/miniconda3/envs/burak/lib/python3.7/site-packages/pandas/core/frame.py", line 8740, in apply
return op.apply()
File "/home/emoranska/miniconda3/envs/burak/lib/python3.7/site-packages/pandas/core/apply.py", line 688, in apply
return self.apply_standard()
File "/home/emoranska/miniconda3/envs/burak/lib/python3.7/site-packages/pandas/core/apply.py", line 812, in apply_standard
results, res_index = self.apply_series_generator()
File "/home/emoranska/miniconda3/envs/burak/lib/python3.7/site-packages/pandas/core/apply.py", line 828, in apply_series_generator
results[i] = self.f(v)
File "/home/emoranska/Pulpit/burak/homozyg_regions/homozyg_bins.py", line 47, in <lambda>
df['one_one_back'] = df.apply(lambda row: set(row['one_one_back']), axis=1)
TypeError: 'float' object is not iterable
The second part of the code with itertuples
function doesn't work too:
for x in df.itertuples():
if x.one_one in x.one_one_back:
df['start_end'] = 'start'
else:
df['start_end'] = 0
and raises the error:
Traceback (most recent call last):
File "/home/emoranska/Pulpit/burak/homozyg_regions/homozyg_bins.py", line 55, in <module>
if x.one_one in x.one_one_back:
TypeError: argument of type 'float' is not iterable
I've checked the data types:
print(df.loc[0, 'one_one'], type(df.loc[0, 'one_one']))
print(df.loc[1, 'one_one_back'], type(df.loc[1, 'one_one_back']))
for x in df.loc[1, 'one_one_back']:
print(x, type(x))
and for sure there are lists of strings:
['P1-88', 'P1-89', 'P1-26', 'P1-12'] <class 'list'>
['P1-88', 'P1-89', 'P1-26', 'P1-12'] <class 'list'>
P1-88 <class 'str'>
P1-89 <class 'str'>
P1-26 <class 'str'>
P1-12 <class 'str'>
So where is the float and why do I see the TypeError? Please, help because I'm totally confused...
You can use a list comprehension to loop over your pairs, and boolean indexing to only consider the rows with non-NaN values:
# which rows have non-NA values in both columns?
m = df[['one_one', 'one_one_back']].notna().all(axis=1)
# for those, check if the first list is a subset of the second one
df.loc[m, 'check'] = [set(a)<=set(b) for a, b in
zip(df.loc[m, 'one_one'], df.loc[m, 'one_one_back'])]
Output:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT P1-25 ... P1-92 P1-22 P1-90 P1-28 P1-95 one_one \
0 NC_064017.1 153210 . T C . . . GT 0/0 ... 0/0 0/0 0/1 0/0 0/0 [P1-88, P1-89, P1-26, P1-12]
1 NC_064017.1 965007 . A G . . . GT 0/0 ... 1/1 0/0 0/1 . 0/1 [P1-93, P1-89, P1-92]
2 NC_064017.1 965038 . C T . . . GT 0/0 ... 1/1 0/0 0/1 0/0 0/1 [P1-93, P1-89, P1-92]
3 NC_064017.1 1084455 . A G . . . GT 1/1 ... 0/0 0/1 1/1 0/1 1/1 [P1-25, P1-6, P1-89, P1-12, P1-90, P1-95]
4 NC_064017.1 1117756 . A C . . . GT 0/0 ... 1/1 1/1 1/1 0/0 1/1 [P1-88, P1-26, P1-12, P1-92, P1-22, P1-90, P1-95]
5 NC_064017.1 1250643 . T C . . . GT 0/1 ... 0/1 0/0 0/0 0/0 1/1 [P1-6, P1-26, P1-95]
6 NC_064017.1 1250740 . T A . . . GT 0/1 ... 0/1 0/0 0/0 0/0 0/1 [P1-93, P1-6, P1-26]
7 NC_064017.1 1372722 . A C . . . GT 1/1 ... 1/1 1/1 0/0 1/1 0/0 [P1-25, P1-88, P1-26, P1-92, P1-22, P1-28]
8 NC_064017.1 1502890 . G T . . . GT . ... 1/1 1/1 0/1 0/0 1/1 [P1-88, P1-12, P1-92, P1-22, P1-95]
zero_zero one_one_back zero_zero_next check
0 [P1-25, P1-93, P1-6, P1-92, P1-22, P1-28, P1-95] NaN [P1-25, P1-12, P1-22] NaN
1 [P1-25, P1-12, P1-22] [P1-88, P1-89, P1-26, P1-12] [P1-25, P1-88, P1-12, P1-22, P1-28] False
2 [P1-25, P1-88, P1-12, P1-22, P1-28] [P1-93, P1-89, P1-92] [P1-93, P1-26, P1-92] True
3 [P1-93, P1-26, P1-92] [P1-93, P1-89, P1-92] [P1-25, P1-6, P1-28] False
4 [P1-25, P1-6, P1-28] [P1-25, P1-6, P1-89, P1-12, P1-90, P1-95] [P1-22, P1-90, P1-28] False
5 [P1-22, P1-90, P1-28] [P1-88, P1-26, P1-12, P1-92, P1-22, P1-90, P1-95] [P1-22, P1-90, P1-28] False
6 [P1-22, P1-90, P1-28] [P1-6, P1-26, P1-95] [P1-93, P1-6, P1-89, P1-12, P1-90, P1-95] False
7 [P1-93, P1-6, P1-89, P1-12, P1-90, P1-95] [P1-93, P1-6, P1-26] [P1-93, P1-26, P1-28] False
8 [P1-93, P1-26, P1-28] [P1-25, P1-88, P1-26, P1-92, P1-22, P1-28] [P1-89, P1-26, P1-95] False
[9 rows x 26 columns]