I have two defaultdict(list) objects: ids (the first printout below) and V1 (the second):
3:42259955 [{'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'A', 'count': '1', 'positive_strand': '0', 'negative_strand': '1', 'percent_bias': 0.0, 'vaf': 0.0, 'mutation': 'snv', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'C', 'count': '0', 'positive_strand': '0', 'negative_strand': '0', 'percent_bias': '0', 'vaf': '0', 'mutation': 'snv', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'G', 'count': '223', 'positive_strand': '121', 'negative_strand': '102', 'percent_bias': 0.54, 'vaf': 1.0, 'mutation': 'no-mutation', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'T', 'count': '0', 'positive_strand': '0', 'negative_strand': '0', 'percent_bias': '0', 'vaf': '0', 'mutation': 'snv', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'N', 'count': '0', 'positive_strand': '0', 'negative_strand': '0', 'percent_bias': '0', 'vaf': '0', 'mutation': 'snv', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}]
3:42259955 [{'group': '5555', 'timepoint': 'D0', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '5555', 'timepoint': 'C1', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '5555', 'timepoint': 'C3', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '5555', 'timepoint': 'C4', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}]
What I intend to do is compare the two defaultdict lists:
First, check whether the keys match.
Within ids, check whether ref and base are the same; if yes, store the depth info, which is constant for the position. In the data above that is this entry:
{'chr': '3', 'ref': 'G', 'depth': '224', 'base': 'G', 'count': '223', 'positive_strand': '121', 'negative_strand': '102', 'percent_bias': 0.54, 'vaf': 1.0, 'mutation': 'no-mutation', 'group': '5555', 'timepoint': 'D0', 'st': '42259955'}
Then check whether a base in ids equals var in V1 (in this case 'C'); if yes, get its count from ids (which is 0 here).
Finally, check the timepoints: if a timepoint is in V1 but not in ids, take the timepoint info from V1 and fill in the other info from ids. A small sketch of these rules is below.
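To make those rules concrete, a minimal sketch for a single shared key (the variable names here are illustrative, not from my real code):

key = '3:42259955'

# The entry whose base equals ref carries the constant depth for the position.
ref_rec = next(d for d in ids[key] if d['base'] == d['ref'])
depth = ref_rec['depth']                           # '224'

# For each variant record, look up the matching base in ids.
for v in V1[key]:
    hit = next((d for d in ids[key] if d['base'] == v['var']), None)
    count = hit['count'] if hit else 0             # '0' for base 'C' here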
Desired output
position timepoint chr st depth count base positive_strand negative_strand percent_bias vaf
3:42259955 D0 3 42259955 224 0 C 0 0 0 0
3:42259955 C1 3 42259955 224 0 C 0 0 0 0
3:42259955 C3 3 42259955 224 0 C 0 0 0 0
3:42259955 C4 3 42259955 224 0 C 0 0 0 0
What I have so far:
def getValueOf(k, L):
    # NB: this returns an (index, value) tuple taken from the first dict in L
    # only, because the return fires on the first iteration.
    for i, v in enumerate(d[k] for d in L):
        return i, v

for key in ids.keys() & V1.keys():
    ## first condition: compare within each list
    if getValueOf('ref', ids[key]) == getValueOf('base', ids[key]):
        ref_count = getValueOf('count', ids[key])
        ref_depth = getValueOf('depth', ids[key])
        ## second condition: compare between the two defaultdicts
        if getValueOf('var', V1[key]) == getValueOf('base', ids[key]):
            var_count = getValueOf('count', ids[key])
Is there a more elegant way to do this than the above? Should I use a defaultdict in the first place, or would a nested dictionary work?
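For example, a nested dict could re-key each record list by base so the lookups are direct instead of list scans (a sketch; this assumes one record per (position, base) as in the data above, so if ids ever held several timepoints per position, the inner key would need to be (base, timepoint)):

ids_by_base = {
    pos: {d['base']: d for d in records}
    for pos, records in ids.items()
}
c_count = ids_by_base['3:42259955']['C']['count']   # '0'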
A new case: V1 still has four timepoints, but ids only has pileup data for timepoint C4:
3:42259955 [{'group': '555', 'timepoint': 'D0', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '555', 'timepoint': 'C1', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '555', 'timepoint': 'C3', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}, {'group': '555', 'timepoint': 'C4', 'chrm': '3', 'st': '42259955', 'en': '42259956', 'var': 'C'}]
3:42259955 [{'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'A', 'count': '1', 'positive_strand': '0', 'negative_strand': '1', 'percent_bias': 0.0, 'vaf': 0.01, 'mutation': 'snv', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'C', 'count': '4', 'positive_strand': '0', 'negative_strand': '4', 'percent_bias': 0.0, 'vaf': 0.03, 'mutation': 'snv', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'G', 'count': '135', 'positive_strand': '99', 'negative_strand': '36', 'percent_bias': 0.73, 'vaf': 0.96, 'mutation': 'no-mutation', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'T', 'count': '1', 'positive_strand': '0', 'negative_strand': '1', 'percent_bias': 0.0, 'vaf': 0.01, 'mutation': 'snv', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': 'N', 'count': '0', 'positive_strand': '0', 'negative_strand': '0', 'percent_bias': '0', 'vaf': '0', 'mutation': 'snv', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': '+A', 'count': '1', 'positive_strand': '0', 'negative_strand': '1', 'percent_bias': 0.0, 'vaf': 0.01, 'mutation': 'ins', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': '+C', 'count': '13', 'positive_strand': '0', 'negative_strand': '13', 'percent_bias': 0.0, 'vaf': 0.09, 'mutation': 'ins', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}, {'chr': '3', 'ref': 'G', 'depth': '141', 'base': '+T', 'count': '11', 'positive_strand': '0', 'negative_strand': '11', 'percent_bias': 0.0, 'vaf': 0.08, 'mutation': 'ins', 'group': '555', 'timepoint': 'C4', 'st': '42259955'}]
The output I currently get (the C4 values are copied to every timepoint):
position timepoint chr ref st depth count base positive_strand negative_strand percent_bias vaf
0 3:42259955 D0 3 G 42259955 141 4 C 0 4 0.0 0.03
1 3:42259955 C1 3 G 42259955 141 4 C 0 4 0.0 0.03
2 3:42259955 C3 3 G 42259955 141 4 C 0 4 0.0 0.03
3 3:42259955 C4 3 G 42259955 141 4 C 0 4 0.0 0.03
The output I want (zero defaults for the timepoints missing from ids):
position timepoint chr ref st depth count base positive_strand negative_strand percent_bias vaf
0 3:42259955 D0 3 G 42259955 141 0 C 0 0 0.0 0.00
1 3:42259955 C1 3 G 42259955 141 0 C 0 0 0.0 0.00
2 3:42259955 C3 3 G 42259955 141 0 C 0 0 0.0 0.00
3 3:42259955 C4 3 G 42259955 141 4 C 0 4 0.0 0.03
Ok, so I'm still not sure I've got your requirement down 100%. And it's certainly hard to know what oddities might crop up in a larger dataset, and also how inefficient this could become at scale. But I think I have solved your problem.
UPDATED TO SOLVE THE NEW PROBLEM:
This should be a viable solution. However, at this point there are so many conditions and wrinkles that I suspect we'd be better off, for both efficiency and simplicity of code, building tables with pandas and doing the joining and aggregating there, rather than looping over nested dicts. (See the sketch at the end.)
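One Python detail the code below leans on: when two dicts are unpacked into a new dict literal, later entries win for duplicate keys. A quick illustration:

a = {'timepoint': 'C4', 'count': 4}
b = {'timepoint': 'D0'}
print({**a, **b})   # {'timepoint': 'D0', 'count': 4} - b's timepoint wins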
import pandas as pd

# Defined at module level so it can also be used after building the DataFrame
fields = [
    'position', 'timepoint', 'chr',
    'st', 'depth', 'count', 'base',
    'positive_strand', 'negative_strand',
    'percent_bias', 'vaf'
]

def comb_dicts(ids, v1):
    # Default values for rows whose timepoint has no entry in ids
    def_cols = {
        'count': 0, 'positive_strand': 0,
        'negative_strand': 0, 'percent_bias': 0.0, 'vaf': 0.0
    }
    # Make a list for our output rows
    rows = []
    # Iterate through shared keys
    for k in ids.keys() & v1.keys():
        # Loop through the dicts in V1
        for d in v1[k]:
            # Find any matching dicts in the ids list - where the timepoints match.
            # Use ** unpacking to create new dicts - don't use update(), because
            # that would alter the originals.
            # Note the order of v and d: this ensures that any keys present in
            # both use the value from the V1 dict. This is important later.
            var_ds = [
                {**v, **d, 'position': k} for v in ids[k]
                if (
                    v['base'] != v['ref'] and
                    d['var'] == v['base'] and
                    d['timepoint'] == v['timepoint']
                )
            ]
            # If we didn't find any with matching timepoints in ids, then look
            # for ones without. This is where the order of v and d is important:
            # we keep the V1 timepoint this way. Since this case can produce
            # dicts that are identical to each other, we de-dup later with
            # pandas. By unpacking def_cols last, we overwrite the columns we
            # don't want copied from ids.
            if not var_ds:
                var_ds = [
                    {**v, **d, 'position': k, **def_cols} for v in ids[k]
                    if (
                        v['base'] != v['ref'] and
                        d['var'] == v['base']
                    )
                ]
            rows.extend(var_ds)
    return rows

my_rows = comb_dicts(ids, V1)
df = pd.DataFrame.from_records(my_rows)
df.drop_duplicates(inplace=True)
df = df[fields]

# If you want the de-duped rows as a list of dicts then do
uniq_rows = df.to_dict('records')
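And for completeness, a rough sketch of the pandas-join alternative mentioned above (untested beyond the sample data; it mirrors def_cols with fillna and then fills in the per-position constants):

import pandas as pd

# Flatten both defaultdicts to one row per record, keeping each key as 'position'.
ids_df = pd.DataFrame([{**d, 'position': k} for k, recs in ids.items() for d in recs])
v1_df = pd.DataFrame([{**d, 'position': k} for k, recs in V1.items() for d in recs])

# Left-join the variants against the non-reference pileup rows, matching the
# timepoint too, so every (position, timepoint) pair in V1 yields a row.
merged = v1_df.merge(
    ids_df[ids_df['base'] != ids_df['ref']],
    left_on=['position', 'group', 'st', 'timepoint', 'var'],
    right_on=['position', 'group', 'st', 'timepoint', 'base'],
    how='left',
)

# Timepoints absent from ids get the def_cols-style defaults...
merged = merged.fillna({'count': 0, 'positive_strand': 0,
                        'negative_strand': 0, 'percent_bias': 0.0, 'vaf': 0.0})
# ...while base, depth and chr are constant per position, so fill those in.
merged['base'] = merged['base'].fillna(merged['var'])
for col in ['depth', 'chr']:
    merged[col] = merged.groupby('position')[col].transform('first')

merged[fields]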