I have files with lines like the ones below, where every row has an index (a, b) followed by a list of numbers associated with it:
a\t12|123|08340|4985
b\t3856|12|276
What I want is to get this output:
12 a
123 a
8340 a
4985 a
3856 b
276 b
Note that I only want to output a unique set of the genes (the numbers), each with the index of its first occurrence in case the same number appears in more than one row.
I went about it this way: I tried adding the numbers to a dictionary with the letters as keys and the numbers as values, and finally outputting only the set() of the numbers together with the corresponding letter.
from collections import defaultdict

uniqueval = set()
d = defaultdict(list)
for line in file:
    fields = line.strip().split("\t")
    Idx = fields[0]
    Values = fields[1].split("|")
    for Val in Values:
        uniqueval.add(Val)
        d[Idx] += Val
for u in uniqueval:
    print u,"\t", [key for key in d.keys() if u in d.values()]
The script runs, but when I look into the dictionary, the Val's are all split by character, like this:
{'a': ['1','2','1'....], 'b': ['3', '8',....]}
I don't understand why the Values get split. Since it's in a for loop, I thought each Val would be added to the dict as a new value. Could you help me understand this issue?
Thank you.
You are extending your lists with Val:
d[Idx] += Val
This adds each character in Val as a separate element.
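The reason is that += on a list behaves like list.extend(), which iterates over its argument, and iterating over a string yields one character at a time. A minimal sketch of the difference (the keys and the value "123" are made up for illustration):

```python
from collections import defaultdict

d = defaultdict(list)

d["plus_equals"] += "123"      # iterates over the string, one character at a time
d["extend"].extend("123")      # same behaviour as += above
d["append"].append("123")      # adds the whole string as a single element

print(d["plus_equals"])
print(d["extend"])
print(d["append"])
```

The first two lists end up holding single characters, while the third holds the whole string as one element.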
Use append() instead:
d[Idx].append(Val)
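If the goal is just the output shown in the question, the list-per-letter dictionary may not be needed at all. Here is a rough sketch under a few assumptions: the function name first_occurrences is made up, the input is taken from a list of lines rather than a file, and the numbers are converted with int() because the expected output shows 08340 as 8340. It maps each number to the index of the first row it appears in, so a repeated number such as 12 keeps the letter it was seen with first:

```python
from collections import OrderedDict


def first_occurrences(lines):
    """Map each number to the index of the first row it appears in."""
    first_seen = OrderedDict()  # keeps insertion order on old and new Pythons
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        idx, values = line.split("\t", 1)
        for val in values.split("|"):
            number = int(val)  # assumption: leading zeros should be dropped
            # setdefault only stores idx if the number has not been seen yet
            first_seen.setdefault(number, idx)
    return first_seen


sample = ["a\t12|123|08340|4985\n", "b\t3856|12|276\n"]
for number, idx in first_occurrences(sample).items():
    print("{} {}".format(number, idx))
```

With a real file, passing the open file object instead of sample should work the same way, since the function only iterates over lines.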