
python log2 over index of list of lists


I have a list of 4 lists that correspond to the 4 nucleotides (list 0 = A, list 1 = C, list 2 = G, list 3 = T). Each list has the same length as the sequence, with one entry per position. Each element is the frequency of that nucleotide at that position across the many sequences in a file. Here's an example with easy-to-work-with values (in reality I have long float values):

[[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],[0.1, 1.1, 2.1, 3.1, 4.1, 5.1],[0.2,1.2, 2.2, 3.2, 4.2, 5.2],[0.3, 1.3, 2.3, 3.3, 4.3, 5.3]]

So the example above shows that the sequence contains 6 nucleotides and that at position 0 the frequency of nucleotide A is 0.0. The frequency of nucleotide G (represented by the list at index 2) at position 2 is 2.2.
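
To make the indexing concrete, here is a small sketch using the example values above. The first index picks the nucleotide list and the second picks the position; the variable names `a_at_0`, `g_at_2` and `num_positions` are only illustrative.

```python
# Rows are A, C, G, T; columns are positions in the sequence.
lst = [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
       [0.1, 1.1, 2.1, 3.1, 4.1, 5.1],
       [0.2, 1.2, 2.2, 3.2, 4.2, 5.2],
       [0.3, 1.3, 2.3, 3.3, 4.3, 5.3]]

a_at_0 = lst[0][0]         # frequency of A at position 0
g_at_2 = lst[2][2]         # frequency of G at position 2
num_positions = len(lst[0])

print(a_at_0, g_at_2, num_positions)  # 0.0 2.2 6
```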

I would like to apply a mathematical function to the element at a particular position in each nucleotide's list, then sum those four values for that position alone (ICi). I then want to repeat this for every position and finally sum all of the ICi values into one value (IC). Below is the code; background is a list of 4 floats that I computed in another function and need for the calculation.

import math
def function_name(lst, background):
    ab, cb, gb, tb = background[0], background[1], background[2], background[3]
    a, c, g, t = lst[0][:], lst[1][:], lst[2][:], lst[3][:]
    pos = 0
    IC = 0
    for list in lst:
      for i in list:
          loga = math.log(((a[pos])/ab), 2)
          logc = math.log(((c[pos])/cb), 2)
          logg = math.log(((g[pos])/gb), 2)
          logt = math.log(((t[pos])/tb), 2)
          ICi = (a[pos]*loga + c[pos]*logc + g[pos]*logg + t[pos]*logt)
          IC += ICi
    return IC
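
Written as a formula, the calculation described above appears to be the following. The notation is an assumption made here for illustration: f_{n,i} is the frequency of nucleotide n at position i, b_n is the background value for nucleotide n, and L is the sequence length.

```latex
\[
IC = \sum_{i=0}^{L-1} IC_i,
\qquad
IC_i = \sum_{n \in \{A, C, G, T\}} f_{n,i} \, \log_2 \frac{f_{n,i}}{b_n}
\]
```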

Below is my data for lst and background as test data:

lst = [[0.011740473738414007, 0.005561277033985582, 0.5701338825952627, 0.5069001029866117, 0.22183316168898043, 0.24675592173017508, 0.29474768280123587, 0.27394438722966014, 0.25458290422245106, 0.2514933058702369], [0.0014418125643666324, 0.02286302780638517, 0.07929969104016478, 0.13511843460350154, 0.12461380020597322, 0.16416065911431513, 0.17466529351184346, 0.20844490216271885, 0.22265705458290422, 0.22327497425334705], [0.9802265705458291, 0.003913491246138002, 0.13347064881565396, 0.08012358393408857, 0.43480947476828014, 0.13861997940267765, 0.14150360453141092, 0.11987641606591143, 0.11678681771369721, 0.11328527291452112], [0.006591143151390319, 0.9676622039134912, 0.21709577754891865, 0.2778578784757981, 0.21771369721936149, 0.4490216271884655, 0.38722966014418125, 0.3944387229660144, 0.40205973223480945, 0.4074150360453141]]

background = [0.26125394569167243, 0.1628634426694565, 0.17949426101679142, 0.3891011102722321]

From this data, I should be getting an IC of about 4.74, but instead I'm getting around 91. Any help you could provide an eager, young Python student would be wonderful! I'm still learning, so I'm not trying to use tools like numpy; I need to learn how to write the code using builtins (if that makes sense). Thank you in advance for your help!


Solution

  • Why do you set pos, and where do you use i? I don't understand precisely what you are trying to do, but your code repeats the exact same calculation on the first element of each list and adds the result every time: pos never changes, and neither list nor i (from your nested for-loops) is used anywhere. With 4 lists of 10 elements the loop body runs 40 times, so you end up with 40 times the value for position 0 (about 2.3 per iteration). That is why the result doesn't make sense.

    Also avoid using the names of built-in types (such as list) for your variables; perhaps use nucleotide or something similar. Replace function_name with something more descriptive, like logsum (or whatever that number represents).

    If I try this I get 4.41 (which is closer but no cigar ;-) )

    import math
    def function_name(lst, background):
        # Background frequencies for A, C, G, T.
        ab, cb, gb, tb = background
        # One frequency list per nucleotide, all of the same length.
        a, c, g, t = lst
        IC = 0
        # Visit every position once; pos indexes into all four lists.
        for pos in range(len(a)):
            loga = math.log(a[pos] / ab, 2)
            logc = math.log(c[pos] / cb, 2)
            logg = math.log(g[pos] / gb, 2)
            logt = math.log(t[pos] / tb, 2)
            # Contribution of this position (ICi).
            ICi = a[pos]*loga + c[pos]*logc + g[pos]*logg + t[pos]*logt
            IC += ICi
        return IC
    

    Hope this helps you a little to figure out what you need ;-) Good luck!
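
If you want to stay with builtins, the same sum can be sketched more compactly with zip. This is only one possible variant, not necessarily what your course expects: the names `information_content` and `freqs` are made up for this example, and skipping zero frequencies is an assumption (your real data has none, but the toy example with 0.0 would otherwise make math.log raise a ValueError).

```python
import math

def information_content(freqs, background):
    """Sum f * log2(f / b) over every nucleotide at every position."""
    total = 0.0
    # zip(*freqs) regroups the four per-nucleotide lists into one
    # (A, C, G, T) tuple per position.
    for column in zip(*freqs):
        for f, b in zip(column, background):
            # Assumption: a zero frequency contributes nothing
            # (math.log(0, 2) would raise ValueError otherwise).
            if f > 0:
                total += f * math.log(f / b, 2)
    return total
```

Called as information_content(lst, background) with the test data from the question, this should give about 4.41, the same value as the loop above, because it adds up exactly the same terms.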