Search code examples
google-sheets

Count occurrences of given character per cell


Question

For example if I wanted to count the number of Ns in a column of strings how can I do this in Google Spreadsheets at a per cell basis (i.e. a formula that points at one cell at a time that I can drag down)?

String/Count table

Background

I'm having to decide a threshold -min-overlap <integer> for a program called TOMTOM** which compares similarity between PWMs*** of small DNA motifs****, N is a regular expression for any linear combination of the letters A, C, G and T. It would be nice if I could get an idea of the distribution of non-N lengths of my DNA motifs to help inform me of a proper -min-overlap <integer> value for TOMTOM.

And here are some real examples:

enter image description here

** TOMTOM is a tool for comparing a DNA motif to a database of known motifs. See here for more info.

*** PWM stands for Position Weight Matrix:

  • According to Wiki: A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.
  • According to this paper, it could be defined as:

Position weight matrix (PWM) or PWM‐like models are widely used to represent DNA‐binding preferences of proteins (Stormo, 2000). In these models, a matrix is used to represent the TF‐binding site (TFBS), with each element representing the contribution to the overall binding affinity from a nucleotide at the corresponding position. An inherent assumption of traditional PWM models is position independence; that is, the contribution of different nucleotide positions within a TFBS to the overall binding affinity is assumed to be additive. Although this approximation is broadly valid, nevertheless, it does not hold for several proteins (Man & Stormo, 2001; Bulyk et al, 2002). To improve quantitative modeling, PWM models have been extended to include additional parameters, such as k‐mer features, to account for position dependencies within TFBSs (Zhao et al, 2012; Mathelier & Wasserman, 2013; Mordelet et al, 2013; Weirauch et al, 2013; Riley et al, 2015). Interdependencies between nucleotide positions have a structural origin. For example, stacking interactions between adjacent base pairs form the local three‐dimensional DNA structure. TFs have preferences for sequence‐dependent DNA conformation, which we call DNA shape readout (Rohs et al, 2009, 2010).

OR, more contemporarily:

Based on this rationale, an alternative approach to augment traditional PWM models is the inclusion of DNA structural features. Models of TF–DNA binding specificity incorporating these DNA shape features achieved comparable performance levels to models incorporating higher‐order k‐mer features, while requiring a much smaller number of parameters (Zhou et al, 2015). We previously revealed the importance of DNA shape readout for members of the basic helix‐loop‐helix (bHLH) and homeodomain TF families (Dror et al, 2014; Yang et al, 2014; Zhou et al, 2015). We were also able, for Hox TFs, to identify which regions in the TFBSs used DNA shape readout, demonstrating the power of the approach to reveal mechanistic insights into TF–DNA recognition (Abe et al, 2015). This capability was extensively shown for only two protein families, due to the lack of large‐scale high‐quality TF–DNA binding data. With the recent abundance of high‐throughput measurements of protein–DNA binding, it is now possible to dissect the role of DNA shape readout for many TF families.

**** DNA motif: wiki: In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent.


Solution

  • A solution for one cell at a time (formula to be copied down):

    =len(A2)-len(SUBSTITUTE(A2,"N",""))