I reckon my question wasn't entirely clear, so here is another attempt at explaining what I'm after.
Here is the explanation I gave before, which apparently wasn't very good:
Using AWK I'm trying to 'slide a window' over a list of integers. If I split this dataset into (possibly overlapping) bins, how many datapoints fall into each bin? I'd like to set the bin size (window size) and the overlap between bins myself. This gives me an idea of the local datapoint density. I have a little AWK experience and I've been told that AWK should be able to do the job, so I'd prefer AWK. However, I'm also open to other ideas (Python, for example).
I've prepared some example files, which are included below. To make the idea easier to grasp, here is a very simplified example first:
Datarange 1 - 10
Integers in dataset: 2,4,5,6,9
####
####Example 1: Windowsize=5,overlap=2####
####
file name="1" contents are: (range is 1 - 5)
2
4
5
file name="4" contents are: (range is 4 - 8, that is, it overlaps the previous range by two)
4
5
6
file name="7" contents are: (range is 7 - 10; if the data range were larger, it would be 7 - 11)
9
####
####Example 2: Windowsize=3,overlap=0####
####
file name="1" contents are (range 1 - 3)
2
file name="4" contents are (range 4 - 6)
4
5
6
file name="7" contents are (range 7 - 9)
9
file name="10" contents are (range 10 - 10)
<none>
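For reference, the two simplified examples above can be reproduced with a short Python sketch. The function name `bin_windows` and its parameters are made up for illustration. It assumes that windows start at 1, that each new window starts `windowsize - overlap` positions after the previous one, and that windows containing no datapoints are still reported.

```python
def bin_windows(values, windowsize, overlap, max_pos):
    """Map each window start to the values that fall inside that window.

    Assumption: windows start at 1 and advance by windowsize - overlap;
    a window starting at s covers positions s .. s + windowsize - 1.
    """
    step = windowsize - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than windowsize")
    bins = {}
    for start in range(1, max_pos + 1, step):
        end = start + windowsize  # exclusive upper bound of this window
        bins[start] = [v for v in values if start <= v < end]
    return bins


data = [2, 4, 5, 6, 9]  # the integers from the simplified example
example1 = bin_windows(data, windowsize=5, overlap=2, max_pos=10)
example2 = bin_windows(data, windowsize=3, overlap=0, max_pos=10)

for start, members in example1.items():
    print(start, members)
```

Each key of the returned dictionary corresponds to one output file name, and the list behind it to that file's contents.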
Example input file
3579
3661
3752 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
3947 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
6734 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6865 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6915 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
8961
13471 EXON 13449 13532 + Solyc06g005020.1.1.2Solyc06g005020.1.1
13561 INTRON 13533 13710 + Solyc06g005020.1.1.2Solyc06g005020.1.1
22226 EXON 22106 22261 + Solyc06g005030.1.1.1Solyc06g005030.1.1
22516
22556
36903 INTRON 36836 36915 + Solyc06g005060.2.1.1Solyc06g005060.2.1
37377 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37605 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37935 3P_UTR 37801 38132 + Solyc06g005060.2.1.0Solyc06g005060.2.1
167942 5P_UTR 167930 167956 - Solyc06g005140.2.1.0Solyc06g005140.2.1
168020 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
168153 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
Example output file with different windowsizes and overlap
> awk -v windowsize=50000 -v overlap=0 -f awkscript input.file
> ls
1 50001
100001 150001
> cat 1
3579
3661
3752 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
3947 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
6734 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6865 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6915 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
8961
13471 EXON 13449 13532 + Solyc06g005020.1.1.2Solyc06g005020.1.1
13561 INTRON 13533 13710 + Solyc06g005020.1.1.2Solyc06g005020.1.1
22226 EXON 22106 22261 + Solyc06g005030.1.1.1Solyc06g005030.1.1
22516
22556
36903 INTRON 36836 36915 + Solyc06g005060.2.1.1Solyc06g005060.2.1
37377 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37605 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37935 3P_UTR 37801 38132 + Solyc06g005060.2.1.0Solyc06g005060.2.1
> cat 50001
> cat 100001
> cat 150001
167942 5P_UTR 167930 167956 - Solyc06g005140.2.1.0Solyc06g005140.2.1
168020 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
168153 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
> #And with some different parameters
> awk -v windowsize=160000 -v overlap=10000 -f awkscript input.file
> ls
1 10001
> cat 1
3579
3661
3752 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
3947 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1
6734 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6865 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
6915 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1
8961
13471 EXON 13449 13532 + Solyc06g005020.1.1.2Solyc06g005020.1.1
13561 INTRON 13533 13710 + Solyc06g005020.1.1.2Solyc06g005020.1.1
22226 EXON 22106 22261 + Solyc06g005030.1.1.1Solyc06g005030.1.1
22516
22556
36903 INTRON 36836 36915 + Solyc06g005060.2.1.1Solyc06g005060.2.1
37377 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37605 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37935 3P_UTR 37801 38132 + Solyc06g005060.2.1.0Solyc06g005060.2.1
> cat 10001
13471 EXON 13449 13532 + Solyc06g005020.1.1.2Solyc06g005020.1.1
13561 INTRON 13533 13710 + Solyc06g005020.1.1.2Solyc06g005020.1.1
22226 EXON 22106 22261 + Solyc06g005030.1.1.1Solyc06g005030.1.1
22516
22556
36903 INTRON 36836 36915 + Solyc06g005060.2.1.1Solyc06g005060.2.1
37377 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37605 EXON 36916 37800 + Solyc06g005060.2.1.2Solyc06g005060.2.1
37935 3P_UTR 37801 38132 + Solyc06g005060.2.1.0Solyc06g005060.2.1
167942 5P_UTR 167930 167956 - Solyc06g005140.2.1.0Solyc06g005140.2.1
168020 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
168153 INTRON 167957 169025 - Solyc06g005140.2.1.2Solyc06g005140.2.1
Thank you so much for all your help!
A small adjustment to my initial question, because it requires far more computing time than I anticipated.
Is it possible, instead of writing all the records that fall in a particular window to its own file, to write the 'statistics' of each window to a row in a table? By statistics I mean how many records a particular window contains in total and how many of each type. Applied to the example above, this would look like this:
> python script.py 160000 10000 file (using the script from sidharth c nadhan)
> cat result
window | total | EXON | INTRON | 3P_UTR | 5P_UTR
1      | 17    | 6    | 5      | 1      | 0
10001  | 12    | 4    | 4      | 1      | 1
Try this :
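import sys
import collections

# Usage: python script.py window overlap file maxim
wind, overl, maxim = int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[4])
step = wind - overl                   # distance between consecutive window starts
starts = range(1, maxim + 1, step)    # window i covers positions i .. i + wind - 1

types = set()                               # feature types seen in column 2
windows = collections.defaultdict(list)     # window start -> lines in that window

with open(sys.argv[3]) as handle:
    for line in handle:
        fields = line.split()
        if not fields:
            continue                        # skip blank lines
        if len(fields) > 1:
            types.add(fields[1])
        pos = int(fields[0])
        for i in starts:
            if i <= pos < i + wind:         # direct comparison, no scan of the range
                windows[i].append(line)

# Print one tab-separated row per window: start, total, then one count per type.
types = sorted(types)
print("\t".join(["Window", "Total"] + types))
for j in starts:
    counts = collections.Counter()
    for line in windows[j]:
        fields = line.split()
        if len(fields) > 1:
            counts[fields[1]] += 1
    row = [j, len(windows[j])] + [counts[t] for t in types]
    print("\t".join(str(value) for value in row))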
Usage: python script.py window overlap file maxim
Here maxim is the largest integer in the input file (maxim = 168153 for the sample input file). Giving it as a command-line argument improves computational speed.
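If the loop over all windows for every line turns out to be the slow part, one possible alternative is to compute the windows that cover a position arithmetically, which also removes the need for maxim. This is only a sketch and not the script above: the function name `window_stats` is hypothetical, it assumes windows start at 1 and advance by window minus overlap, and it only reports windows that contain at least one record.

```python
import collections


def window_stats(lines, windowsize, overlap):
    """Return (totals, by_type): record counts per window start.

    totals[start] is the number of records in the window starting at start;
    by_type[start][label] counts records whose second column equals label.
    """
    step = windowsize - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than windowsize")
    totals = collections.Counter()
    by_type = collections.defaultdict(collections.Counter)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        pos = int(fields[0])
        # Window k starts at 1 + k * step and covers pos when
        # 1 + k * step <= pos <= k * step + windowsize.
        k_max = (pos - 1) // step
        k_min = max(0, -((windowsize - pos) // step))  # ceil((pos - windowsize) / step)
        for k in range(k_min, k_max + 1):
            start = 1 + k * step
            totals[start] += 1
            if len(fields) > 1:
                by_type[start][fields[1]] += 1
    return totals, by_type


# A few records taken from the sample input file.
sample = [
    "3579",
    "3752 EXON 3706 4407 + Solyc06g005000.2.1.1Solyc06g005000.2.1",
    "6734 INTRON 5605 7662 + Solyc06g005000.2.1.2Solyc06g005000.2.1",
    "167942 5P_UTR 167930 167956 - Solyc06g005140.2.1.0Solyc06g005140.2.1",
]
totals, by_type = window_stats(sample, windowsize=50000, overlap=0)
for start in sorted(totals):
    print(start, totals[start], dict(by_type[start]))
```

Because each line only touches the windows that actually contain it, the work per line no longer grows with the number of windows. To process a real file, pass the open file object as `lines`.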