In Python, I am using EggLib. I am trying to calculate Jost's D value per SNP found in a VCF file.
Data
Data is here in VCF format. The data set is small, there are 2 populations, 100 individuals per population and 6 SNPs (all on chromosome 1).
Each individual is named Pp.Ii
, where p
is the population index it belongs to and i
is the individual index.
Code
My difficulties concern the specification of the population structure. Here is my trial
### Read the vcf file ###
vcf = egglib.io.VcfParser("MyData.vcf")
### Create the `Structure` object ###
# Dictionary for a given cluster. There is only one cluster.
dcluster = {}
# Loop through each population
for popIndex in [0,1]:
# dictionnary for a given population. There are two populations
dpop = {}
# Loop through each individual
for IndIndex in range(popIndex * 100,(popIndex + 1) * 100):
# A single list to define an individual
dpop[IndIndex] = [IndIndex*2, IndIndex*2 + 1]
dcluster[popIndex] = dpop
struct = {0: dcluster}
### Define the population structure ###
Structure = egglib.stats.make_structure(struct, None)
### Configurate the 'ComputeStats' object ###
cs = egglib.stats.ComputeStats()
cs.configure(only_diallelic=False)
cs.add_stats('Dj') # Jost's D
### Isolate a SNP ###
vcf.next()
site = egglib.stats.site_from_vcf(vcf)
### Calculate Jost's D ###
cs.process_site(site, struct=Structure)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/egglib/stats/_cstats.py", line 431, in process_site
self._frq.process_site(site, struct=struct)
File "/Library/Python/2.7/site-packages/egglib/stats/_freq.py", line 159, in process_site
if sum(struct) != site._obj.get_ning(): raise ValueError, 'invalid structure (sample size is required to match)'
ValueError: invalid structure (sample size is required to match)
The documentation indicates here
[The Structure object] is a tuple containing two items, each being a dict. The first one represents the ingroup and the second represents the outgroup.
The ingroup dictionary is itself a dictionary holding more dictionaries, one for each cluster of populations. Each cluster dictionary is a dictionary of populations, populations being themselves represented by a dictionary. A population dictionary is, again, a dictionary of individuals. Fortunately, individuals are represented by lists.
An individual list contains the index of all samples belonging to this individual. For haploid data, individuals will be one-item lists. In other cases, all individual lists are required to have the same number of items (consistent ploidy). Note that, if the ploidy is more than one, nothing enforces that samples of a given individual are grouped within the original data.
The keys of the ingroup dictionary are the labels identifying each cluster. Within a cluster dictionary, the keys are population labels. Finally, within a population dictionary, the keys are individual labels.
The second dictionary represents the outgroup. Its structure is simpler: it has individual labels as keys, and lists of corresponding sample indexes as values. The outgroup dictionary is similar to any ingroup population dictionary. The ploidy is required to match over all ingroup and outgroup individuals.
but I fail to make sense of it. The example provided is for fasta format and I don't understand to extend the logic to VCF format.
There are two errors
First error
The function make_structure
returns the Structure object but does not save it within stats
. You therefore have to save this output and use it in the function process_site
.
Structure = egglib.stats.make_structure(struct, None)
Second error
The Structure object must designate haploids. Therefore, create the dictionary as
dcluster = {}
for popIndex in [0,1]:
dpop = {}
for IndIndex in range(popIndex * 100,(popIndex + 1) * 100):
dpop[IndIndex] = [IndIndex]
dcluster[popIndex] = dpop
struct = {0: dcluster}