Search code examples
pythondictionarybioinformaticsbiopythonvcf-variant-call-format

Indicating a population structure in EggLib Python


In Python, I am using EggLib. I am trying to calculate Jost's D value per SNP found in a VCF file.

Data

Data is here in VCF format. The data set is small, there are 2 populations, 100 individuals per population and 6 SNPs (all on chromosome 1).

Each individual is named Pp.Ii, where p is the population index it belongs to and i is the individual index.

Code

My difficulties concern the specification of the population structure. Here is my trial

### Read the vcf file ###
vcf = egglib.io.VcfParser("MyData.vcf") 

### Create the `Structure` object ###
# Dictionary for a given cluster. There is only one cluster.
dcluster = {}            
# Loop through each population 
for popIndex in [0,1]:  
    # dictionnary for a given population. There are two populations
    dpop = {}            
    # Loop through each individual
    for IndIndex in range(popIndex * 100,(popIndex + 1) * 100):     
            # A single list to define an individual
        dpop[IndIndex] = [IndIndex*2, IndIndex*2 + 1]
    dcluster[popIndex] = dpop

struct = {0: dcluster}

### Define the population structure ###
Structure = egglib.stats.make_structure(struct, None) 

### Configurate the 'ComputeStats' object ###
cs = egglib.stats.ComputeStats()
cs.configure(only_diallelic=False)
cs.add_stats('Dj') # Jost's D

### Isolate a SNP ###
vcf.next()
site = egglib.stats.site_from_vcf(vcf)

### Calculate Jost's D ###
cs.process_site(site, struct=Structure)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/egglib/stats/_cstats.py", line 431, in process_site
    self._frq.process_site(site, struct=struct)
  File "/Library/Python/2.7/site-packages/egglib/stats/_freq.py", line 159, in process_site
    if sum(struct) != site._obj.get_ning(): raise ValueError, 'invalid structure (sample size is required to match)'
ValueError: invalid structure (sample size is required to match)

The documentation indicates here

[The Structure object] is a tuple containing two items, each being a dict. The first one represents the ingroup and the second represents the outgroup.

The ingroup dictionary is itself a dictionary holding more dictionaries, one for each cluster of populations. Each cluster dictionary is a dictionary of populations, populations being themselves represented by a dictionary. A population dictionary is, again, a dictionary of individuals. Fortunately, individuals are represented by lists.

An individual list contains the index of all samples belonging to this individual. For haploid data, individuals will be one-item lists. In other cases, all individual lists are required to have the same number of items (consistent ploidy). Note that, if the ploidy is more than one, nothing enforces that samples of a given individual are grouped within the original data.

The keys of the ingroup dictionary are the labels identifying each cluster. Within a cluster dictionary, the keys are population labels. Finally, within a population dictionary, the keys are individual labels.

The second dictionary represents the outgroup. Its structure is simpler: it has individual labels as keys, and lists of corresponding sample indexes as values. The outgroup dictionary is similar to any ingroup population dictionary. The ploidy is required to match over all ingroup and outgroup individuals.

but I fail to make sense of it. The example provided is for fasta format and I don't understand to extend the logic to VCF format.


Solution

  • There are two errors

    First error

    The function make_structure returns the Structure object but does not save it within stats. You therefore have to save this output and use it in the function process_site.

    Structure = egglib.stats.make_structure(struct, None) 
    

    Second error

    The Structure object must designate haploids. Therefore, create the dictionary as

    dcluster = {}            
    for popIndex in [0,1]:  
        dpop = {}            
        for IndIndex in range(popIndex * 100,(popIndex + 1) * 100):     
            dpop[IndIndex] = [IndIndex]
        dcluster[popIndex] = dpop
    
    struct = {0: dcluster}