Slides for this lesson are available here
.
Estimating Admixture Graphs with qpGraph
¶
Based on previous analyses we can try to reconstruct a model that includes the described observations in an admixture graph, which combines in an unique model the relationships between multiple groups that can be single individuals or larger populations. For that we are going to use qpGraph
(also known as “ADMIXTUREGRAPH”) part of the package ADMIXTOOLS (Patterson et al. 2012. _qpGraph_ relies on an user-defined topology of the admixture graph and calculates the best-fitting admixture proportions and branch lengths based on the observed f-statistics.
Overview method¶
The general approach of an admixture graph is to reconstruct the genetic relationships between different groups through a phylogenetic tree allowing for the addition of admixture events. The method operates on a defined graph’s topology and estimates f2, f3, and f4-statistic values for all pairs, triples, and quadruples of groups, compared to the expected allele frequency correlation of the tested groups. For a given topology, qpGraph
provides branch lengths (in units of genetic drift) and mixture proportions. Groups that share a more recent common ancestor will covary more than others in their allele frequencies due to common genetic drift. We should keep in mind that the model is based on an unrooted tree and that while we show graphs with a selected outgroup as the root, the results should not depend on the root position.
Note
The reconstructed graph is never meant to reflect a comprehensive population history of the region under study but the best fitted model to the limit of the available groups and method’s resolution.
As reported in the Method section of Lipson and Reich 2017:
“Our usual strategy for building a model in ADMIXTUREGRAPH is to start with a small, well-understood subgraph and then add populations (either unadmixed or admixed) one at a time in their best-fitting positions. This involves trying different branch points for the new population and comparing the results. If a population is unadmixed, then if it is placed in the wrong position, the fit of the model will be poorer, and the inferred split point will move as far as it can in the correct direction, constrained only by the specified topology. Thus, searching over possible branching orders allows us to find a (locally) optimal topology. If no placement provides a good fit (in the sense that the residual errors are large), then we infer the presence of an admixture event, in which case we test for the best-fitting split points of the two ancestry components. After a new population is added, the topology relating the existing populations can change, so we examine the full model fit and any inferred zero-length internal branches for possible local optimizations.”
Data¶
For this session, we are going to use modern and ancient DNA data analysed in the recently published study about the settlement of the Americas Posth et al. 2018.
The genotype data is in “EIGENSTRAT” format and exclude from the “1240K panel” a subset of SNPs that are transitions in CpG sites. Those Cytosines are prone to be methylated and if deamination (the typical chemical modification of ancient DNA) occurs at the same position Cytosine is directly converted into Thymine without becoming Uracil. Thus, the resulting CtoT modification cannot be removed with an enzymatic reaction like performing uracil-DNA glycosylase (UDG) treatment. This results in additional noise in ancient DNA data that can be reduced by excluding those SNPs. This is especially relevant when analysing ancient samples that have been processed using different laboratory protocols.
The path to the data we are going to work on is the following:
/data/qpGraph/Posth2018_Americas.packedancestrymapgeno
/data/qpGraph/Posth2018_Americas.ind
/data/qpGraph/Posth2018_Americas.snp
Exercise
Check how many SNPs are in this dataset.
Preparing the parameter file¶
In order to run qpGraph
we need to prepare a parameter file that we can call “parQpgraph” and where we change “filename” with the names above:
DIR: /data/qpGraph
genotypename: DIR/<filename>.packedancestrymapgeno
snpname: DIR/<filename>.snp
indivname: DIR/<filename>.ind
outpop: NULL
useallsnps: YES
blgsize: 0.05
forcezmode: YES
lsqmode: YES
diag: .0001
bigiter: 6
hires: YES
lambdascale: 1
Let’s go through the most relevant parameter options. outpop: NULL
does not use an outgroup population to normalize f-stats by heterozygosity e.g. selecting a group in the graph in which SNPs must be polymorphic. useallsnps: YES
each comparison uses all SNPs overlapping in that specific test, otherwise the program looks only at the overlapping SNPs between all groups. blgsize: 0.05
is the block size in Morgans for Jackknife. diag: .0001
uses the entire matrix form of the objective function to avoid the basis dependence of the least-squares version of the computation. lambdascale: 1
in order to preserve the standard scaling of the f-statistics without an extra denominator. lsqmode: YES
otherwise unstable for large graphs. hires: YES
controls output when more decimals are desired.
Preparing the graph topology¶
We start by constructing a scaffold graph based on previously published studies (Lipson and Reich 2017, Moreno-Mayar et al. 2018 and Scheib et al. 2018). The following can be saved in a file called “Figure3a”:
root R
label Mbuti.DG Mbuti.DG
label Han.DG Han.DG
label Onge.DG Onge.DG
label MA1 MA1
label USR1 USR1
label USA_SanNicolas_4900BP USA_SanNicolas_4900BP
label Canada_Lucier_4800BP_500BP Canada_Lucier_4800BP_500BP
edge a R Mbuti.DG
edge b R nonAfrica
edge c1 nonAfrica EastEurasia
edge c2 EastEurasia EastAsia
edge c3 EastEurasia Onge.DG
edge c4 EastAsia EastAsia2
edge c5 EastAsia EastAsia3
edge c6 EastAsia2 Han.DG
edge c7 EastAsia2 EastAsia4
edge c8 nonAfrica WestEurasia
edge c9 WestEurasia E_HG2
edge c10 WestEurasia E_HG3
admix ANE E_HG2 EastAsia3
edge c11 ANE MA1
admix FA EastAsia4 E_HG3
edge c12 FA Beringia
edge c13 Beringia USR1
edge c14 Beringia NA
edge c15 NA Canada_Lucier_4800BP_500BP
edge c16 NA NA2
edge c17 NA2 USA_SanNicolas_4900BP
Running the program¶
To run qpGraph
we need to specify the parameter file “parQpgraph” and three output files that are .ggg
, .dot
and .out
:
qpGraph -p parQpgraph -g Figure3a -o Figure3a.ggg -d Figure3a.dot > Figure3a.out
Note
Running qpGraph
with this dataset takes 1-2 minutes.
Reading the output files¶
First let’s inspect the .out
file. If we open that on the terminal with less Figure3a.out
you can directly go to the bottom of the file with shift-g
sequence. Among other values we should see reported the total number of individuals used for all groups indivs
and the total number of snps
available from at least one individual.
Note
as mentioned above, not all those SNPs are used for each f-statistic
outpop: NULL
population: 0 Mbuti.DG 5
population: 1 Han.DG 4
population: 2 Onge.DG 2
population: 3 MA1 1
population: 4 USR1 1
population: 5 USA_SanNicolas_4900BP 17
population: 6 Canada_Lucier_4800BP_500BP 6
before setwt numsnps: 882908 outpop: NULL
setwt numsnps: 681581
number of blocks for moving block jackknife: 713
snps: 681581 indivs: 36
lambdascale: 1.000
At the very bottom of the .out
file are reported the outlier f4-statistics, which show the lowest or highest Z-scores. Those are calculated based on the difference between the fitted and the observed f4 values. The only worst f4-statistic identified in the model we just run is some un-modelled affinity between “USA_SanNicolas_4900BP” and “USR1” that is anyway below 3 standard deviations:
outliers:
Fit Obs Diff Std. error Z
worst f-stat: Mbu USR USA Can 0.000000 -0.001849 -0.001849 0.000681 -2.717
We can now visualize the graph in the .dot
file using the program graphviz with the following command and then look at the resulting .png
from the jupyter file browser. Dashed arrows represent admixture edges while solid arrows drift edges reported in units of \(\text{FST}\times 1,000\). On the very top, the worst f4-statistic is again reported:
dot -Tpng Figure3a.dot -o Figure3a.png
Finally we can have a look at the .ggg
file, which provides detailed proportions for admixture edges and drift lengths for each branch.
Note
generally it is important to not have zero-length edges because it might signify that the modelled edge does not exist. Also terminal edges for ancient groups, especially if composed by a single individual, are artificially long and should not be considered.
Adding new groups to the scaffold graph¶
Once confirmed that the scaffold graph has a good fit, we carry on adding the new Central and South American groups released in Posth et. al. 2018.
To the file Figure3a
add the following labels Belize_MayahakCabPek_9300BP, PERu_Cuncaicha_9000BP, Peru_Lauricocha_8600BP
as well as the following additional edges, save it with a new name like Figure3b
and run qpGraph
as shown before:
edge c18 NA2 CA
edge c19 CA Belize_MayahakCabPek_9300BP
edge c20 CA SA
edge c21 SA SA2
edge c22 SA2 PERu_Cuncaicha_9000BP
edge c23 SA2 Peru_Lauricocha_8600BP
The worst f4-statistic is -2.809 and despite in the .dot
file once converted into .png
there are some zero-length branches a more careful examination of the .ggg
file indicates that those edges are in fact different from zero.
Continuing to fit new groups¶
Now we can create a file called Figure3c
where we add the last three group labels Brazil_LapaDoSanto_9600BP, Argentina_ArroyoSeco2_7700BP, Chile_LosRieles_5100BP
. From node “SA” add an edge to form a new node called “SA3” that splits into Brazil_LapaDoSanto_9600BP
and a new node SA4
. Finally “SA4” splits itself into the two Southern Cone populations that are Argentina_ArroyoSeco2_7700BP
and Chile_LosRieles_5100BP
. After running it we can visualise the resulting .dot
file as a .png
. That is the final graph reported in Figure 3 of Posth et al. 2018!
Test the robustness of the graph topology¶
Starting from the final graph Figure3c
we can try, for example, to invert Belize_MayahakCabPek_9300BP
with USA_SanNicolas_4900BP
in a file called Figure3c.v2
to test for branching patterns between North and Central American groups. For more advanced modelling we can instead invert the entire “SA3” node with Belize_MayahakCabPek_9300BP
and call the file Figure3c.v3
to test for Central-South America branching patterns.
Exercise::
What do you observe when inspecting the respective .out
files? Which of the models fit and which not? How do you interpret that?
Adding admixture edges¶
We finally want to add in our working graph the oldest genome published so far from South America called CHIle_LosRieles_10900BP
. We initially try to position it as departing from each node without invoking admixture. One example is the following FigureS5a
file that we can copy and run as seen before:
root R
label Mbuti.DG Mbuti.DG
label Han.DG Han.DG
label Onge.DG Onge.DG
label MA1 MA1
label USR1 USR1
label USA_SanNicolas_4900BP USA_SanNicolas_4900BP
label Canada_Lucier_4800BP_500BP Canada_Lucier_4800BP_500BP
label Belize_MayahakCabPek_9300BP Belize_MayahakCabPek_9300BP
label PERu_Cuncaicha_9000BP PERu_Cuncaicha_9000BP
label Peru_Lauricocha_8600BP Peru_Lauricocha_8600BP
label Brazil_LapaDoSanto_9600BP Brazil_LapaDoSanto_9600BP
label Argentina_ArroyoSeco2_7700BP Argentina_ArroyoSeco2_7700BP
label Chile_LosRieles_5100BP Chile_LosRieles_5100BP
label CHIle_LosRieles_10900BP CHIle_LosRieles_10900BP
edge a R Mbuti.DG
edge b R nonAfrica
edge c1 nonAfrica EastEurasia
edge c2 EastEurasia EastAsia
edge c3 EastEurasia Onge.DG
edge c4 EastAsia EastAsia2
edge c5 EastAsia EastAsia3
edge c6 EastAsia2 Han.DG
edge c7 EastAsia2 EastAsia4
edge c8 nonAfrica WestEurasia
edge c9 WestEurasia E_HG2
edge c10 WestEurasia E_HG3
admix ANE E_HG2 EastAsia3
edge c11 ANE MA1
admix FA EastAsia4 E_HG3
edge c12 FA Beringia
edge c13 Beringia USR1
edge c14 Beringia NA
edge c15 NA Canada_Lucier_4800BP_500BP
edge c16 NA NA2
edge c17 NA2 USA_SanNicolas_4900BP
edge c18 NA2 CA
edge c19 CA Belize_MayahakCabPek_9300BP
edge c20 CA SA
edge c21 SA SA2
edge c22 SA SA3
edge c23 SA2 SA6
edge c24 SA2 CHIle_LosRieles_10900BP
edge c25 SA6 PERu_Cuncaicha_9000BP
edge c26 SA6 Peru_Lauricocha_8600BP
edge c27 SA3 Brazil_LapaDoSanto_9600BP
edge c28 SA3 SA4
edge c29 SA4 Argentina_ArroyoSeco2_7700BP
edge c30 SA4 Chile_LosRieles_5100BP
At the bottom of the newly produced .out
file there are several f4-statistics that have Z-scores below -3 or above 3. The worst one is the following statistics that indicates some un-modelled affinity between the two Chilean samples CHIle_LosRieles_10900BP
and Chile_LosRieles_5100BP
:
worst f-stat: Han Chi Bra CHI 0.000000 0.003372 0.003372 0.000862 3.912
We can then model a contribution from the oldest Chilean individual into the younger one. Change the last part of FigureS5a
with the following edges and one admixture event, save it as FigureS5a.v2
and run it again:
edge c24 SA2 SA7
edge c25 SA7 CHIle_LosRieles_10900BP
edge c26 SA6 PERu_Cuncaicha_9000BP
edge c27 SA6 Peru_Lauricocha_8600BP
edge c28 SA3 Brazil_LapaDoSanto_9600BP
edge c29 SA3 SA4
edge c30 SA4 Argentina_ArroyoSeco2_7700BP
edge c31 SA4 SA5
admix SA8 SA7 SA5
edge c32 SA8 Chile_LosRieles_5100BP
In the .out
file we see that most of the outlier f4-statistics are gone while the worst statistic is still present but reduced (Zscore=3.2). This suggests the presence of un-modelled affinity between the two oldest South American groups Brazil_LapaDoSanto_9600BP
and CHIle_LosRieles_10900BP
, that might represent a shared Anzick-1
-related ancestry that we investigate in detail in Figure 4 and Figure 5 of Posth et al. 2018.
The resulting admixture graph suggests that a component from the oldest Chilean individual contributed at least marginally to the younger individual despite being more than 5,000 years apart!