Slides for this lesson are available here.

Estimating Admixture Graphs with qpGraph

Based on previous analyses we can try to reconstruct a model that includes the described observations in an admixture graph, which combines in an unique model the relationships between multiple groups that can be single individuals or larger populations. For that we are going to use qpGraph (also known as “ADMIXTUREGRAPH”) part of the package ADMIXTOOLS (Patterson et al. 2012. _qpGraph_ relies on an user-defined topology of the admixture graph and calculates the best-fitting admixture proportions and branch lengths based on the observed f-statistics.

Overview method

The general approach of an admixture graph is to reconstruct the genetic relationships between different groups through a phylogenetic tree allowing for the addition of admixture events. The method operates on a defined graph’s topology and estimates f2, f3, and f4-statistic values for all pairs, triples, and quadruples of groups, compared to the expected allele frequency correlation of the tested groups. For a given topology, qpGraph provides branch lengths (in units of genetic drift) and mixture proportions. Groups that share a more recent common ancestor will covary more than others in their allele frequencies due to common genetic drift. We should keep in mind that the model is based on an unrooted tree and that while we show graphs with a selected outgroup as the root, the results should not depend on the root position.

Note

The reconstructed graph is never meant to reflect a comprehensive population history of the region under study but the best fitted model to the limit of the available groups and method’s resolution.

As reported in the Method section of Lipson and Reich 2017:

“Our usual strategy for building a model in ADMIXTUREGRAPH is to start with a small, well-understood subgraph and then add populations (either unadmixed or admixed) one at a time in their best-fitting positions. This involves trying different branch points for the new population and comparing the results. If a population is unadmixed, then if it is placed in the wrong position, the fit of the model will be poorer, and the inferred split point will move as far as it can in the correct direction, constrained only by the specified topology. Thus, searching over possible branching orders allows us to find a (locally) optimal topology. If no placement provides a good fit (in the sense that the residual errors are large), then we infer the presence of an admixture event, in which case we test for the best-fitting split points of the two ancestry components. After a new population is added, the topology relating the existing populations can change, so we examine the full model fit and any inferred zero-length internal branches for possible local optimizations.”

Data

For this session, we are going to use modern and ancient DNA data analysed in the recently published study about the settlement of the Americas Posth et al. 2018.

The genotype data is in “EIGENSTRAT” format and exclude from the “1240K panel” a subset of SNPs that are transitions in CpG sites. Those Cytosines are prone to be methylated and if deamination (the typical chemical modification of ancient DNA) occurs at the same position Cytosine is directly converted into Thymine without becoming Uracil. Thus, the resulting CtoT modification cannot be removed with an enzymatic reaction like performing uracil-DNA glycosylase (UDG) treatment. This results in additional noise in ancient DNA data that can be reduced by excluding those SNPs. This is especially relevant when analysing ancient samples that have been processed using different laboratory protocols.

The path to the data we are going to work on is the following:

/data/qpGraph/Posth2018_Americas.packedancestrymapgeno
/data/qpGraph/Posth2018_Americas.ind
/data/qpGraph/Posth2018_Americas.snp

Exercise

Check how many SNPs are in this dataset.

Preparing the parameter file

In order to run qpGraph we need to prepare a parameter file that we can call “parQpgraph” and where we change “filename” with the names above:

DIR: /data/qpGraph
genotypename: DIR/<filename>.packedancestrymapgeno
snpname: DIR/<filename>.snp
indivname: DIR/<filename>.ind
outpop:  NULL
useallsnps: YES
blgsize: 0.05
forcezmode: YES
lsqmode: YES
diag:  .0001
bigiter: 6
hires: YES
lambdascale: 1

Let’s go through the most relevant parameter options. outpop: NULL does not use an outgroup population to normalize f-stats by heterozygosity e.g. selecting a group in the graph in which SNPs must be polymorphic. useallsnps: YES each comparison uses all SNPs overlapping in that specific test, otherwise the program looks only at the overlapping SNPs between all groups. blgsize: 0.05 is the block size in Morgans for Jackknife. diag:  .0001 uses the entire matrix form of the objective function to avoid the basis dependence of the least-squares version of the computation. lambdascale: 1 in order to preserve the standard scaling of the f-statistics without an extra denominator. lsqmode: YES otherwise unstable for large graphs. hires: YES controls output when more decimals are desired.

Preparing the graph topology

We start by constructing a scaffold graph based on previously published studies (Lipson and Reich 2017, Moreno-Mayar et al. 2018 and Scheib et al. 2018). The following can be saved in a file called “Figure3a”:

root    R
label    Mbuti.DG    Mbuti.DG
label    Han.DG    Han.DG
label    Onge.DG    Onge.DG
label    MA1    MA1
label    USR1   USR1
label    USA_SanNicolas_4900BP    USA_SanNicolas_4900BP
label    Canada_Lucier_4800BP_500BP    Canada_Lucier_4800BP_500BP

edge    a R Mbuti.DG
edge    b R nonAfrica
edge    c1 nonAfrica EastEurasia
edge    c2 EastEurasia EastAsia
edge    c3 EastEurasia Onge.DG
edge    c4 EastAsia EastAsia2
edge    c5 EastAsia EastAsia3
edge    c6 EastAsia2 Han.DG
edge    c7 EastAsia2 EastAsia4
edge    c8 nonAfrica WestEurasia
edge    c9 WestEurasia E_HG2
edge    c10 WestEurasia E_HG3
admix   ANE E_HG2 EastAsia3
edge    c11 ANE MA1
admix   FA EastAsia4 E_HG3
edge    c12 FA Beringia
edge    c13 Beringia USR1
edge    c14 Beringia NA
edge    c15 NA Canada_Lucier_4800BP_500BP
edge    c16 NA NA2
edge    c17 NA2 USA_SanNicolas_4900BP

Running the program

To run qpGraph we need to specify the parameter file “parQpgraph” and three output files that are .ggg , .dot and .out:

qpGraph -p parQpgraph -g Figure3a -o Figure3a.ggg -d Figure3a.dot > Figure3a.out

Note

Running qpGraph with this dataset takes 1-2 minutes.

Reading the output files

First let’s inspect the .out file. If we open that on the terminal with less Figure3a.out you can directly go to the bottom of the file with shift-g sequence. Among other values we should see reported the total number of individuals used for all groups indivs and the total number of snps available from at least one individual.

Note

as mentioned above, not all those SNPs are used for each f-statistic

outpop: NULL
population:   0             Mbuti.DG    5
population:   1               Han.DG    4
population:   2              Onge.DG    2
population:   3                  MA1    1
population:   4                 USR1    1
population:   5 USA_SanNicolas_4900BP   17
population:   6 Canada_Lucier_4800BP_500BP    6
before setwt numsnps: 882908  outpop: NULL
setwt numsnps: 681581
number of blocks for moving block jackknife: 713
snps: 681581  indivs: 36
lambdascale:     1.000

At the very bottom of the .out file are reported the outlier f4-statistics, which show the lowest or highest Z-scores. Those are calculated based on the difference between the fitted and the observed f4 values. The only worst f4-statistic identified in the model we just run is some un-modelled affinity between “USA_SanNicolas_4900BP” and “USR1” that is anyway below 3 standard deviations:

outliers:
                                                      Fit          Obs         Diff   Std. error         Z


worst f-stat:       Mbu        USR        USA        Can       0.000000    -0.001849    -0.001849     0.000681    -2.717

We can now visualize the graph in the .dot file using the program graphviz with the following command and then look at the resulting .png from the jupyter file browser. Dashed arrows represent admixture edges while solid arrows drift edges reported in units of \(\text{FST}\times 1,000\). On the very top, the worst f4-statistic is again reported:

dot -Tpng Figure3a.dot -o Figure3a.png

Finally we can have a look at the .ggg file, which provides detailed proportions for admixture edges and drift lengths for each branch.

Note

generally it is important to not have zero-length edges because it might signify that the modelled edge does not exist. Also terminal edges for ancient groups, especially if composed by a single individual, are artificially long and should not be considered.

Adding new groups to the scaffold graph

Once confirmed that the scaffold graph has a good fit, we carry on adding the new Central and South American groups released in Posth et. al. 2018. To the file Figure3a add the following labels Belize_MayahakCabPek_9300BP, PERu_Cuncaicha_9000BP, Peru_Lauricocha_8600BP as well as the following additional edges, save it with a new name like Figure3b and run qpGraph as shown before:

edge    c18 NA2 CA
edge    c19 CA Belize_MayahakCabPek_9300BP
edge    c20 CA SA
edge    c21 SA SA2
edge    c22 SA2 PERu_Cuncaicha_9000BP
edge    c23 SA2 Peru_Lauricocha_8600BP

The worst f4-statistic is -2.809 and despite in the .dot file once converted into .png there are some zero-length branches a more careful examination of the .ggg file indicates that those edges are in fact different from zero.

Continuing to fit new groups

Now we can create a file called Figure3c where we add the last three group labels Brazil_LapaDoSanto_9600BP, Argentina_ArroyoSeco2_7700BP, Chile_LosRieles_5100BP. From node “SA” add an edge to form a new node called “SA3” that splits into Brazil_LapaDoSanto_9600BP and a new node SA4. Finally “SA4” splits itself into the two Southern Cone populations that are Argentina_ArroyoSeco2_7700BP and Chile_LosRieles_5100BP. After running it we can visualise the resulting .dot file as a .png. That is the final graph reported in Figure 3 of Posth et al. 2018!

Test the robustness of the graph topology

Starting from the final graph Figure3c we can try, for example, to invert Belize_MayahakCabPek_9300BP with USA_SanNicolas_4900BP in a file called Figure3c.v2 to test for branching patterns between North and Central American groups. For more advanced modelling we can instead invert the entire “SA3” node with Belize_MayahakCabPek_9300BP and call the file Figure3c.v3 to test for Central-South America branching patterns.

Exercise::

What do you observe when inspecting the respective .out files? Which of the models fit and which not? How do you interpret that?

Adding admixture edges

We finally want to add in our working graph the oldest genome published so far from South America called CHIle_LosRieles_10900BP. We initially try to position it as departing from each node without invoking admixture. One example is the following FigureS5a file that we can copy and run as seen before:

root    R
label    Mbuti.DG    Mbuti.DG
label    Han.DG    Han.DG
label    Onge.DG    Onge.DG
label    MA1    MA1
label    USR1    USR1
label    USA_SanNicolas_4900BP    USA_SanNicolas_4900BP
label    Canada_Lucier_4800BP_500BP    Canada_Lucier_4800BP_500BP
label    Belize_MayahakCabPek_9300BP    Belize_MayahakCabPek_9300BP
label    PERu_Cuncaicha_9000BP    PERu_Cuncaicha_9000BP
label    Peru_Lauricocha_8600BP    Peru_Lauricocha_8600BP
label    Brazil_LapaDoSanto_9600BP    Brazil_LapaDoSanto_9600BP
label    Argentina_ArroyoSeco2_7700BP    Argentina_ArroyoSeco2_7700BP
label    Chile_LosRieles_5100BP    Chile_LosRieles_5100BP
label    CHIle_LosRieles_10900BP    CHIle_LosRieles_10900BP

edge    a R Mbuti.DG
edge    b R nonAfrica
edge    c1 nonAfrica EastEurasia
edge    c2 EastEurasia EastAsia
edge    c3 EastEurasia Onge.DG
edge    c4 EastAsia EastAsia2
edge    c5 EastAsia EastAsia3
edge    c6 EastAsia2 Han.DG
edge    c7 EastAsia2 EastAsia4
edge    c8 nonAfrica WestEurasia
edge    c9 WestEurasia E_HG2
edge    c10 WestEurasia E_HG3
admix   ANE E_HG2 EastAsia3
edge    c11 ANE MA1
admix   FA EastAsia4 E_HG3
edge    c12 FA Beringia
edge    c13 Beringia USR1
edge    c14 Beringia NA
edge    c15 NA Canada_Lucier_4800BP_500BP
edge    c16 NA NA2
edge    c17 NA2 USA_SanNicolas_4900BP
edge    c18 NA2 CA
edge    c19 CA Belize_MayahakCabPek_9300BP
edge    c20 CA SA
edge    c21 SA SA2
edge    c22 SA SA3
edge    c23 SA2 SA6
edge    c24 SA2 CHIle_LosRieles_10900BP
edge    c25 SA6 PERu_Cuncaicha_9000BP
edge    c26 SA6 Peru_Lauricocha_8600BP
edge    c27 SA3 Brazil_LapaDoSanto_9600BP
edge    c28 SA3 SA4
edge    c29 SA4 Argentina_ArroyoSeco2_7700BP
edge    c30 SA4 Chile_LosRieles_5100BP

At the bottom of the newly produced .out file there are several f4-statistics that have Z-scores below -3 or above 3. The worst one is the following statistics that indicates some un-modelled affinity between the two Chilean samples CHIle_LosRieles_10900BP and Chile_LosRieles_5100BP:

worst f-stat:       Han        Chi        Bra        CHI       0.000000     0.003372     0.003372     0.000862     3.912

We can then model a contribution from the oldest Chilean individual into the younger one. Change the last part of FigureS5a with the following edges and one admixture event, save it as FigureS5a.v2 and run it again:

edge    c24 SA2 SA7
edge    c25 SA7 CHIle_LosRieles_10900BP
edge    c26 SA6 PERu_Cuncaicha_9000BP
edge    c27 SA6 Peru_Lauricocha_8600BP
edge    c28 SA3 Brazil_LapaDoSanto_9600BP
edge    c29 SA3 SA4
edge    c30 SA4 Argentina_ArroyoSeco2_7700BP
edge    c31 SA4 SA5
admix   SA8 SA7 SA5
edge    c32 SA8 Chile_LosRieles_5100BP

In the .out file we see that most of the outlier f4-statistics are gone while the worst statistic is still present but reduced (Zscore=3.2). This suggests the presence of un-modelled affinity between the two oldest South American groups Brazil_LapaDoSanto_9600BP and CHIle_LosRieles_10900BP, that might represent a shared Anzick-1-related ancestry that we investigate in detail in Figure 4 and Figure 5 of Posth et al. 2018. The resulting admixture graph suggests that a component from the oldest Chilean individual contributed at least marginally to the younger individual despite being more than 5,000 years apart!