Studying the Genetics of Participation Using Footprints Left on the Ascertained Genotypes

07/17/2023

nature.com

Genetic data

The UKBB has a Research Tissue Bank approval (Research Ethics Committee reference 21/NW/0157) from the North West Multicenter Research Ethics Committee, and all participants gave informed consent. For our data application, we used the UKBB 500K data release previously described by Bycroft et al.¹³. We filtered out individuals that had withdrawn consent, were not in the kinship inference and phasing input, had a duplicate/twin in sample, with excess of third degree relatives, with missing rate above 2% as well as heterozygosity and missingness outliers and those who showed potential sex chromosome aneuploidy or potential sex mismatch.

The current analysis started with 658,565 biallelic sequence variants in the UKBB phased haplotype data¹³. We refer to them as SNPs even though a very small fraction are short indels. Compared with standard GWAS, some of our analyses are more sensitive to data artefacts. Thus, we applied a number of filters to the SNPs in addition to those applied by Bycroft et al.¹³ (see Supplementary Note section 4 for details). Together with the trimming described below, association results were obtained for 500,632 high-quality SNPs.

Identifying relatives and the group of unrelateds

Using kinship coefficients from the UKBB data release, we identified 16,668 White British sib-pairs (42.4% male, mean (year of birth (YOB)) = 1950.8), and 4,427 White British parent–offspring pairs (parents: 31.0% male, mean (YOB) = 1941.7; offsprings: 38.9% male, mean (YOB) = 1965.0, Supplementary Note section 1). White British descent was determined from self-identified ancestry and PC analysis¹³. For sibling-ships with more than two siblings, the first two participating siblings were chosen. Sib-pairs and parent–offspring pairs were chosen to have no overlap. Within the White British subset, 272,409 individuals have no relatives within the UKBB of third degree or closer (46.7% male, mean (YOB) = 1951.3). Those are referred to as the ‘unrelateds’ here.

Inferring IBD segments and trimming

We inferred IBD segments for sibling pairs using snipar⁹, a program designed for sib-pairs. Compared with KING¹⁸ (v.2.2.4) a program not tailored for sib-pairs, snipar entails a substantial improvement in IBD estimation (Extended Data Fig. 2). Neither program uses phasing information as input. Considering the uncertainty around recombination events, we trimmed away 250 SNPs from the beginnings and ends of the inferred IBD segments before tabulating allele frequencies (Extended Data Fig. 3). This removed 11,000 SNPs all together (2 × 250 × 22) and reduced the sample sizes for the remaining SNPs (median reduction of 1,565 sibling pairs).

Inferring shared allele

For a biallelic SNP and a relative pair, there are five possible unordered combinations of genotypes for IBD1. The IBD shared allele is clear except when both individuals are heterozygous. For that, the phasing information provided¹³ was utilized. The closest neighboring SNP, for which one individual was heterozygous and the other homozygous, was identified and used to determine the shared haplotype and, through that, the shared allele of the target SNP (Extended Data Fig. 4).

Test statistics

For each SNP, we computed separate t-statistics for TNTC, WSPC and BSPC, by dividing each of the frequency differences, equations (1) to (3), by its s.e. For TNTC and WSPC, the s.e. values were computed assuming the shared versus not-shared frequency differences of each pair were independent of each other. For BSPC, for every SNP, the allele frequencies computed for individual sibling pairs, whether IBD0 or IBD2 pairs, were assumed to be independent of each other. No further assumptions, for example, HWE or independence of genotypes of two siblings in the IBD0 case, were made. Essentially, TNTC and WSPC were treated as one-sample t-tests (of differences), and BSPC were treated as a two-sample t-test. Specific equations are given in Supplementary Note section 2, which also contains information on sample sizes.

Sources of errors inducing the major-allele bias in the test statistics

When the initial results for all the SNPs were examined together, the t-statistics showed a tendency towards favouring the major allele as having higher frequency in the shared alleles. We identified three sources to this bias:

IBD estimation errors

For every sib-pair, the analysis starts with determining the IBD status for each SNP. Error here could lead to biases. This problem was addressed by replacing the KING¹⁸ program by snipar⁹, as noted above. Together with the trimming noted above, the impact of IBD estimation errors seems to be mostly eliminated (see Supplementary Note section 3 for further discussion).

Phasing errors

Phasing errors can induce a systematic major-allele bias in the WSPC and TNTC t-statistics. We focus more on WSPC below as it captures essential the same real effects as BSPC. The induced bias on TNTC is similar.

For TNTC and WSPC, when the genotypes of both relatives are heterozygous, to determine the shared allele, we use phased haplotypes that include neighboring SNPs. Phasing errors can thus induce a bias. Let the frequency of allele 1 be ${f}_{{\mathrm{pop}}}=f.$ For simplicity, assume the SNP is in HWE in the population and not associated with participation. When two siblings share one allele IBD and both are heterozygous, the chance that the shared allele is 1 is $\left(1-f\right).$ Let ${\varepsilon }_{f}$, a function of $f$, be the error rate of calling the shared allele 0 when it is actually 1. By symmetry, the error rate of calling the shared allele 1 when it is actually 0 is ${\varepsilon }_{1-f}$. Because the probability of the shared allele being actually 1 is $(1-f)$, the induced bias of the two type of errors combined is $f{\varepsilon }_{1-f}-\left(1-f\right){\varepsilon }_{f}$. For $f > 0.5$, the bias is positive if

$$\frac{{\varepsilon }_{f}}{{\varepsilon }_{1-f}} < \,\frac{f}{1-f}$$

(8)

Using data from 739 UKBB trios, where the shared allele between a parent–offspring pair in the double-heterozygotes case could be resolved by the genotype of the other parent, we estimated ${\varepsilon }_{f}$ (Extended Data Fig. 5), which satisfies equation (8). More details about the quantitative results and the induced bias on the t-statistics are given in Supplementary Note section 3. One observation, highlighted by Extended Data Fig. 6, is that, for SNPs with ${\mathrm{MAF}} > \,0.10$, around $\,50 \%$ of the observed bias of the WSPC t-statistics can be explained by the double-heterozygote errors. However, as MAF becomes small, the fraction of the empirical bias explainable by double-heterozygote errors decreases, for example, to 21% when ${\mathrm{MAF}}=\,0.01.\,$ We believe genotype-calling errors are responsible for the additional bias.

Genotyping errors

Genotyping errors can induce a systemic bias to the WSPC and TNTC statistics. Simulations show that random errors where each genotype has a small probability to be replaced by a random genotype drawn from the population would induce a bias on ${F}_{{\mathrm{IBD}}1\mathrm{S}}-{F}_{{\mathrm{IBD}}1{\mathrm{NS}}}$ which is positive if allele 1 has frequency >0.5. This happens even though such error mechanism would not even change the sampling distribution of the called genotypes. Furthermore, genotype error rate is known to be higher for SNPs with lower MAFs. We believe the main driving force of the major-allele bias is the minor allele being overcalled in the not-shared alleles (explanation in Supplementary Note section 3).

Two-step adjustment: adjustments of t-statistics and χ² statistics

For WSPC and TNTC, separately, we made a simple adjustment by regressing the t-statistics on centred allele frequency (cf), ${cf}=\left(\,f-0.5\right)$, through the origin and took the residuals as the adjusted values. This corresponds to deducting $(0.6464\times {cf}\,)$ and $(0.3781\times {cf}\,)$ from the unadjusted t-statistics of WSPC and TNTC, respectively. This adjustment avoids artificial association with other GWASs that are also subject to a major-allele bias, but resulting from a different mechanism. If the other GWASs exhibit major-allele biases for the same reasons, this adjustment would reduce, but not eliminate, an artificial association.

Effect of the major-allele bias on the χ² statistics

The errors underlying the major-allele bias of the t-statistics would also inflate the average values of the ${\chi }^{2}$ statistics. The t-statistic adjustment above will reduce, but not eliminate, this inflation. This is because the adjustment is based on allele frequency, that is, it removes the average bias of alleles with similar frequency. As SNPs with similar allele frequency will have biases that vary around the average, the variation of the adjusted t-statistics would still be inflated, and through that inflate the average value of the ${\chi }^{2}$ statistics. Notably, the average ${\chi }^{2}$ value for BSPC is 1.011 for the 500,632 SNPs, 1.025 for the 110,533 SNPs with ${\mathrm{MAF}} > 0.25$, and 1.007 for the 390,099 SNPs with ${\mathrm{MAF}} < 0.25.$ By contrast, for WSPC, the corresponding average ${\chi }^{2}$ values are 1.136, 1.039 and 1.163, respectively, without t-statistic adjustment, and 1.074, 1.029 and 1.086, respectively, after adjustment. With t-statistic adjustment, the average ${\chi }^{2}$ value for WSPC is only modestly inflated relative to BSPC for SNPs with ${\mathrm{MAF}} > \,0.25$, but remain substantially inflated for SNPs with ${\mathrm{MAF}} < \,0.25$. Moreover, while the ${\chi }^{2}$ statistics for BSCP are slightly positively correlated with MAF (r = 0.006), the ${\chi }^{2}$ statistics for WSCP, even with t-statistic adjustment, have a larger but negative correlation with MAF (r = −0.021). The latter is because the major-allele bias of an SNP increases as ${\mathrm{MAF}}$ decreases. The negative correlation between the WSPC ${\chi }^{2}$ values and ${\mathrm{MAF}}$ is problematic for the application of LD score regression as the LD scores have a substantial positive correlation ($r=0.35$) with MAF. Without further adjustments, applying LD score regression to the WSPC ${\chi }^{2}$ values gives a negative fitted slope, or a negative estimated heritability. To address this, we performed ${\mathrm{MAF}}$-specific genomic control. Specifically, we started by regressing the ${\chi }^{2}$ values computed from the adjusted t-statistics on polynomial of MAF up to the third power. The fitted values of the χ² values as a function of MAF are displayed in Extended Data Fig. 7. The ${\chi }^{2}$ values were then adjusted by dividing the original values by the fitted values. By construction, these ${\chi }^{2}$ values have average equal to 1. Taking the square-root and multiplying by the sign of the adjusted t-statistics gave the ‘final’ z-(or t-) scores of the WSPC GWAS. The same method was used for the TNTC GWAS. These z-scores were used to evaluate the significance of individual SNPs and to compute the polygenic scores used for Table 1. These adjustments reduce the impact of the artefacts, but the results are still imperfect (see Supplementary Note section 3 for further discussion). Thus, the LD score regression estimates of genetic correlations in Table 2 of the main text are based on the BSPC results only, and so is the heritability estimate given. However, the estimates in Supplementary Table 2, based on the adjusted WSPC results, are broadly consistent with those in Table 2, supporting that the adjustments have reduced the problems that arise in the application of LD score regression.

When we use the GWAS results to construct polygenic scores, we find the adjustments described above to have a very small effect on the predictive power of the polygenic scores. In particular, the polygenic score constructed from the WSPC t-statistics, with or without adjustments, has very similar predictive power as the polygenic score constructed from the BSPC t-statistics (Table 1 and Supplementary Table 1). This is because the bias is quite small per SNP for this filtered set and so would only be adding a little noise to the polygenic score prediction.

Polygenic score analysis

The pPGS was computed with PLINK 1.90 (ref. ²⁴) by summing over the weighted genotypes of the 500,632 SNPs fulfilling quality control, using the z- (or t-) statistics from the primary participation GWASs as weights. The relationship between the pPGS, standardized to have a variance of 1, and the phenotypes in Table 1 was estimated with a linear regression and logistic regression in R (v.3.4.3) in the group of White British unrelateds, taking sex, YOB, age at recruitment, genotyping array and 40 PCs into account. The quantitative phenotypes were transformed to have a variance of 1 for men and women separately (for further information see Supplementary Note section 5). To account for population stratification, the P value for each pPGS-phenotype association was adjusted through dividing the squared test statistic (t-test for quantitative phenotypes and z-test for binary phenotypes) by the LD score regression intercept estimated from GWAS summary statistics for the corresponding phenotype (see below).

LD score regression

We performed LD score regression with the program LDSC (v.1.0.1) (ref. ¹⁹) using the European Ancestry LD scores computed by the Pan-UKB team²³ (downloaded on 7 April 2021). Analyses were based on the 500,632 SNPs used to compute the pPGS.

LD score regression intercepts and genetic correlations were estimated for the primary participation test statistics described above and the phenotypes shown in Table 1. We obtained summary statistics for the phenotypes shown in Table 1 by running GWASs in the group of White British unrelated individuals in the UKBB using BOLT-LMM (v.2.3) (ref. ²⁵). For quantitative phenotypes, which had been adjusted for sex, age, YOB and 40 PCs and transformed to have variance of 1 as described in Supplementary Note section 5, genotyping array was added as an additional covariate in the GWASs. For the binary phenotypes, YOB, age at recruitment up to the order of three, 40 PCs, sex and genotyping array were added as covariates. For LD score regression, we used the standard linear regression P values, P_LINREG, from the BOLT output. Heritability of participation traits was estimated for a liability-threshold model (see below).

Liability-threshold model for participation of individuals and sib-pairs

The liability-threshold model assumes that a liability score, denoted here by $X$, underlies a 0/1 trait or response. $X$ is assumed to have a standard normal distribution (or roughly so). Let $I$ be the 0/1 participation variable, and participation rate is $P\left(I=1\right)=\alpha$. It is assumed that $I=1$ if $X > \, {\tau}$, where $\tau =\,{\Phi }^{-1}(1-\alpha )$ and $\Phi$ is the cumulative distribution of the standard normal. Correlation of participation between individuals is modeled through the correlation of their liability scores.

For $n$ sib-pairs, let $\,i=1,\ldots ,n$ index the pairs and $j=1,\,2$ index the two siblings in a pair. Focusing on one SNP with its standardized genotype denoted by g, the liability of sibling ij is modeled as

$${X}_{ij}=\,{{w}_{1}g}_{ij}+\,{w}_{A}{A}_i+{w}_{B}{B}_{ij}$$

(9)

where ${A}_i$ and B_ij are standard normal variables. The variables A_i, B_i₁ and B_i₂ are assumed to be independent of g_i₁ and g_i2, and each other. A_i captures effects from shared environment as well as shared genetic factors other than g. We assume ${w}_{1}^{2}+{w}_{A}^{2}+{w}_{B}^{2}=1$ so that var(X_ij) = 1. Because cor(g_i₁, g_i₂) = 1/2, ${\mathrm{cor}}\left({X}_{i1},\,{X}_{i2}\right)={w}_{A}^{2}+{w}_{1}^{2}/2.\,$ Here X_ij is not exactly normally distributed because of g_ij, but it is close if ${w}_{1}$ is small.

Simulations to study relationships between various frequency differences of an SNP

Results in Fig. 4 were simulated with $\alpha =0.055$, the participation rate of UKBB. Allele 1 of the SNP is assumed to have a population frequency of 0.5 and a positive participation effect with ${w}_{1}^{2}=0.001$. Sixteen different values of w_A are chosen so that ${\mathrm{cor}}\left({X}_{i1},\,{X}_{i2}\right)=\,{w}_{A}^{2}+{w}_{1}^{2}/2$ takes on values of 0.0005, 0.025, 0.050, 0.075, 0.100, 0.125, 0.150, 0.175, 0.193, 0.225, 0.250, 0.275, 0.300, 0.325, 0.350 and 0.375. Notably, ${\mathrm{cor}}\left({X}_{i1},\,{X}_{i2}\right)=0.193$ leads to ${\lambda }_{S}=2.0$, the reported sibling enrichment in UKBB¹³.

Simulations for dominant and recessive models were performed similarly. For the dominant model, the g_ij in the liability score definition is taken as the standardized version of a 0/1 variable, which is 1 if the actual genotype is 1 or 2, and 0 otherwise. For the recessive model, g_ij is the standardized version of a 0/1 variable, which is 1 if the actual genotype is 2.

Estimating heritability for liability scores

Here, we show how the heritability of the participation traits were estimated, starting with secondary participation. When the LDSC program is given the ${\chi }^{2}$ statistics and sample sizes for a set of genome-wide SNPs, the heritability estimate produced is an estimate of the fraction of variance of the trait accountable by the genetic component, or ${r}^{2}$ between trait and the genetic component. If genetic component $G$ influences the participation variable $I$ through liability score $X$, then

$${\mathrm{cor}}\left(G,I\right)={\mathrm{cor}}\left(G,X\right)\times {\mathrm{cor}}\left(X,I\right)$$

(10)

Heritability of $X$ and $I$ thus satisfy

$${h}^{2}\left(X\right)=\,\frac{{h}^{2}\left(I\right)}{{\mathrm{cor}}^{2}\left(X,\,I\right)}$$

(11)

with

$${\mathrm{cor}}(X,I)=\,\frac{E({XI})-E(X)E(I)}{\surd ({\mathrm{var}}(X){\mathrm{var}}(I))}=\,\frac{\alpha E(X|I=1)}{\sqrt{\alpha (1-\alpha )}}=\,\frac{\alpha \frac{\phi (\tau )}{1-\Phi (\tau )}}{\sqrt{\alpha (1-\alpha )}}=\frac{\phi (\tau )}{\sqrt{\alpha (1-\alpha )}}$$

(12)

where $\phi$ is the density function of the standard normal and E stands for expectation. Consider the Dietary Study invitation event. Feeding the LDSC program the ${\chi }^{2}$ statistics from the GWAS analyses and a sample size of 272,409 (166,993 invited; 105,416 not-invited), the heritability estimate is 0.0372. Here $\alpha =\,166993/272409=0.613,$$\tau =\,{\Phi }^{-1}\left(1-0.613\right)=\,$−0.287, and ${\mathrm{cor}}\left(X,{I}\right)$ is 0.786. The heritability of ${X}$ is estimated as the estimated heritability of ${I}$ multiplied by the adjustment factor $[1/{\mathrm{cor}}^{2}\left(X,I\right)]$, or

$$0.0372\,\times \frac{1}{{0.786}^{2}}=0.060$$

(13)

The heritability of the liability scores of the other secondary participation events are similarly estimated. The ${r}^{2}$ between a genetic component, or the genotype of an individual SNP, is weaker, or statistically less efficient, with ${I}$ than with $X$. In this sense, the adjustment factor is inversely proportional to the statistical efficiencies of the test statistics used, a principle that also applies to the two other adjustments described below. Moreover, Supplementary Note section 7 describes how the adjustment could be alternatively applied through providing the LDSC program with modified sample sizes.

With primary participation, heritability estimation requires the adjustment step above plus two others because the BSPC results are not direct comparisons of genotypes of participants and nonparticipants. Let ${n}_{{\mathrm{IBD}}2}$ and ${n}_{{\mathrm{IBD}}0}$ be the respective number of IBD2 and IBD0 pairs at the SNP location. Feeding the LDSC program the ${\chi }^{2}$ statistics from the BSPC GWAS with number of ‘cases’ equals ${n}_{{\mathrm{IBD}}2}$ and number of controls equals $2\times {n}_{{\mathrm{IBD}}0}$, the estimated heritability given is 0.0838. Here $\alpha =0.055$, $\tau =1.598$ and ${\mathrm{cor}}\left(X,I\right)$ = 0.488. Hence, the first adjustment factor is ${\left(1/0.488\right)}^{2}=\,4.2.\,$ However, the numbers of participants and nonparticipants have a 0.055 to 0.945 ratio. By contrast, the cases to controls ratio for BSPC is around 1:2, much more efficient as the variance of the comparison is roughly proportional to

$$\frac{1}{{n}_{{\mathrm{cases}}}}+\,\frac{1}{{n}_{{\mathrm{controls}}}}$$

(14)

As variance here is inversely proportional to efficiency, this leads to the adjustment factor, for arbitrary $n,$

$$\frac{\frac{1}{n\times \left(1/3\right)}+\frac{1}{n\times \left(2/3\right)}}{\frac{1}{n\times \left(0.055\right)}+\frac{1}{n\times \left(0.945\right)}}=\,\frac{3+\left(\frac{3}{2}\right)}{18.18+1.06}=0.2338$$

(15)

There is a third adjustment because the allele frequency difference between the IBD2 sibs and the IBD0 sibs is smaller than the frequency difference between the participants and nonparticipants. Specifically, for $\alpha =0.055$ and ${\lambda }_{S}=2.0$, our simulations gave

$$E\left[\frac{{F}_{{\mathrm{IBD}}2}-{F}_{{\mathrm{IBD}}0}}{{F}_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}}}\right]\approx 0.86$$

(16)

Since $(F_{{\mathrm{samp}}}-{f}_{{\mathrm{pop}}})=\left(1-\alpha \right)\left({F}_{{\mathrm{samp}}}-{F}_{{\mathrm{nonparticipants}}}\right)$

$$E\left[\frac{{F}_{{\mathrm{IBD}}2}-{F}_{{\mathrm{IBD}}0}}{{F}_{{\mathrm{samp}}}-{F}_{{\mathrm{nonparticipants}}}}\right]\approx 0.86\times \,\left(1-\alpha \right)=0.86\times 0.945=0.8127$$

(17)

Because efficiency is proportional to effect², the adjustment factor is

$${\left(\frac{1}{0.8127}\right)}^{2}=1.514$$

(18)

With the three adjustments, the estimated heritability of primary participation is

$$0.0838\times 4.2\times 0.2338\times 1.514=0.125$$

(19)

or 12.5%.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

ReachMD

Studying the Genetics of Participation Using Footprints Left on the Ascertained Genotypes

Genetic data

Identifying relatives and the group of unrelateds

Inferring IBD segments and trimming

Inferring shared allele

Test statistics

Sources of errors inducing the major-allele bias in the test statistics

IBD estimation errors

Phasing errors

Genotyping errors

Two-step adjustment: adjustments of t-statistics and χ² statistics

Effect of the major-allele bias on the χ² statistics

Polygenic score analysis

LD score regression

Liability-threshold model for participation of individuals and sib-pairs

Simulations to study relationships between various frequency differences of an SNP

Estimating heritability for liability scores

Reporting summary

Facebook Comments

Program Chapters

Segment Chapters

Playlist:

Recommended

Studying the Genetics of Participation Using Footprints Left on the Ascertained Genotypes

Genetic data

Identifying relatives and the group of unrelateds

Inferring IBD segments and trimming

Inferring shared allele

Test statistics

Sources of errors inducing the major-allele bias in the test statistics

IBD estimation errors

Phasing errors

Genotyping errors

Two-step adjustment: adjustments of t-statistics and χ2 statistics

Effect of the major-allele bias on the χ2 statistics

Polygenic score analysis

LD score regression

Liability-threshold model for participation of individuals and sib-pairs

Simulations to study relationships between various frequency differences of an SNP

Estimating heritability for liability scores

Reporting summary

Facebook Comments

Title

Share on ReachMD

Program Chapters

Segment Chapters

Playlist:

Recommended

Get a Dose of ReachMD in Your Inbox and Practice Smarter Medicine

Two-step adjustment: adjustments of t-statistics and χ² statistics

Effect of the major-allele bias on the χ² statistics