Matches in SemOpenAlex for { <https://semopenalex.org/work/W2952535009> ?p ?o ?g. }
- W2952535009 endingPage "49" @default.
- W2952535009 startingPage "37" @default.
- W2952535009 abstract "The availability of electronic health record (EHR)-based phenotypes allows for genome-wide association analyses in thousands of traits and has great potential to enable identification of genetic variants associated with clinical phenotypes. We can interpret the phenome-wide association study (PheWAS) result for a single genetic variant by observing its association across a landscape of phenotypes. Because a PheWAS can test thousands of binary phenotypes, and most of them have unbalanced or often extremely unbalanced case-control ratios (1:10 or 1:600, respectively), existing methods cannot provide an accurate and scalable way to test for associations. Here, we propose a computationally fast score-test-based method that estimates the distribution of the test statistic by using the saddlepoint approximation. Our method is much (∼100 times) faster than the state-of-the-art Firth’s test. It can also adjust for covariates and control type I error rates even when the case-control ratio is extremely unbalanced. Through application to PheWAS data from the Michigan Genomics Initiative, we show that the proposed method can control type I error rates while replicating previously known association signals even for traits with a very small number of cases and a large number of controls. 
Over the last decade, genome-wide association studies (GWASs) have proved instrumental to unravelling the genetic complexities of hundreds of diseases and traits and their associations with common genomic variations. To date, thousands of GWASs have identified more than 4,000 significant loci to be associated with human diseases and traits [1: Welter D. MacArthur J. Morales J. Burdett T. Hall P. Junkins H. Klemm A. Flicek P. Manolio T. Hindorff L. Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014; 42: D1001-D1006]. However, because most GWASs investigate a single disease or trait, they cannot exploit the cross-phenotype associations or pleiotropy [2: Solovieff N. Cotsapas C. Lee P.H. Purcell S.M. Smoller J.W. Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet. 2013; 14: 483-495], where a single genetic variant can be associated with multiple phenotypes. 
The phenome-wide association study (PheWAS) has been proposed as an alternative approach to take advantage of the pleiotropy phenomenon by studying the impact of genetic variations across a broad spectrum of human phenotypes or “phenome.” It is a complementary approach to the GWAS in the sense that whereas a GWAS attempts to identify phenotype-to-genotype associations, a PheWAS uses a genotype-to-phenotype approach. The first PheWAS [3: Denny J.C. Ritchie M.D. Basford M.A. Pulley J.M. Bastarache L. Brown-Gentry K. Wang D. Masys D.R. Roden D.M. Crawford D.C. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010; 26: 1205-1210] was published as a proof-of-principle study that demonstrated that the PheWAS strategy could be applied to successfully identify the expected gene-disease associations. Additional studies [4: Denny J.C. Crawford D.C. Ritchie M.D. Bielinski S.J. Basford M.A. Bradford Y. Chai H.S. Bastarache L. Zuvich R. Peissig P. et al. Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: using electronic medical records for genome- and phenome-wide studies. Am. J. Hum. Genet. 2011; 89: 529-542] [5: Hebbring S.J. Schrodi S.J. Ye Z. Zhou Z. Page D. Brilliant M.H. A PheWAS approach in studying HLA-DRB1*1501. Genes Immun. 2013; 14: 187-191] [6: Ritchie M.D. Denny J.C. Zuvich R.L. Crawford D.C. Schildcrout J.S. Bastarache L. Ramirez A.H. Mosley J.D. Pulley J.M. Basford M.A. et al.; Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) QRS Group. Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk. Circulation. 2013; 127: 1377-1385] [7: Pendergrass S.A. Brown-Gentry K. Dudek S. Frase A. Torstenson E.S. Goodloe R. Ambite J.L. Avery C.L. Buyske S. Bůžková P. et al. Phenome-wide association study (PheWAS) for detection of pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet. 2013; 9: e1003087] [8: Shameer K. Denny J.C. Ding K. Jouni H. Crosslin D.R. de Andrade M. Chute C.G. Peissig P. Pacheco J.A. Li R. et al. A genome- and phenome-wide association study to identify genetic variants influencing platelet count and volume and their pleiotropic effects. Hum. Genet. 2014; 133: 95-109] have shown that the PheWAS approach can further identify previously unreported disease-SNP associations [9: Hebbring S.J. The challenges, advantages and future of phenome-wide association studies. Immunology. 2014; 141: 157-165]. 
The PheWAS approach depends on the availability of detailed phenotypic information. Currently, most PheWASs are applied to clinical cohorts linked to electronic health records (EHRs) and utilize the International Classification of Disease (ICD) billing codes to define clinical phenotypes. The ICD codes provide intuitive phenotype ordering based on clinical disease and trait classifications. Given that the current genotyping and imputation technologies allow for genotyping of tens of millions of variants at a very low cost [10: Marchini J. Howie B. Genotype imputation for genome-wide association studies. Nat. Rev. Genet. 2010; 11: 499-511], an extensive PheWAS can attempt to investigate the genotype-phenotype associations by performing genome-wide association analyses in thousands of traits. We can interpret the PheWAS result of a single genetic variant by observing its associations across the phenome. Such a PheWAS is exhaustive in nature and has great potential to identify variants associated with clinical diseases. 
One of the main challenges of the PheWAS approach is that most of the phenotypes are binary phenotypes with unbalanced (1:5) or often extremely unbalanced (1:600) case-control ratios (see Figure S1), given that the data are collected in cohorts. Although standard asymptotic tests (such as the Wald, score, and likelihood-ratio tests) are relatively well calibrated and asymptotically equivalent [11: Cox D. Hinkley D. Theoretical Statistics. Chapman and Hall, 1974] for common (minor allele frequency [MAF] > 0.05) variants in balanced case-control studies, they can inflate type I error for low-frequency (0.01 < MAF ≤ 0.05) and rare (MAF ≤ 0.01) variants in unbalanced case-control studies [12: Ma C. Blackwell T. Boehnke M. Scott L.J. GoT2D investigators. Recommended joint and meta-analysis strategies for case-control association testing of single low-count variants. Genet. Epidemiol. 2013; 37: 539-550]. Moreover, because the Wald and likelihood-ratio tests need to calculate the likelihood or the maximum-likelihood estimator under the full model, which is computationally expensive, they do not scale to the number of tests that PheWASs attempt. On the other hand, the score test is computationally efficient because it does not need to calculate the maximum likelihood under the full model. However, as noted above, it suffers from highly inflated type I error rates in unbalanced studies. Ma et al. proposed Firth’s penalized likelihood-ratio test [13: Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993; 80: 27-38] as a solution to control the type I error rates in such situations. Firth’s test, despite being well calibrated and robust for testing low-frequency and rare variants in unbalanced studies, lacks computational efficiency because it also involves calculating the maximum likelihood under the full model. 
For instance, the projected computation time for testing 1,500 phenotypes across 10 million SNPs is ∼117 CPU years (2,000 cases and 18,000 controls). Thus, it is impractical to apply Firth’s test for analyzing large PheWAS datasets. In this paper, we propose a score-based single-variant test for binary phenotypes that is well calibrated for controlling type I error and can adjust for covariates even in extremely unbalanced case-control studies. Moreover, our test is computationally efficient and scalable to testing thousands of phenotypes across millions of SNPs in large PheWAS datasets. Our proposed test (SPA) is based on score statistics and estimates the null distribution by using the saddlepoint approximation [14: Daniels H.E. Saddlepoint Approximations in Statistics. Ann. Math. Stat. 1954; 25: 631-650] [15: Barndorff-Nielsen O.E. Approximate Interval Probabilities. J. R. Stat. Soc. B. 1990; 52: 485-496] [16: Kuonen D. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika. 1999; 86: 929-935] instead of the normal approximation [17: Feller W. The fundamental limit theorems in probability. Bull. Amer. Math. Soc. 1945; 51: 800-832] traditionally used in score tests. We further develop an improvement of our test (fastSPA) that renders the most computationally challenging steps dependent only on the number of carriers (subjects with at least one minor allele) rather than the sample size. This improved test can substantially reduce the computation time, especially for low-frequency and rare variants, where the number of carriers is much lower than the sample size. Our method’s projected computation time for testing 1,500 phenotypes across 10 million SNPs is ∼400 CPU days (2,000 cases and 18,000 controls), which is more than 100 times faster than that of Firth’s test. 
In addition, through extensive simulation studies and analysis of the Michigan Genomics Initiative (MGI) data, we demonstrate that the proposed approach can control type I error and is powerful enough to replicate known association signals. We consider a case-control study with sample size $n$. For the $i$th subject, let $Y_i = 1$ or $0$ denote the case-control status, $X_i$ denote the $k \times 1$ vector of non-genetic covariates (including the intercept), and $G_i$ denote the number of minor alleles ($G_i = 0, 1, 2$) of the variant to be tested. To relate genotypes to phenotypes, we use the following logistic regression model: $\operatorname{logit}[\Pr(Y_i = 1 \mid X_i, G_i)] = X_i^T \beta + G_i \gamma$ for $i = 1, 2, \ldots, n$ (Equation 1), where $\beta$ is a $k \times 1$ vector of coefficients of the covariates, and $\gamma$ is the genotype log odds ratio. Under this model, we are interested in testing for the genetic association by testing the null hypothesis $H_0: \gamma = 0$. Let $\hat{\mu}_i$ be the estimate of $\mu_i = \Pr(Y_i = 1 \mid X_i)$, which is the probability of being a case under $H_0$. A score statistic for $\gamma$ from the model (Equation 1) is given by $S = \sum_{i=1}^{n} G_i (Y_i - \hat{\mu}_i)$. Suppose $X = (X_1, \ldots, X_n)^T$ is the $n \times k$ matrix of covariates, $G = (G_1, \ldots, G_n)^T$ is the genotype vector, $W$ is a diagonal matrix with $\hat{\mu}_i (1 - \hat{\mu}_i)$ as the $i$th diagonal element, and $\tilde{G} = G - X (X^T W X)^{-1} X^T W G$ is a covariate-adjusted genotype vector in which covariate effects are projected out from the genotypes (details are given in Appendix A). Then, $S$ can be written as $S = \sum_{i=1}^{n} \tilde{G}_i (Y_i - \hat{\mu}_i)$ (Equation 2), and the mean and variance of $S$ under $H_0$ are $E_{H_0}(S) = 0$ and $V_{H_0}(S) = \sum_{i=1}^{n} \tilde{G}_i^2 \hat{\mu}_i (1 - \hat{\mu}_i)$, respectively, where $\tilde{G}_i$ is the $i$th element of $\tilde{G}$. The traditional score test approximates the null distribution by using a normal distribution, which depends only on the mean and the variance of the score statistic. We can obtain the p value by comparing the observed test statistic ($s$) and $N(0, V_{H_0}(S))$. Normal approximation works well near the mean of the distribution but performs very poorly at the tails. 
The performance is especially poor when the underlying distribution is highly skewed, such as in unbalanced case-control outcomes [12], because normal approximation cannot incorporate higher moments such as skewness. In addition, the convergence rate of normal approximation [18: Berry A.C. The accuracy of the Gaussian approximation to the sum of independent variates. Trans. Am. Math. Soc. 1941; 49: 122-136] [19: Esseen C.G. On the Liapounoff Limit of Error in the Theory of Probability. Ark. Mat. Astr. Fys. 1942; 28A: 1-19] [20: Esseen C.G. A Moment Inequality with an Application to the Central Limit Theorem. Skand Aktuarietidskr. 1956; 39: 160-170] is $O(n^{-1/2})$, which is not fast enough for rare variants. Saddlepoint approximation was introduced by Daniels [14] as an improvement over the normal approximation. Contrary to normal approximation, where only the first two cumulants (mean and variance) are used for approximating the underlying distribution, saddlepoint approximation uses the entire cumulant-generating function (CGF). Jensen [21: Jensen J.L. Saddlepoint Approximations. Oxford University Press, 1995] further showed that saddlepoint approximation has a relative error bound of $O(n^{-3/2})$, making it a considerable improvement over the normal approximation. To use the saddlepoint approximation, we first derive the CGF of $S$ from the fact that $Y_i \sim \mathrm{Bernoulli}(\mu_i)$ under $H_0$. Let $\hat{\mu}$ be an $n \times 1$ vector with $\hat{\mu}_i$ as the $i$th element. 
From Equation 2, the estimate of the CGF of the score statistic $S$ is $K(t) = \log(E_{H_0}(e^{tS})) = \sum_{i=1}^{n} \log(1 - \hat{\mu}_i + \hat{\mu}_i e^{\tilde{G}_i t}) - t \sum_{i=1}^{n} \tilde{G}_i \hat{\mu}_i$, and the estimates of the first- and second-order derivatives of $K$ are $K'(t) = \sum_{i=1}^{n} \frac{\hat{\mu}_i \tilde{G}_i}{(1 - \hat{\mu}_i) e^{-\tilde{G}_i t} + \hat{\mu}_i} - \sum_{i=1}^{n} \tilde{G}_i \hat{\mu}_i$ and $K''(t) = \sum_{i=1}^{n} \frac{(1 - \hat{\mu}_i) \hat{\mu}_i \tilde{G}_i^2 e^{-\tilde{G}_i t}}{[(1 - \hat{\mu}_i) e^{-\tilde{G}_i t} + \hat{\mu}_i]^2}$, respectively. We note that $K$, $K'$, and $K''$ are plug-in estimates in which we plug in $\hat{\mu}_i$ instead of $\mu_i$. Then, according to the saddlepoint method (Barndorff-Nielsen [15, 16]), the distribution of $S$ at $s$ can be approximated by $\Pr(S < s) \approx \tilde{F}(s) = \Phi\{w + \frac{1}{w} \log(\frac{v}{w})\}$, where $w = \operatorname{sgn}(\hat{t}) \sqrt{2(\hat{t} s - K(\hat{t}))}$, $v = \hat{t} \sqrt{K''(\hat{t})}$, $\hat{t}$ is the solution to the equation $K'(\hat{t}) = s$, and $\Phi$ is the distribution function of a standard normal distribution. The saddlepoint approximation method involves finding the root of the saddlepoint equation $K'(t) = s$. It is easy to verify that $K'$ is strictly increasing because $K''(t) > 0$ for all $-\infty < t < \infty$, and that $s = \sum_{i=1}^{n} \tilde{G}_i (Y_i - \hat{\mu}_i)$ lies between $\lim_{t \to \infty} K'(t) = \sum_{i: \tilde{G}_i > 0} \tilde{G}_i - \sum_{i=1}^{n} \tilde{G}_i \hat{\mu}_i$ and $\lim_{t \to -\infty} K'(t) = \sum_{i: \tilde{G}_i < 0} \tilde{G}_i - \sum_{i=1}^{n} \tilde{G}_i \hat{\mu}_i$. Therefore, a unique root exists, and we can use popular root-finding algorithms (Newton-Raphson [22: Whittaker E.T. Robinson G. The Newton-Raphson Method. In: The Calculus of Observations: A Treatise on Numerical Mathematics. Fourth Edition. Dover, 1967: 84-87] [23: Press W.H. Flannery B.P. Teukolsky S.A. Vetterling W.T. Numerical Recipes in Fortran 77: The Art of Scientific Computing. Second Edition. Cambridge University Press, 1992], bisection [23], secant [23], and Brent’s method [24: Brent R.P. Algorithms for Minimization without Derivatives. Prentice-Hall, 1973]) to efficiently solve this equation. For our simulation studies and real-data applications, we applied a combination of the Newton-Raphson and bisection method to solve the saddlepoint equations. 
The most computationally demanding step in this saddlepoint approximation method is calculating the CGF $K$ and its derivatives. Here, we propose several approaches to reducing the computational complexities associated with these calculations. In each step of the root-finding algorithm, we need to calculate $K$, $K'$, and $K''$, each of which needs $O(n)$ computations. Using the fact that many elements of $G$ are zeroes (i.e., homozygous major genotypes), we propose a fast computation method that speeds up the computation to $O(m)$, where $m$ is the number of non-zero elements in $G$. Without loss of generality, we assume that the first $m$ subjects have at least one minor allele each and the rest have homozygous major genotypes. We can then express $S$ as $S = S_1 + S_2$, where $S_1 = \sum_{i=1}^{m} \tilde{G}_i (Y_i - \hat{\mu}_i)$ and $S_2 = \sum_{i=m+1}^{n} \tilde{G}_i (Y_i - \hat{\mu}_i)$. Let $Z = (X^T W X)^{-1} X^T W G$, and let $Z_l$ be the $l$th element of $Z$. Then, we can further express $S_2$ as $S_2 = \sum_{i=m+1}^{n} \tilde{G}_i (Y_i - \hat{\mu}_i) = \sum_{i=m+1}^{n} (0 - X_i^T Z)(Y_i - \hat{\mu}_i) = -\sum_{i=m+1}^{n} \sum_{l=1}^{k} X_{il} Z_l (Y_i - \hat{\mu}_i) = -\sum_{l=1}^{k} Z_l \sum_{i=m+1}^{n} X_{il} (Y_i - \hat{\mu}_i) = -\sum_{l=1}^{k} Z_l S_{2l}$, where $S_{2l} = \sum_{i=m+1}^{n} X_{il} (Y_i - \hat{\mu}_i)$. Now, if we assume that the non-genetic covariates are relatively balanced in the sample, then the normal distribution should be a good approximation of the null distribution of each $S_{2l}$. 
Because $S_2$ is a weighted sum of the $S_{2l}$ variables, we can also approximate the null distribution of $S_2$ by using a normal distribution where the mean and variance under $H_0$ are given by $E_{H_0}(S_2) = 0$ and $V_{H_0}(S_2) = \sum_{i=m+1}^{n} \tilde{G}_i^2 \hat{\mu}_i (1 - \hat{\mu}_i)$, respectively. Then, the CGF of $S_2$ can be approximated by $K_2(t) = \frac{1}{2} t^2 V_{H_0}(S_2)$, and the CGF of $S = S_1 + S_2$ can be approximated by $K(t) = \sum_{i=1}^{m} \log(1 - \hat{\mu}_i + \hat{\mu}_i e^{\tilde{G}_i t}) - t \sum_{i=1}^{m} \tilde{G}_i \hat{\mu}_i + \frac{1}{2} t^2 V_{H_0}(S_2)$ (Equation 3). In order to calculate the first two terms on the right side of Equation 3, we will need $\tilde{G}_i$ values for $i = 1, \ldots, m$, which can be calculated in $O(m)$ computations given that $G$ has only $m$ non-zero elements and the quantity $X (X^T W X)^{-1} X^T W$ can be pre-calculated. Then, the first two terms will require only $O(m)$ computations because both of them sum over $m$ elements. Next, the variance $V_{H_0}(S_2)$ can be further broken down into $V_{H_0}(S_2) = \sum_{i=m+1}^{n} \tilde{G}_i^2 \hat{\mu}_i (1 - \hat{\mu}_i) = \sum_{i=m+1}^{n} (X_i^T Z)^2 \hat{\mu}_i (1 - \hat{\mu}_i) = \sum_{i=1}^{n} (X_i^T Z)^2 \hat{\mu}_i (1 - \hat{\mu}_i) - \sum_{i=1}^{m} (X_i^T Z)^2 \hat{\mu}_i (1 - \hat{\mu}_i) = Z^T (X^T W X) Z - \sum_{i=1}^{m} (X_i^T Z)^2 \hat{\mu}_i (1 - \hat{\mu}_i)$. Because $X^T W X$ can be pre-calculated and $Z$ is a $k \times 1$ vector, the first term requires O(k) computations, and the second term requires $O(m)$ computations, which implies that the calculation of $V_{H_0}(S_2)$ requires $O(m)$ calculations under the assumption that $k < m$, i.e., the number of non-genetic covariates is smaller than the number of subjects with at least one minor allele each. Hence, the CGF $K(t)$ can be calculated in $O(m)$ computations. Using similar arguments, we can further show that the derivatives $K'(t)$ and $K''(t)$ can also be calculated in $O(m)$ computations. Therefore, this partially normal approximation reduces the computational complexity of our test from $O(n)$ to $O(m)$, which is especially useful for rare variants, where $m$ is much smaller than $n$. 
Because the normal approximation behaves well near the mean of the distribution, we can use it to obtain the p value when the observed score statistic ($s$) lies close to the mean (0). 
Moreover, saddlepoint approximation can be numerically unstable very close to the mean of the distribution. We can also avoid such situations by using normal approximation near the mean. One possible approach is to use a fixed threshold in which we apply normal approximation to obtain the p value if the observed score statistic satisfies $|s| < r\sigma$, where $\sigma = \sqrt{V_{H_0}(S)}$ and $r$ is a pre-specified value. For example, we used $r = 2$ in our simulation studies and real-data analyses. For a given level $\alpha$, this approach does not inflate type I error rates if $r < \Phi^{-1}(1 - \alpha/2)$, where $\Phi^{-1}$ is the inverse function of the standard normal distribution function, $\Phi(x)$. Alternatively, we can adaptively select the threshold by using the error bound of the normal approximation given by the Berry-Esseen theorem. Suppose we are interested in controlling the type I error rate at level $\alpha$. Let $F_n(x)$ be the true distribution function of the standardized score test statistic $S / \sqrt{V_{H_0}(S)}$. Then, according to the Berry-Esseen theorem [18-20], the maximum error bound in approximating $F_n(x)$ by $\Phi(x)$ is $\sup_{x \in \mathbb{R}} |F_n(x) - \Phi(x)| \le B_n = C (\sigma^2)^{-3/2} (\sum_{i=1}^{n} \rho_i)$ (Equation 4), where $\rho_i = E_{H_0}[|\tilde{G}_i (Y_i - \hat{\mu}_i)|^3] = |\tilde{G}_i|^3 \hat{\mu}_i (1 - \hat{\mu}_i) [\hat{\mu}_i^2 + (1 - \hat{\mu}_i)^2]$ and $C$ is a constant. As of now, the best-known estimate for $C$ is 0.56, given by Shevtsova [25: Shevtsova I.G. An improvement of convergence rate estimates in the Lyapunov theorem. Dokl. Math. 2010; 82: 862-864]. Suppose $p_F$ and $p_N$ are $F_n(x)$- and $\Phi(x)$-based p values, respectively. From the Berry-Esseen theorem, we can show $p_N \le p_F + B_n$. 
Suppose $q = B_n + \alpha/2$ and $r_\alpha = \Phi^{-1}(1 - q)$. Then, $p_N \ge q$ indicates $p_F \ge \alpha/2$. Therefore, we use $r_\alpha \sigma$ as a threshold at level $\alpha$ in which we will apply normal approximation if $|s| < r_\alpha \sigma$. 
To evaluate the computation times, type I error rates, and power of the proposed method, we carried out extensive simulation studies. We considered three different case-control ratios: balanced with 10,000 cases and 10,000 controls, moderately unbalanced with 2,000 cases and 18,000 controls, and extremely unbalanced with 40 cases and 19,960 controls. For each choice of case-control ratio, the phenotypes were simulated on the basis of the following logistic model: $\operatorname{logit}[\Pr(Y_i = 1)] = \beta_0 + X_{1i} + X_{2i} + \gamma G_i$, where the two non-genetic covariates $X_{1i}$ and $X_{2i}$ were simulated from $X_{1i} \sim \mathrm{Bernoulli}(0.5)$ and $X_{2i} \sim N(0, 1)$, respectively. The intercept $\beta_0$ was chosen to correspond to a prevalence of 0.01. The genotype $G_i$ values were generated from a binomial$(2, p)$ distribution where $p$ was the MAF. The parameter $\gamma$ represents the genotype log odds ratio. To estimate computation times and type I error rates in realistic scenarios, we randomly sampled the MAF ($p$) from the MAF distribution in the MGI data. To compare computation times, we simulated $10^4$ variants with $\gamma = 0$. To compare type I error rates, we simulated $10^9$ variants with $\gamma = 0$ and recorded the number of rejections at $\alpha = 5 \times 10^{-5}$ and $5 \times 10^{-8}$. We also used fixed MAFs to evaluate the effect of MAF on computation time and type I error rates. For the power calculations, we considered two different choices for MAF ($p = 0.01$ and $0.05$) and wide ranges of $\gamma$ (Figure 4). For each choice of $p$ and $\gamma$, we generated 5,000 variants. 
We compared the computation times of seven different tests: a traditional score test using normal approximation (Score), the saddlepoint-approximation-based test with a standard-deviation threshold at 0.1 and 2 (SPA-0.1 and SPA-2, respectively), the fast saddlepoint-approximation-based test with the partially normal approximation improvement and a standard-deviation threshold at 0.1 and 2 (fastSPA-0.1 and fastSPA-2, respectively), the fastSPA test with the Berry-Esseen bound threshold at $\alpha = 5 \times 10^{-8}$ (fastSPA-BE), and Firth’s penalized likelihood-ratio test. Next, we compared the empirical type I errors and power curves for fastSPA-2, Score, and Firth’s test at $5 \times 10^{-8}$. Because performing Firth’s test $10^9$ times, which is required for estimating type I error rates at $5 \times 10^{-8}$, is practically impossible given the heavy computational burden, we performed a hybrid approach in which we used Firth’s test only when the fastSPA-2 p values were smaller than $5 \times 10^{-3}$. For the power comparison, because Score has extremely inflated type I errors in the unbalanced and extremely unbalanced case-control scenarios (as shown in the Results), it might not be appropriate to directly compare the power of Score with that of the other two tests at the same nominal $\alpha$ level. In order to provide a more meaningful comparison, we compared their powers at the empirical $\alpha$ levels where their empirical type I errors became $5 \times 10^{-8}$. The empirical $\alpha$ levels were selected on the basis of the type I error simulations, whereby variants were simulated with MAF randomly sampled from the MAF distribution of the MGI data. This approach is similar to performing resampling (e.g., permutation) to control family-wise error rates. We also estimated the powers at the nominal fixed $\alpha = 5 \times 10^{-8}$. In order to compare the p values resulting from different tests, we also simulated $5 \times 10^{-6}$ variants with MAFs randomly sampled from the MAF distribution of the MGI data. 
We further compared the inflation factors of the genomic controls at different p value quantiles for fastSPA-2, fastSPA-BE, and fastSPA-0.1 in order to explore the effect of the standard-deviation threshold on the inflation factor. To illustrate the performance of the proposed methods in real-data application, we analyzed four selected phenotypes in the MGI data. The main goal of MGI is to create an institutional repository of genetic data together with rich clinical phenotypes for a broad portfolio of future medical research. DNA from blood samples of >20,000 individuals who underwent surgical procedures at the University of Michigan Health System was genotyped (with their informed consent) on the Illumina HumanCoreExome v.12.1 array, which is a combined GWAS plus exome array composed of >500,000 SNPs. Genotypes of the Haplotype Reference Consortium [26: McCarthy S. Das S. Kretzschmar W. Delaneau O. Wood A.R. Teumer A. Kang H.M. Fuchsberger C. Danecek P. Sharp K. et al.; Haplotype Reference Consortium. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 2016; 48: 1279-1283] (chromosomes 1–22: HRC release 1; chromosome X: HRC release 1.1) were imputed into the phased MGI genotypes (SHAPEIT2 [27: Delaneau O. Zagury J.-F. Marchini J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods. 2013; 10: 5-6] on autosomal chromosomes and Eagle2 [28: Loh P.-R. Danecek P. Palamara P.F. Fuchsberger C. Reshef Y.A. Finucane H.K. Schoenherr S. Forer L. McCarthy S. Abecasis G.R. et al. Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 2016; 48: 1443-1448] on chromosome X) with Minimac3 [29: Das S. Forer L. Schönherr S. Sidore C. Locke A.E. Kwong A. Vrieze S.I. Chew E.Y. Levy S. McGue M. et al. Next-generation genotype imputation service and methods. Nat. Genet. 2016; 48: 1284-1287]. 
Excluding variants with low imputation quality ($R^2 < 0.3$) resulted in dense mapping at over 39 million quality-imputed genetic markers. Phenotypes derived from 8,940 ICD-9 billing codes were classified into 1,815 PheWAS disease states of shared disease etiology, of which 1,448 had at least 20 cases. Standard code translations were used for converting the taxonomy of diagnostic ICD-9 codes into PheWAS code groups (PheWAS code translation table v.1.2 [30: Carroll R.J. Bastarache L. Denny J.C. R PheWAS: data analysis and plotting tools for phenome-wide association studies in the R environment. Bioinformatics. 2014; 30: 2375-2376]). Cases were derived from EHRs of individuals with at least two encounters with an ICD-9 billing code. This is a typical example of many recent large-scale PheWASs. To compare our proposed fastSPA-2 with Score and the current gold-standard Firth’s test in analyzing such PheWAS data, we performed genome-wide association analyses for four selected traits (skin cancer [PheWAS code: 172], type 2 diabetes [PheWAS code: 250.2; MIM: 125853], primary hypercoagulable state [PheWAS code: 286.81; MIM: 188055], and cystic fibrosis [PheWAS code: 499; MIM: 219700]) in 18,267 unrelated individuals of European ancestry while adjusting for age, sex, and four principal components. Genotyped samples with any missing covariate information were excluded from the analysis. Given that imputati" @default.
- W2952535009 created "2019-06-27" @default.
- W2952535009 creator A5049355350 @default.
- W2952535009 creator A5054956552 @default.
- W2952535009 creator A5062245263 @default.
- W2952535009 creator A5078559681 @default.
- W2952535009 date "2017-07-01" @default.
- W2952535009 modified "2023-10-01" @default.
- W2952535009 title "A Fast and Accurate Algorithm to Test for Binary Phenotypes and Its Application to PheWAS" @default.
- W2952535009 cites W1750145230 @default.
- W2952535009 cites W1965958910 @default.
- W2952535009 cites W1968624767 @default.
- W2952535009 cites W1978116054 @default.
- W2952535009 cites W1979421770 @default.
- W2952535009 cites W1996299724 @default.
- W2952535009 cites W2011693952 @default.
- W2952535009 cites W2012486622 @default.
- W2952535009 cites W2016104730 @default.
- W2952535009 cites W2020658897 @default.
- W2952535009 cites W2029119894 @default.
- W2952535009 cites W2036145773 @default.
- W2952535009 cites W2049361783 @default.
- W2952535009 cites W2049914201 @default.
- W2952535009 cites W2060427373 @default.
- W2952535009 cites W2063655230 @default.
- W2952535009 cites W2078078890 @default.
- W2952535009 cites W2078129874 @default.
- W2952535009 cites W2078797426 @default.
- W2952535009 cites W2081596026 @default.
- W2952535009 cites W2087036932 @default.
- W2952535009 cites W2095970568 @default.
- W2952535009 cites W2102215872 @default.
- W2952535009 cites W2103347314 @default.
- W2952535009 cites W2110549414 @default.
- W2952535009 cites W2113105800 @default.
- W2952535009 cites W2116868464 @default.
- W2952535009 cites W2125391335 @default.
- W2952535009 cites W2131602910 @default.
- W2952535009 cites W2133517490 @default.
- W2952535009 cites W2176146180 @default.
- W2952535009 cites W2510973425 @default.
- W2952535009 cites W2511515754 @default.
- W2952535009 cites W2529241974 @default.
- W2952535009 cites W278413204 @default.
- W2952535009 cites W4231616774 @default.
- W2952535009 doi "https://doi.org/10.1016/j.ajhg.2017.05.014" @default.
- W2952535009 hasPubMedCentralId "https://www.ncbi.nlm.nih.gov/pmc/articles/5501775" @default.
- W2952535009 hasPubMedId "https://pubmed.ncbi.nlm.nih.gov/28602423" @default.
- W2952535009 hasPublicationYear "2017" @default.
- W2952535009 type Work @default.
- W2952535009 sameAs 2952535009 @default.
- W2952535009 citedByCount "113" @default.
- W2952535009 countsByYear W29525350092017 @default.
- W2952535009 countsByYear W29525350092018 @default.
- W2952535009 countsByYear W29525350092019 @default.
- W2952535009 countsByYear W29525350092020 @default.
- W2952535009 countsByYear W29525350092021 @default.
- W2952535009 countsByYear W29525350092022 @default.
- W2952535009 countsByYear W29525350092023 @default.
- W2952535009 crossrefType "journal-article" @default.
- W2952535009 hasAuthorship W2952535009A5049355350 @default.
- W2952535009 hasAuthorship W2952535009A5054956552 @default.
- W2952535009 hasAuthorship W2952535009A5062245263 @default.
- W2952535009 hasAuthorship W2952535009A5078559681 @default.
- W2952535009 hasBestOaLocation W29525350091 @default.
- W2952535009 hasConcept C104317684 @default.
- W2952535009 hasConcept C11413529 @default.
- W2952535009 hasConcept C127716648 @default.
- W2952535009 hasConcept C151730666 @default.
- W2952535009 hasConcept C2777267654 @default.
- W2952535009 hasConcept C33923547 @default.
- W2952535009 hasConcept C41008148 @default.
- W2952535009 hasConcept C48372109 @default.
- W2952535009 hasConcept C54355233 @default.
- W2952535009 hasConcept C70721500 @default.
- W2952535009 hasConcept C86803240 @default.
- W2952535009 hasConcept C94375191 @default.
- W2952535009 hasConceptScore W2952535009C104317684 @default.
- W2952535009 hasConceptScore W2952535009C11413529 @default.
- W2952535009 hasConceptScore W2952535009C127716648 @default.
- W2952535009 hasConceptScore W2952535009C151730666 @default.
- W2952535009 hasConceptScore W2952535009C2777267654 @default.
- W2952535009 hasConceptScore W2952535009C33923547 @default.
- W2952535009 hasConceptScore W2952535009C41008148 @default.
- W2952535009 hasConceptScore W2952535009C48372109 @default.
- W2952535009 hasConceptScore W2952535009C54355233 @default.
- W2952535009 hasConceptScore W2952535009C70721500 @default.
- W2952535009 hasConceptScore W2952535009C86803240 @default.
- W2952535009 hasConceptScore W2952535009C94375191 @default.
- W2952535009 hasIssue "1" @default.
- W2952535009 hasLocation W29525350091 @default.
- W2952535009 hasLocation W29525350092 @default.
- W2952535009 hasLocation W29525350093 @default.
- W2952535009 hasLocation W29525350094 @default.
- W2952535009 hasLocation W29525350095 @default.
- W2952535009 hasOpenAccess W2952535009 @default.
- W2952535009 hasPrimaryLocation W29525350091 @default.
- W2952535009 hasRelatedWork W2110367415 @default.
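
The saddlepoint p value described in the abstract text of W2952535009 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the covariate-adjusted genotypes (g) and the fitted null probabilities (mu, each strictly between 0 and 1) are already available, solves K'(t) = s by plain bisection rather than the Newton-Raphson and bisection combination mentioned in the text, and forms a two-sided p value by adding both tail approximations, which is an assumption of this sketch. All function names are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def _exp(x):
    # Clamp the exponent so that exp (and its square) cannot overflow.
    return math.exp(min(x, 300.0))


def cgf(t, g, mu):
    """Plug-in CGF K(t) of S = sum_i g_i (Y_i - mu_i), Y_i ~ Bernoulli(mu_i)."""
    return sum(math.log(1 - m + m * _exp(gi * t)) - t * gi * m
               for gi, m in zip(g, mu))


def cgf_d1(t, g, mu):
    """First derivative K'(t)."""
    return sum(m * gi / ((1 - m) * _exp(-gi * t) + m) - gi * m
               for gi, m in zip(g, mu))


def cgf_d2(t, g, mu):
    """Second derivative K''(t); K''(0) equals the null variance of S."""
    total = 0.0
    for gi, m in zip(g, mu):
        e = (1 - m) * _exp(-gi * t)
        total += m * gi * gi * e / (e + m) ** 2
    return total


def solve_saddlepoint(x, g, mu, tol=1e-10):
    """Solve K'(t) = x by bisection (K' is strictly increasing)."""
    lo, hi = -1.0, 1.0
    for _ in range(60):              # expand the bracket downwards
        if cgf_d1(lo, g, mu) <= x:
            break
        lo *= 2.0
    for _ in range(60):              # expand the bracket upwards
        if cgf_d1(hi, g, mu) >= x:
            break
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cgf_d1(mid, g, mu) < x:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol:
            break
    return 0.5 * (lo + hi)


def spa_z(x, g, mu):
    """Return z such that Pr(S < x) is approximated by PHI(z)."""
    shift = sum(gi * m for gi, m in zip(g, mu))
    upper = sum(gi for gi in g if gi > 0) - shift  # limit of K'(t), t -> +inf
    lower = sum(gi for gi in g if gi < 0) - shift  # limit of K'(t), t -> -inf
    if x >= upper:
        return math.inf
    if x <= lower:
        return -math.inf
    t_hat = solve_saddlepoint(x, g, mu)
    w = math.copysign(
        math.sqrt(max(0.0, 2.0 * (t_hat * x - cgf(t_hat, g, mu)))), t_hat)
    v = t_hat * math.sqrt(cgf_d2(t_hat, g, mu))
    if w == 0.0 or v / w <= 0.0:
        return 0.0  # formula is unstable at the mean; fall back to z = 0
    return w + math.log(v / w) / w


def normal_pvalue(s, g, mu):
    """Two-sided p value from the normal approximation N(0, V_H0(S))."""
    sigma = math.sqrt(cgf_d2(0.0, g, mu))
    return 2.0 * PHI(-abs(s) / sigma)


def spa_pvalue(s, g, mu, r=2.0):
    """Normal approximation if |s| < r * sigma, saddlepoint otherwise."""
    sigma = math.sqrt(cgf_d2(0.0, g, mu))
    if abs(s) < r * sigma:
        return normal_pvalue(s, g, mu)
    upper_tail = PHI(-spa_z(abs(s), g, mu))   # approximates Pr(S >= |s|)
    lower_tail = PHI(spa_z(-abs(s), g, mu))   # approximates Pr(S < -|s|)
    return upper_tail + lower_tail
```

For a toy input with 100 subjects, g_i = 1 and mu_i = 0.01, calling spa_pvalue(5.0, g, mu) gives a p value several orders of magnitude larger than normal_pvalue(5.0, g, mu), which is the kind of tail difference the text attributes to skewed, unbalanced outcomes.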