SAIGE

Plan

Part 1 (SAIGE): Single variant test
Part 2 (SKAT and SKAT-O): Region-based test
Part 3 (SAIGE-GENE and SAIGE-GENE+): Expansion of SKAT/SKAT-O for large-scale data

Understand the SAIGE framework
Understand how SAIGE differs from earlier GWAS methods
Understand the meaning of the title: (1) Efficiently controlling for (2) case-control imbalance and (3) sample relatedness in (4) large-scale genetic association studies
- Understand the techniques used to achieve this
Set the files and parameters used in a real analysis according to the situation

Simple linear or logistic regression: Cannot account for sample relatedness, so related individuals must be excluded and the analysis run on those who remain. The sample size shrinks → the power of the test drops.
Linear mixed model (e.g., BOLT-LMM): Uses a mixed-effects model to adjust for sample relatedness → The model was not designed for binary phenotypes, so analyzing a binary phenotype leads to type 1 error inflation, especially when case-control imbalance is severe.
Generalized linear mixed model (e.g., GMMAT): Replaces the linear mixed model above with a GLMM so that binary phenotypes can be analyzed. Type 1 error inflation still occurs under severe case-control imbalance, because the distribution of the test statistic is then not well approximated by its asymptotic normal distribution. In addition, methods such as GMMAT have $O(MN^2)$ time complexity (for $M$ variants and $N$ samples), which is not fast enough for large-scale biobank data.
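The claim that the test statistic stops looking normal under case-control imbalance can be checked with a small simulation. The sketch below is only an illustration, not SAIGE's or GMMAT's actual procedure: it assumes an intercept-only logistic null model with no covariates or random effects, and the sample size, minor allele frequency, prevalences, and the function name `score_stat_skewness` are all made up for this example. It draws null phenotypes many times and measures the skewness of the score statistic, which should be near 0 if the normal approximation is adequate.

```python
import numpy as np

def score_stat_skewness(prevalence, n_samples=2000, maf=0.01, n_reps=2000, seed=0):
    """Sample skewness of the null score statistic S = sum_i (G_i - mean(G)) * y_i.

    For an intercept-only logistic null model this equals sum_i G_i * (y_i - mean(y)).
    """
    rng = np.random.default_rng(seed)
    # One fixed genotype vector (0/1/2 minor-allele counts), centered.
    g = rng.binomial(2, maf, size=n_samples).astype(float)
    g_centered = g - g.mean()
    # Null phenotypes: no genetic effect, each person is a case with
    # probability `prevalence`, independently of genotype.
    y = (rng.random((n_reps, n_samples)) < prevalence).astype(float)
    s = y @ g_centered  # one score statistic per replicate
    z = (s - s.mean()) / s.std()
    return float(np.mean(z ** 3))

skew_balanced = score_stat_skewness(prevalence=0.5)     # about 1:1 cases to controls
skew_imbalanced = score_stat_skewness(prevalence=0.01)  # about 1:99 cases to controls
print(f"balanced:   skewness = {skew_balanced:.2f}")
print(f"imbalanced: skewness = {skew_imbalanced:.2f}")
```

With a rare variant and few cases, the statistic is driven by a handful of case carriers, so its distribution is strongly right-skewed and p-values from a normal approximation are too small in the tail.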

$$ \text{logit}(\mu_i)= X_i {\alpha} + G_i \beta + b_i $$

$\mu_i$: probability that the $i^{th}$ individual is a case, given the covariates, genotype, and random effect ($b_i$)
${X}_i$: $1 \times p$ row vector of covariates of the $i^{th}$ individual
$\alpha$: $p \times 1$ vector of coefficients of the covariates
$G_i$: genotype of the tested marker (variant / SNP) of the $i^{th}$ individual
$\beta$: coefficient of the genetic effect
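To make the roles of the terms concrete, the model can be simulated directly. This is a minimal sketch under stated assumptions, not SAIGE code: the random effects are assumed to follow $b \sim N(0, \tau\Psi)$ with a toy relatedness matrix $\Psi$ built from sibling pairs, and every numeric value (sample size, `alpha`, `beta`, `tau`, allele frequency) is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 250
n = 2 * n_pairs  # N individuals, arranged as sibling pairs

# Covariates X (intercept, an age-like and a sex-like column) and coefficients alpha.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.integers(0, 2, size=n)])
alpha = np.array([-3.0, 0.5, -0.2])

# Genotype G at the tested marker (minor-allele counts) and its effect beta.
G = rng.binomial(2, 0.2, size=n).astype(float)
beta = 0.3

# Random effect b ~ N(0, tau * Psi): an assumption of this sketch.
# Psi is a toy relatedness matrix: 1 on the diagonal, 0.5 within a sibling pair.
tau = 1.0
Psi = np.kron(np.eye(n_pairs), np.array([[1.0, 0.5], [0.5, 1.0]]))
b = np.linalg.cholesky(tau * Psi) @ rng.standard_normal(n)

# logit(mu_i) = X_i alpha + G_i beta + b_i, then draw case/control status.
eta = X @ alpha + G * beta + b
mu = 1.0 / (1.0 + np.exp(-eta))
y = rng.binomial(1, mu)
print(f"cases: {y.sum()} of {n}")
```

The negative intercept in `alpha` keeps most $\mu_i$ small, so the simulated data has far fewer cases than controls, which is the imbalanced setting these notes are concerned with.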