GWAS

2023-12-21 04:38| 来源: 网络整理| 查看: 265

GWAS入门必看教程：Statistical analysis of genome-wide association (GWAS) data

名词解释和基本问题：

关联分析：就是AS的中文，全称是GWAS。应用基因组中数以百万计的单核苷酸多态；SNP为分子遗传标记，进行全基因组水平上的对照分析或相关性分析，通过比较发现影响复杂性状的基因变异的一种新策略。在全基因组范围内选择遗传变异进行基因分析，比较异常和对照组之间每个遗传变异及其频率的差异，统计分析每个变异与目标性状之间的关联性大小，选出最相关的遗传变异进行验证，并根据验证结果最终确认其与目标性状之间的相关性。

连锁不平衡：LD，P（AB）= P（A）*P（B）。不连锁就独立，如果不存在连锁不平衡——相互独立，随机组合，实际观察到的群体中单倍体基因型 A和B 同时出现的概率。P (AB) = D + P (A) * P (B) 。D是表示两位点间LD程度值。

曼哈顿图：在生物和统计学上，做频率统计、突变分布、GWAS关联分析的时候，我们经常会看到一些非常漂亮的manhattan plot，能够对候选位点的分布和数值一目了然。位点坐标和pvalue。map文件至少包含三列——染色体号，SNP名字，SNP物理位置。assoc文件包含SNP名字和pvalue。haploview即可画出。

SNP的本质属性是什么？广义上讲是变异：most common type of genetic variation，平级的还有indel、CNV、SV。Each SNP represents a difference in a single DNA building block, called a nucleotide. 狭义上讲是标记：biological markers，因为SNP是单碱基的，所以SNP又是一个位点，标记了染色体上的一个位置。大部分人的基因组，99%都是一模一样的，还有些SNP的位点，就是一些可变的位点，在人群中有差异。这些差异/标记可以用于疾病的分析，根据统计学原理，找出与疾病最相关的位点，从而确定某个疾病的risk allele。

SNP array是如何工作的？SNP array测得不是单个碱基，而是allele。所以GWAS的结果是三种：(1 - AA; 2 - AB; 3 - BB)，也可能是0、1、2.

linkage disequilibrium (LD)和 pairwise correlation的区别？

如何鉴定Somatic vs Germline Mutations？In multicellular organisms, mutations can be classed as either somatic or germ-line。必须做通常需要trios或healthy tissue的测序才能确定。最显然的是cancer里大部分都是somatic的variations。

SNP、variant和mutation有什么区别？SNP是中性的，mutation显然和疾病相关；其次就是频率，频率很高的是SNP，mutation则很低。variant和variation是同义词，因此和SNP是等价的。

为什么还需要haplotype？HapMap计划的动机是什么？The HapMap is valuable by reducing the number of SNPs required to examine the entire genome for association with a phenotype from the 10 million SNPs that exist to roughly 500,000 tag SNPs.

common variant和rare variant是根据什么来区别的？paper 怎么理解这里的common和rare？variant就是SNP，”常见的变异“，SNP就是位点，一个位点怎么能说常见和不常见呢？这里是有点反直觉的。这里的common说的是minor allele，就是the second most common allele。比如一个SNP：rs78601809，它的位置可知，在不同人群中的allele frequency可知，总体的MAF是0.39 (T)。一个SNP的MAF 0.05) in this population

Genetic variants that are outside the reach of the most statistically powered association studies [13] are thought to contribute to the missing heritability of many human traits, including common variants (here denoted by minor allele frequency [MAF] >5%) of very weak effect, low-frequency (MAF 1–5%) and rare variants (MAF 0.2 (i.e., second degree relatives) can be detected. 7: Population stratification ‐‐genome Calculates identity by descent (IBD) of all sample pairs. ‐‐cluster ‐‐mds‐plot k Produces a k‐dimensional representation of any substructure in the data, based on IBS.

fine mapping

一个常识就是GWAS是2007年才出现得，所以2017年才出了篇有名的综述ten years of GWAS，fine mapping是GWAS后才出现得。

实验室很早就开始研究fine mapping了：2009 - Fine mapping of the 9q31 Hirschsprung’s disease locus

看一下introduction，什么是fine mapping？

目的很简单：GWAS找到的大多不是causal variants，fine mapping就是就fill这个gap。

GWAS得到大体的SNP后，必须做两方面的深入分析：

第一步就是对SNP给一个概率上的causality，这就是fine-mapping；第二步就是根据功能注释来确定该SNP确实能导致某个基因。

The first is to assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping. The second step is to try to connect these variants to likely genes whose perturbation leads to altered disease risk by functional annotation.

基本原理：

Strategies for fine-mapping complex traits -

Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs

Although eQTLs are increasingly used to provide mechanistic interpretations for human disease associations, the cell type specificity of eQTLs presents a problem. Because the cell type from which a given physiological phenotype arises may not be known, and because eQTL data exist for a limited number of cell types, it is critical to quantify and understand the mechanisms generating cell type specific eQTLs. For example, if a GWAS identifies a set of SNPs associated with risk of type II diabetes, the researcher must choose a target cell type to develop a mechanistic model of the molecular phenotype that causes the gross physiological change. One can imagine that the relevant cell type might be adipose tissue, liver, pancreas, or another hormone-regulating tissue. Furthermore, if the GWAS SNP produces a molecular phenotype (i.e., is an eQTL) in lymphoblastoid cell lines (LCLs), it is not necessarily the case that the SNP will generate a similar molecular phenotype in the cell type of interest. Furthermore, there are many examples of cell types with particular relevance to common diseases, for example dopaminergic neurons and Parkinson's disease, that lack comprehensive eQTL data or catalogs of CREs. The utility of eQTLs for complex trait interpretation will therefore be improved by a more thorough annotation of their cell type specificity.

eQTL最大的问题还是celltype的特异性不够，关键还是要celltype的定义足够精准！

现在GWAS已经属于比较古老的技术了，主要是碰到严重的瓶颈了，单纯的snp与表现的关联已经不够，需要具体的生物学解释，这些snp是如何具体导致疾病的发生的。

而且，大多数病找到的都不是个别显著的snp，大多数都找到了很多的snp，而且snp都落在非编码区了，这就导致对这些snp的解读非常的困难。

经典解读看这篇新英格兰杂志上的文章：FTO Obesity Variant Circuitry and Adipocyte Browning in Humans

GWAS的核心结果就两个，曼哈顿图和QQ-plot，看懂就够了。

单纯会跑GWAS pipeline已经没什么价值了，现在重在下游的分析，有几个热点：

Polygenic risk score (PRS) analyses meta-analysis

The International HapMap Project (http://hapmap. ncbi.nlm.nih.gov/; Gibbs et al., 2003) described the patterns of com- mon SNPs within the human DNA sequence whereas the 1000 Genomes (1KG) project (http://www.1000genomes.org/; Altshuler et al., 2012) provided a map of both common and rare SNPs.

common和rare就是根据allele frequency来界定的，但是似乎没有明确界限。

HapMap用的是array，所有测得都是一些人为挑的点，所以就是common snps；而1000 genomes是WGS，所以包含了所有的点，所以有common和rare一起。

GWAS和核心就是LD，目前大部分的GWAS都是测得array，因为便宜。

GWAS会漏掉很多点，所以才会有fine-mapping，根据haplotype来做一些imputation。

Linkage disequilibrium (LD)连锁不平衡：不同基因座位的各等位基因在人群中以一定的频率出现。在某一群体中，不同座位某两个等位基因出现在同一条染色体上的频率高于预期的随机频率的现象。（就是孟德尔的分离不是随机的，在染色体上越靠近的allele越倾向于绑在一起，属于物质性的限制。）

例如两个相邻的基因A B, 他们各自的等位基因为a b. 假设A B相互独立遗传，则后代群体中观察得到的单倍体基因型 AB 中出现的P(AB)的概率为 P(A) * P（B）．实际观察得到群体中单倍体基因型 AB 同时出现的概率为P(AB)。计算这种不平衡的方法为： D = P（AB）－ P(A) * P（B）．

事实上，可以检测遍布基因组中的大量遗传标记位点snp，或者候选基因附近的遗传标记来寻找到因为与致病位点距离足够近而表现出与疾病相关的位点，这就是等位基因关联分析或连锁不平衡定位基因的基本思想。

待看的paper：Strategies for fine-mapping complex traits

assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping.

还有一些非常重要的概念：

effect size：效应量

power：功效，power analyses

Underestimated Effect Sizes in GWAS: Fundamental Limitations of Single SNP Analysis for Dichotomous Phenotypes

在语境里理解：One explanation of the missing heritability is that complex diseases are caused by a large number of causal variants with small effect sizes.

PRS combines the effect sizes of multiple SNPs into a single aggregated score that can be used to predict disease risk

haplotype phasing单倍体分型

Positions with 00 and 11 are called homozygous positions. Positions with 10 or 01 are called heterozygous positions. We note that the reference genome is neither the paternal nor the maternal genome but the genome of an un-related human (or more precisely the mixture of genomes of a few individuals). An individual’s haplotype is the set of variations in that individual’s chromosomes. We note that as any two human haplotypes are 99.9% similar, the mapping problem can be solved quite easily.

Haplotype phasing is the problem of inferring information about an individual’s haplotype. To solve this problem, there are many methods.

Lecture 10: Haplotype Phasing - Community Recovery

参考：PLINK | File format reference

vcftools

plink的主要功能：数据处理，质量控制的基本统计，群体分层分析，单位点的基本关联分析，家系数据的传递不平衡检验，多点连锁分析，单倍体关联分析，拷贝数变异分析，Meta分析等等。

首先必须了解plink的三种格式：bed、fam和bim。（注意：这里的bed和我们genome里的区域文件bed完全不同）

plink需要的格式一般可以从vcf文件转化而来 (顺便了解一下ped和map两种格式)：

PED: Original standard text format for sample pedigree information and genotype calls. Normally must be accompanied by a .map file. 谱系信息和基因型信息。每一行是一个人。

MAP: Variant information file accompanying a .ped text pedigree + genotype table. 变异信息。每一行是一个变异 | snp。

# PED 1 1 0 0 1 0 G G 2 2 C C 1 2 0 0 1 0 A A 0 0 A C 1 3 1 2 1 2 0 0 1 2 A C 2 1 0 0 1 0 A A 2 2 0 0 2 2 0 0 1 2 A A 2 2 0 0 2 3 1 2 1 2 A A 2 2 A A # MAP 1 snp1 0 1 1 snp2 0 2 1 snp3 0 3 # vcf转ped和map plink --vcf file.vcf --recode --out file # ped和map转bed、bim和fam plink --file test --make-bed --out test

三种格式的官方介绍

bed文件（真实的bed文件是二进制的，比较难读）

bed：Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files. Loaded with --bfile; generated in many situations, most notably when the --make-bed command is used. Do not confuse this with the UCSC Genome Browser's BED format, which is totally different. 基因型信息。所以转换后就是一个matrix，每一行是一个个体，每一列就是一个变异。其中0、1、2分别对应了aa、Aa或aA和AA。不考虑碱基型，因为我们不关注ATGC的变化。

fam：Sample information file accompanying a .bed binary genotype table. 样本信息。每一行就是一个样本。

bim：Extended variant information file accompanying a .bed binary genotype table. 每一行是一个变异，及其注释信息。

rs4970383 rs3748592 rs9442373 rs1571150 rs6687029 2431:NA19916 2 0 0 0 1 2424:NA19835 1 0 1 2 0 2469:NA20282 1 0 1 0 1 2368:NA19703 0 0 0 2 0 2425:NA19901 1 0 1 2 2 OR # xxd -b test.bed 00000000: 01101100 00011011 00000001 11011100 00001111 11100111 l..... 00000006: 00001111 01101011 00000001 .k. First two bytes 01101100 00011011 for PLINK v1.00 BED file Third byte is 00000001 (SNP-major) or 00000000 (individual-major) Genotype data, either in SNP-major or individual-major order New "row" always starts a new byte Each byte encodes up to 4 genotypes 10 indicates missing genotype, otherwise 0 and 1 point to allele 1 or allele 2 in the BIM file, respectively Bits in each byte read in reverse order

fam文件

1 2431 NA19916 0 0 1 2 2424 NA19835 0 0 2 3 2469 NA20282 0 0 2 4 2368 NA19703 0 0 1 5 2425 NA19901 0 0 2 OR 1 1 0 0 1 0 1 2 0 0 1 0 1 3 1 2 1 2 2 1 0 0 1 0 2 2 0 0 1 2 2 3 1 2 1 2

bim文件

1 1 rs4970383 0 828418 A 2 1 rs3748592 0 870101 A 3 1 rs9442373 0 1052501 C 4 1 rs1571150 0 1464167 A 5 1 rs6687029 0 1508931 C OR 1 snp1 0 1 G A 1 snp2 0 2 1 2 1 snp3 0 3 A C

跑跑PLINK工具 plink --bfile --pheno --pheno-name t16 --linear hide-covar --covar --covar-name AGE,SEX,PC1,PC2,PC3,PC4 --ci 0.95 --out --bfile 将snp文件变成二进制格式 --pheno 这里导入我们刚刚处理的性状文件 --pheno-name t16 要处理的性状名字是t16 --linear hide-covar 使用线性模型，hide-covar指的是不要对我没加入的协变量进行分析 --covar --covar-name AGE,SEX,PC1,PC2,PC3,PC4 把我们选取的协变量加入线性回归模型中，我们选的协变量有：AGE，SEX，PC1，PC2，PC3，PC4 --ci 0.95 设置置信区间

SNP过滤问题使用vcftools过滤： 1. MAF

【本文地址】

公司简介

联系我们