Differential gene expression analysis: fold change and the significance of the difference (P-value)

2023-08-20 11:11 | Source: compiled from the web

Key terms: differential gene expression analysis; differentially expressed gene (DEG).

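Differential expression analysis is a widely used way to identify disease-associated miRNAs and genes. Many methods exist, but one of the simplest and most common is the fold change method.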

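Its advantage is that it is simple and intuitive to compute; its drawback is that it ignores the statistical significance of the difference. A 2-fold difference is usually taken as the threshold for calling a gene differentially expressed. Fold change is computed as:

$$\mathrm{FC} = \frac{\bar{x}_{\text{disease}}}{\bar{x}_{\text{normal}}}$$

that is, the mean expression in the disease samples divided by the mean expression in the normal samples.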
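
As a minimal sketch of the calculation described above, the following Python computes fold change as a ratio of group means and applies the 2-fold threshold. The function names, the sample values, and the symmetric treatment of down-regulation (fold change at or below 1/2) are illustrative assumptions, not something the text specifies.

```python
from statistics import mean


def fold_change(disease, normal):
    """Mean expression in disease samples divided by mean in normal samples."""
    return mean(disease) / mean(normal)


def passes_threshold(fc, threshold=2.0):
    """Flag a gene as changed if it is up or down by at least `threshold`-fold.

    Treating fc <= 1/threshold as down-regulated is an assumption made here.
    """
    return fc >= threshold or fc <= 1.0 / threshold


# Hypothetical expression values for one gene (made-up numbers).
disease = [12.0, 15.0, 9.0]
normal = [4.0, 5.0, 3.0]

fc = fold_change(disease, normal)
print(fc, passes_threshold(fc))
```

Note that nothing in this calculation looks at the spread of the samples within each group, which is exactly the limitation the text points out.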

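The goal of differential expression analysis is to identify genes whose expression differs significantly between two conditions; that is, genes whose expression levels in the two conditions still differ with statistical significance after the various sources of bias have been accounted for. Here we use a common method, the t-test, to find differentially expressed miRNAs. Its principle is as follows: for each miRNA, compute a t statistic that measures the difference in expression between disease and normal samples, then use the t distribution to compute a p-value that quantifies the significance of that difference.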
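
The text does not preserve which form of the t statistic was used. As a sketch under that caveat, the code below computes Welch's unequal-variance t statistic and its approximate degrees of freedom; choosing Welch's form, the function name, and the sample values are all assumptions. Turning t and the degrees of freedom into a p-value needs the t distribution's CDF, which is left to a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(disease, normal):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n1, n2 = len(disease), len(normal)
    # Squared standard error contributed by each group (sample variance / n).
    se1 = variance(disease) / n1
    se2 = variance(normal) / n2
    t = (mean(disease) - mean(normal)) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical expression values for one miRNA (made-up numbers).
disease = [5.0, 6.0, 7.0]
normal = [2.0, 3.0, 4.0]

t, df = welch_t(disease, normal)
print(t, df)
```

A large absolute t means the difference in means is large relative to the within-group variability, which is the information plain fold change leaves out.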

Fold change

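Fold change is the ratio between two expression values. Suppose A has an expression value of 1 and B has an expression value of 3; then B's expression is 3 times that of A. Expression is usually measured as counts, TPM, or FPKM, so expression values are never negative, and for non-zero values fold change lies in (0, +∞).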

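Why do lists of differentially expressed genes so often use negative numbers for down-regulation and positive numbers for up-regulation? Because they report log2 fold change.

When expr(A) < expr(B), the fold change of B relative to A is greater than 1 and the log2 fold change is greater than 0; B is up-regulated relative to A.

When expr(A) > expr(B), the fold change of B relative to A is less than 1 and the log2 fold change is less than 0; B is down-regulated relative to A.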

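To avoid undefined or infinite values when an expression value is 0, we usually add 1 (or a very small number) to the expression values before taking log2, giving log2(B+1) - log2(A+1). [This needs a little background on logarithms.]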
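
A minimal sketch of the log2(B+1) - log2(A+1) calculation, with the pseudocount exposed as a parameter. The function name and the input values are illustrative assumptions.

```python
from math import log2


def log2_fold_change(a, b, pseudocount=1.0):
    """log2 fold change of B relative to A, with a pseudocount added to both."""
    return log2(b + pseudocount) - log2(a + pseudocount)


print(log2_fold_change(3, 7))  # B higher than A: positive
print(log2_fold_change(7, 3))  # B lower than A: negative
print(log2_fold_change(0, 0))  # both zero: still defined thanks to the pseudocount
```

Swapping A and B only flips the sign, which is why up- and down-regulation of the same magnitude look symmetric on the log2 scale.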

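Why not use the difference in expression directly? A difference has a sign too, after all.

Suppose A has expression 1, B has expression 8, and C has expression 64. Using the difference, B is up by 7 relative to A, and C is up by 56 relative to B. Using log2 fold change, B is up by 3 relative to A, and C is also up by only 3 relative to B.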
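
The A, B, C example can be checked in a few lines of Python. The helper name `compare` is an assumption introduced for illustration; the values are the ones given in the text.

```python
from math import log2


def compare(low, high):
    """Return (absolute difference, log2 fold change) of `high` relative to `low`."""
    return high - low, log2(high / low)


expr = {"A": 1, "B": 8, "C": 64}  # values from the example above

for low, high in [("A", "B"), ("B", "C")]:
    diff, l2fc = compare(expr[low], expr[high])
    print(f"{high} vs {low}: difference = {diff}, log2 fold change = {l2fc}")
```

Both steps are an 8-fold increase, so the log2 fold change is identical even though the absolute differences are 7 and 56.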

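Sequencing shows that expression levels differ enormously from gene to gene within a cell, so a raw difference is clearly unsuitable; log2 fold change is a better measure of relative change.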

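Although log2 fold change is used almost everywhere, it clearly has drawbacks:

First, which is the bigger change: 5 to 10, or 100 to 120?

Second, a change from 5 to 10 may be caused by technical error alone. When a gene's overall expression is low, log2 fold change becomes less reliable, especially near 0.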
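
The two drawbacks above can be illustrated numerically. The sketch below compares the two changes from the text, then applies the same measurement error to a low-expression and a high-expression value. The error of 3 counts and the helper name `l2fc` are assumptions chosen only for illustration.

```python
from math import log2


def l2fc(a, b):
    """log2 fold change of b relative to a (no pseudocount)."""
    return log2(b / a)


# The two changes from the text: a gain of 5 versus a gain of 20.
print(round(l2fc(5, 10), 3))
print(round(l2fc(100, 120), 3))

# Assumed illustration: the same error of +3 counts applied to a low and a
# high expression level whose true value has not changed.
noise = 3
print(round(l2fc(5, 5 + noise), 3))
print(round(l2fc(100, 100 + noise), 3))
```

The smaller absolute change gets the larger log2 fold change, and the same error moves the low-expression value far more than the high-expression one.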

A disadvantage and serious risk of using fold change in this setting is that it is biased[7] and may misclassify differentially expressed genes with large differences (B − A) but small ratios (B/A), leading to poor identification of changes at high expression levels. Furthermore, when the denominator is close to zero, the ratio is not stable, and the fold change value can be disproportionately affected by measurement noise.

Significance of the difference (P-value)

This is where statistics comes in: significance is calculated by hypothesis testing.

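A hypothesis test starts from a hypothesis. We assume that the expression of A and B does not differ (H0, the null hypothesis). Under that assumption, a t-test (taking RT-PCR data as an example) gives the probability of observing a difference between A and B at least as large as the one we measured; that probability is the P-value. If the P-value [...]

[...] 1.0) and the top N genes by L2FC for each cluster were retained. Genes with L2FC < 0 or adjusted p-value >= 0.10 were grayed out. The number of top genes shown per cluster, N, is set to limit the number of table entries shown to 10000; N=10000/K^2 where K is the number of clusters. N can range from 1 to 50. For the full table, please refer to the "differential_expression.csv" files produced by the pipeline.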
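
The rule for N in the passage above can be sketched as follows. The function name and parameters are hypothetical, and the use of integer (floor) division is an assumption, since the text does not say how fractional values are rounded.

```python
def top_n_per_cluster(k, max_entries=10000, lo=1, hi=50):
    """N = max_entries / K^2, clamped to the range [lo, hi].

    Integer (floor) division is an assumption; the text does not say how
    fractional values are rounded.
    """
    return max(lo, min(hi, max_entries // (k * k)))


for k in (5, 20, 150):
    print(k, top_n_per_cluster(k))
```

With few clusters the upper bound of 50 applies; with very many clusters the lower bound of 1 applies.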

Comparison of DEG identification tools for single-cell data

Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data

For data with a high level of multimodality, methods that consider the behavior of each individual gene, such as DESeq2, EMDomics, Monocle2, DEsingle, and SigEMD, show better TPRs. These tools are highly sensitive: they miss few true DEGs, but they also report many false ones.

If the level of multimodality is low, however, SCDE, MAST, and edgeR can provide higher precision. These tools are highly precise: few of the DEGs they report are false, but they miss many true DEGs.
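
The trade-off between TPR (sensitivity) and precision can be made concrete with a small sketch. The gene identifiers and the two tool outputs below are made up for illustration, and the function name is an assumption.

```python
def tpr_and_precision(called, truth):
    """Return (true positive rate, precision) for a set of called DEGs."""
    true_positives = len(called & truth)
    tpr = true_positives / len(truth)         # fraction of true DEGs recovered
    precision = true_positives / len(called)  # fraction of calls that are true
    return tpr, precision


# Hypothetical gene sets (made-up identifiers).
truth = {"g1", "g2", "g3", "g4"}
sensitive_tool = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
precise_tool = {"g1", "g2"}

print(tpr_and_precision(sensitive_tool, truth))
print(tpr_and_precision(precise_tool, truth))
```

In this toy case the sensitive tool recovers every true DEG but half of its calls are false, while the precise tool makes no false calls but recovers only half of the true DEGs.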

Time-course DEG analysis

Comparative analysis of differential gene expression tools for RNA sequencing time course data

