Comparative Genomics Analysis 2: Gene Family Contraction and Expansion Analysis


2023-09-26 05:23 | Source: compiled from the web


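1: Building a species tree from single-copy genes and estimating divergence times
2: Gene family contraction and expansion analysis
3: Enrichment analysis for specific nodes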



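Sun et al 2022 WGDI

This post covers gene family contraction and expansion analysis. The software used is CAFE5, published in 2020. Compared with CAFE4, it is easier to operate and adds a new model (Gamma).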

1. Installation

git clone https://github.com/hahnlab/CAFE5.git
cd CAFE5
./configure
make


2. Using CAFE5


2.1 Main parameters

--fixed_alpha, -a
    Alpha value of the discrete gamma distribution to use in category calculations. If not specified, the alpha parameter will be estimated by maximum likelihood.
--lambda_per_family, -b
    Estimate lambda by family (for testing purposes only).
--cores, -c
    Number of processing cores to use, requires an integer argument. Default=All available cores.
--error_model, -e
    Run with no file name to estimate the global error model file. This file can be provided in subsequent runs by providing the path to the Error model file with no spaces (e.g. -eBase_error_model.txt).
--Expansion, -E
    Expansion parameter for Nelder-Mead optimizer, Default=2.
--rootdist, -f
    Path to root distribution file for simulating datasets.
--help, -h
    Help menu with a list of all commands.
--infile, -i
    Path to tab delimited gene families file to be analyzed - Required for estimation.
--Iterations, -I
    Maximum number of iterations that will be performed in lambda search. Default=300 (increase this number if likelihood is still improving when limit is hit).
--n_gamma_cats, -k
    Number of gamma categories to use. If specified, the Gamma model will be used to run calculations; otherwise the Base model will be used.
--fixed_lambda, -l
    Value (between 0 and 1) for a single user provided lambda value, otherwise lambda is estimated.
--log_config, -L
    Turn on logging, provide name of the configuration file for logging (see example log.config file).
--fixed_multiple_lambdas, -m
    Multiple lambda values, comma separated, must be used in conjunction with lambda tree (-y).
--output_prefix, -o
    Output directory - Name of directory automatically created for output. Default=results.
--poisson, -p
    Use a Poisson distribution for the root frequency distribution. If no -p flag is given, a uniform distribution will be used. A value can be specified (-p10, or --poisson=10); otherwise the distribution will be estimated from the gene families.
--pvalue, -P
    P-value to use for determining significance of family size change, Default=0.05.
--chisquare_compare, -r
    Chi square compare (not tested).
--Reflection, -R
    Reflection parameter for Nelder-Mead optimizer, Default=1.
--simulate, -s
    Simulate families. Either provide an argument of the number of families to simulate (-s100, or --simulate=100) or provide a rootdist file giving a set of root family sizes to match. Without such a file, the families will be generated with root sizes selected randomly between 0 and 100.
--tree, -t
    Path to file containing newick formatted tree - Required for estimation.
--lambda_tree, -y
    Path to lambda tree, for use with multiple lambdas.
--zero_root, -z
    Include gene families that don't exist at the root, not recommended.

In practice, the parameters you will mostly use are -i, -p, -k, -y and -t.
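To make the commonly used flags concrete, here is a minimal Python sketch that assembles a CAFE5 argument list from them. It only builds and prints the command and does not run anything. The executable name `cafe5`, the tree file name `tree.nwk` and the output directory `results_k2` are assumptions for illustration; the flags themselves are taken from the parameter list above.

```python
from typing import List, Optional


def build_cafe5_command(
    infile: str,
    tree: str,
    n_gamma_cats: Optional[int] = None,
    poisson: bool = False,
    lambda_tree: Optional[str] = None,
    output_prefix: Optional[str] = None,
    binary: str = "cafe5",  # assumption: name or path of the compiled executable
) -> List[str]:
    """Assemble an argument list using the flags described in section 2.1."""
    cmd = [binary, "-i", infile, "-t", tree]
    if n_gamma_cats is not None:
        # Giving -k selects the Gamma model instead of the Base model.
        cmd += ["-k", str(n_gamma_cats)]
    if poisson:
        # Bare -p: Poisson root distribution estimated from the gene families.
        cmd.append("-p")
    if lambda_tree is not None:
        # Lambda tree, for use with multiple lambdas.
        cmd += ["-y", lambda_tree]
    if output_prefix is not None:
        cmd += ["-o", output_prefix]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_cafe5_command(
    "gene_family_filter.txt",
    "tree.nwk",
    n_gamma_cats=2,
    poisson=True,
    output_prefix="results_k2",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` once the input files exist.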

2.2 Preparing the input files

2.2.1. Genefamilies_Count.tsv

A tab-delimited gene family count file. The counts are usually obtained with tools such as OrthoMCL or OrthoFinder. Example format:

Desc     Family ID      human    chimp    orang    baboon   gibbon   macaque  marmoset rat      mouse    cat      horse    cow
ATPase   ORTHOMCL1      52       55       54       57       54       56       56       53       52       57       55       54
(null)   ORTHOMCL2      76       51       41       39       45       36       37       67       79       37       41       49
HMG box  ORTHOMCL3      50       49       48       48       46       49       48       55       52       51       47       55
(null)   ORTHOMCL4      43       43       47       53       44       47       46       59       58       51       50       55
Dynamin  ORTHOMCL5      43       40       43       44       31       46       33       79       70       43       49       50
...
DnaJ     ORTHOMCL10016  45       46       50       46       46       47       46       48       49       45       44       48
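As a quick check that a count file follows this layout, the sketch below reads it into a dictionary keyed by family ID. The function name `read_gene_counts` is hypothetical, and the sample uses only the first three species columns of the example above.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_gene_counts(handle: TextIO) -> Tuple[List[str], Dict[str, Dict[str, int]]]:
    """Parse a tab-delimited count table: Desc, family ID, then one column per species."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    species = header[2:]  # the first two columns are the description and the family ID
    families: Dict[str, Dict[str, int]] = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        family_id = row[1]
        families[family_id] = {sp: int(v) for sp, v in zip(species, row[2:])}
    return species, families


# Truncated copy of the example table (first three species only).
sample = (
    "Desc\tFamily ID\thuman\tchimp\torang\n"
    "ATPase\tORTHOMCL1\t52\t55\t54\n"
    "(null)\tORTHOMCL2\t76\t51\t41\n"
)
species, families = read_gene_counts(io.StringIO(sample))
print(species)
print(families["ORTHOMCL2"])
```

`int()` raises a `ValueError` on any non-numeric cell, which makes stray columns or misplaced headers easy to spot.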


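cp Results_May02/Orthogroups/Orthogroups.GeneCount.tsv CAFE/
# Blank out the last column (the Total column in OrthoFinder's output), prepend a "(null)"
# description column, then rename the first header cell to "Desc".
awk 'OFS="\t" {$NF=""; print}' Orthogroups.GeneCount.tsv > tmp \
  && awk '{print "(null)""\t"$0}' tmp > cafe.input.tsv \
  && sed -i '1s/(null)/Desc/g' cafe.input.tsv \
  && rm tmp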


Desc    Orthogroup  Aof.pro  Ath.pro  Atr.pro  Cba.pro  Cri.pro  Csa.pro  Csu.pro  Kle.pro  Mpo.pro  Nco.pro  Osa.pro  Ppa.pro  Smo.pro  Tpl.pro  Vca.pro  Vvi.pro  Zma.pro
(null)  OG0000000   145      112      95       5        372      129      3        1        2        217      126      16       206      419      4        177      117
(null)  OG0000001   9        4        3        1691     9        96       2        56       2        4        21       0        2        5        3        2        0
(null)  OG0000002   32       117      62       1        92       117      2        0        20       81       119      77       40       193      5        107      161
(null)  OG0000003   37       104      54       3        89       76       4        5        10       74       144      22       47       134      8        79       154
(null)  OG0000004   73       104      51       4        40       80       2        10       12       76       87       33       22       136      5        97       135
(null)  OG0000005   28       46       36       11       37       47       0        3        50       81       81       32       48       120      0        54       73
(null)  OG0000006   41       43       74       6        38       57       0        4        25       57       52       19       33       155      0        87       40
(null)  OG0000007   58       52       60       0        18       42       0        0        12       50       56       17       57       99       1        82       52
(null)  OG0000008   38       57       26       7        52       47       4        6        19       40       59       43       20       29       1        41       80
(null)  OG0000009   46       57       26       1        25       46       1        2        11       52       65       29       13       50       1        48       87
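For readers who prefer Python to the awk/sed one-liner, the conversion can be sketched as follows. The function name `orthofinder_to_cafe` is hypothetical, and the toy input has just two species columns plus a Total column. Unlike the awk version, which blanks the last field and leaves a trailing tab, this sketch drops the column entirely.

```python
from typing import Iterable, Iterator


def orthofinder_to_cafe(lines: Iterable[str]) -> Iterator[str]:
    """Drop the last (Total) column and prepend a description column."""
    for i, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        fields = fields[:-1]  # remove the trailing Total column
        desc = "Desc" if i == 0 else "(null)"  # header row gets "Desc"
        yield "\t".join([desc] + fields)


# Toy input: two species columns plus Total (illustrative values only).
sample = [
    "Orthogroup\tAof.pro\tAth.pro\tTotal\n",
    "OG0000000\t145\t112\t257\n",
]
converted = list(orthofinder_to_cafe(sample))
for row in converted:
    print(row)
```

To convert a real file, pass an open file handle instead of the `sample` list and write each yielded row to the output file.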


python ~/soft/CAFE5/tutorial/clade_and_size_filter.py -i cafe.input.tsv -o gene_family_filter.txt -s
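The idea behind a size filter can be sketched in a few lines of Python. This is not the code of `clade_and_size_filter.py`. It is an illustration that assumes families are set aside when any species has at least `threshold` copies, with 100 as an assumed default that the text above does not confirm. The function name `split_by_size` is hypothetical, and the sample rows are truncated to the first three species columns.

```python
from typing import Iterable, List, Tuple


def split_by_size(
    lines: Iterable[str],
    threshold: int = 100,  # assumption: cutoff for a "large" family
) -> Tuple[str, List[str], List[str]]:
    """Split count rows into small and large families by their maximum count."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    header, body = rows[0], rows[1:]
    small: List[str] = []
    large: List[str] = []
    for row in body:
        # Columns 0 and 1 are the description and the family ID.
        counts = [int(x) for x in row.split("\t")[2:]]
        if max(counts) >= threshold:
            large.append(row)
        else:
            small.append(row)
    return header, small, large


# Rows truncated to three species columns, for illustration only.
sample = [
    "Desc\tOrthogroup\tAof.pro\tAth.pro\tAtr.pro\n",
    "(null)\tOG0000000\t145\t112\t95\n",
    "(null)\tOG0000008\t38\t57\t26\n",
]
header, small, large = split_by_size(sample)
print(len(small), len(large))
```

Keeping the large families in a separate list rather than discarding them lets them be analysed separately later.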


awk 'NR==1 || $3





