基因家族分析（4）：基因家族蛋白质模体鉴定与可视化

2023-03-21 12:39| 来源: 网络整理| 查看: 265

本文主要工作

使用meme鉴定了SBT家族的蛋白质模体组成对meme鉴定结果进行处理并用ggplot2进行可视化 4.蛋白质与基因结构可视化分析 4.1 蛋白质模体预测

Protein Motif这个概念比较混乱，需要在这里特别说明。在生物化学中，一个比较清晰的英文定义是这样给出的：

”Protein motifs are small regions of protein three-dimensional structure or amino acid sequence shared among different proteins. They are recognizable regions of protein structure that may (or may not) be defined by a unique chemical or biological function.” (https://bio.libretexts.org/Bookshelves/Cell_and_Molecular_Biology/Book%3A_Basic_Cell_and_Molecular_Biology_(Bergtrom)/03%3A_Details_of_Protein_Structure/3.06%3A_Protein_Domains_Motifs_and_Folds_in_Protein_Structure))

翻译成中文即为 “蛋白质模体是不同蛋白质中共有的、小区域的蛋白质三维结构或是氨基酸序列。它们是蛋白质结构中可被识别的区域，可能(或可能不)具备生物学功能。”因此，我们可以大致理解为结构特别、可能具备一定功能的小的氨基酸序列，这个适用范围是所有的蛋白质的。而在基因家族分析中，它所适应的范围是我们所鉴定的基因家族内部的所有蛋白质，需要得到的模体短而保守，并可能具备一定的功能，从而借此呈现不同蛋白质间的进化关系。

在明细概念后，我们就可以进行分析了。这里使用meme这款软件在使用conda安装后进行本地化分析，当然它也有网页版的，但是灵活度上讲我个人还是推荐本地分析。在这里的脚本中，我们使用之前鉴定得到的SBT家族蛋白质序列。但是同样为了美观，我们要把蛋白质序列中的序列号“.1”删去。随后使用meme鉴定motif。输出的结果存放地址可以通过参数指定。此外，我在这里还通过seqtk comp生成了一个统计各个蛋白质长度的文件，这是为之后我们可视化蛋白质模体分布服务的。

在meme输出的文件夹中存放了不同类型的文件。其中以.png结尾的即为各个模体的seqlogo图。模体的网页版报告可以通过.html结尾文件查看。而我们需要的信息存放在以.txtx结尾的文件中。由于该文件十分复杂，而我们想要提取想要的信息，需要依靠特殊的脚本解决，这里提供一个思路（python 处理）：

#!/usr/bin/env python import re import argparse parser = argparse.ArgumentParser(description='meme file stat') parser.add_argument('-i', '--input_file', type=str, required=True, metavar='', help='Input txt file the meme outputs, i.e., meme.txt') parser.add_argument('-o', '--output_file', type=str, required=True, metavar='', help='output stat file') args = parser.parse_args() content = [] motif_order = 1 pattern1 = "Motif ([A-Z]+) MEME-[0-9]+ sites sorted by position p-value" pattern2 = "Aco[0-9]+\s+[0-9]+\s+[0-9]+\.[0-9]+e-[0-9]+\s+[A-Z]+\s+[A-Z]+\s+[A-Z]+" Motif_info_dict = {} Motif_sample_dict = {} seq_motif_info_all = [] with open(args.input_file) as f: while True: content_list = f.readline() if not content_list: break else: content_list = content_list.replace("\t", "").replace("\n", "") if re.search("\+s", content_list): continue if re.search(pattern1, content_list): match = re.findall(pattern1, content_list) Motif_name = "Motif" + str(motif_order) motif_order = int(motif_order) + 1 Motif_sample_dict["seq"] = match[0] Motif_sample_dict["length"] = len(match[0]) Motif_info_dict[Motif_name] = Motif_sample_dict if re.search(pattern2, content_list): protein_motif_info_list = re.split("\s+", content_list) seq_name = protein_motif_info_list[0] seq_start = int(protein_motif_info_list[1]) seq_end = int(Motif_sample_dict["length"]) + int(seq_start) - 1 seq_motif = Motif_name seq_motif_seq = str(protein_motif_info_list[4]) seq_motif_info = [seq_name, seq_motif, seq_start, seq_end, seq_motif_seq] seq_motif_info_all.append(seq_motif_info) for i in seq_motif_info_all: output = i[0] + "\t" + i[1] + "\t" + str(i[2]) + "\t" + str(i[3]) + "\t" + i[4] + "\n" with open(args.output_file, "a") as f: f.write(output)

得到处理好的文件后，为了呈现文章中的效果，我们可以用ggplot2进行可视化，结合我之前的文件在这里提供一个思路：

library(data.table) library(tidyverse) # 载入文本 pep_meme % group_by(V1) pep_length

【本文地址】

公司简介

联系我们

今日新闻

推荐新闻

专题文章