删除重复的fasta序列(bash的biopython方法)。 您所在的位置:网站首页 linux去重复 删除重复的fasta序列(bash的biopython方法)。

删除重复的fasta序列(bash的biopython方法)。

#删除重复的fasta序列(bash的biopython方法)。| 来源: 网络整理| 查看: 265

百度翻译此文   有道翻译此文 问题描述

I have a fasta file such as:

>sequence1_CP [seq virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence2 [virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence3 MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence4_CP hypothetical protein [another virus] MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK >sequence5 hypothetical protein [another virus] MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK >sequence6 |hypothetical protein[virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence7 |hypothetical protein[virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

And in this file I would like to remove duplicated sequence and get:

>sequence1_CP [seq virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence4_CP hypothetical protein [another virus] MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK >sequence6 |hypothetical protein[virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

Here as you can see the content after the > name for sequence1_CP, sequence2 and sequence3 is the same, then I want only to keep on of the 3. But if one of the 3 sequences have a _CP in its name, then I want to keep this one especially. If there is none _CP in any of them it does not mater which one I keep.

So for the first duplicates between Sequence1_CP, Sequence2 and Sequence3 I keep sequence1_CP For the second duplicates between sequence4_CP and sequence5 I keep sequence4_CP And for the third duplicates between sequence6 and sequence7 I keep the first one sequence6

Does someone have an idea using biopython or a bash method?

推荐答案

You could use this awk one-liner:

$ awk 'BEGIN{FS="\n";RS=""}{if(!seen[$2,$3]++)print}' file

Output:

>sequence1_CP [seq virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence4_CP hypothetical protein [another virus] MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK >sequence6 |hypothetical protein[virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

Above relies on observation that the sequences are in order where the _CPs come before others like in the sample. If this is not in fact the case, use the following. It stores the first instance of each sequence which is overwritten if a _CP sequence is found:

$ awk 'BEGIN{FS="\n";RS=""}{if(!($2,$3) in seen||$1~/^[^ ]+_CP /)seen[$2,$3]=$0}END{for(i in seen)print (++j>1?ORS:"") seen[i]}' file

Or in pretty-print:

$ awk ' BEGIN { FS="\n" RS="" } { if(!($2,$3) in seen||$1~/^[^ ]+_CP /) seen[$2,$3]=$0 } END { for(i in seen) print (++j>1?ORS:"") seen[i] }' file

The output order is awk default ie. appears random.

Update If @kvantour's BOTH comments are valid in this case, use this awk:

$ awk ' BEGIN { FS="\n" RS="" } { for(i=2;i1?ORS:"") seen[i] }' file

Output now:

>sequence1_CP [seq virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence4_CP hypothetical protein [another virus] MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK 其他推荐答案

In a fasta file, identical sequences are not necessarily split at the same position. So it is paramount to merge the sequences before comparing. Furthermore, sequences can have upper case or lower case, but are in the end case insensitive:

The following awk will do exactly that:

$ awk 'BEGIN{RS="";ORS="\n\n"; FS="\n"} {seq="";for(i=2;i


【本文地址】

公司简介

联系我们

今日新闻

    推荐新闻

    专题文章
      CopyRight 2018-2019 实验室设备网 版权所有