删除重复的fasta序列（bash的biopython方法）。

#删除重复的fasta序列（bash的biopython方法）。| 来源: 网络整理| 查看: 265

百度翻译此文有道翻译此文问题描述

I have a fasta file such as:

>sequence1_CP [seq virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence2 [virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence3 MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence4_CP hypothetical protein [another virus] MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK >sequence5 hypothetical protein [another virus] MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK >sequence6 |hypothetical protein[virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence7 |hypothetical protein[virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

And in this file I would like to remove duplicated sequence and get:

>sequence1_CP [seq virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNL DITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE >sequence4_CP hypothetical protein [another virus] MLRHSCVMPQQKLKKRFFFLRRLRKILRYFFTCNFLNLFFINREYNIENITLSYLKKERIPVWKTSDMSN IVRKWWMFHRKTQLEDNIEIKKDIQLYHFFYNGLFIKTNYPYVYHIDKKKKYDFNDMKVIYLPAIHMHSK >sequence6 |hypothetical protein[virus] MQCKSGTNNVFTAIKYTTNNNIIYKSENNDNIIFTKNIFNVVTTKDAFIFSKNRGIMNLD ITKKFDYHEHRPKLCVFKIINTQYVNSPEKMIDAWPTMDIVALITE

Here as you can see the content after the > name for sequence1_CP, sequence2 and sequence3 is the same, then I want only to keep on of the 3. But if one of the 3 sequences have a _CP in its name, then I want to keep this one especially. If there is none _CP in any of them it does not mater which one I keep.

So for the first duplicates between Sequence1_CP, Sequence2 and Sequence3 I keep sequence1_CP For the second duplicates between sequence4_CP and sequence5 I keep sequence4_CP And for the third duplicates between sequence6 and sequence7 I keep the first one sequence6

Does someone have an idea using biopython or a bash method?

推荐答案

You could use this awk one-liner:

$ awk 'BEGIN{FS="\n";RS=""}{if(!seen[$2,$3]++)print}' file

Output:

Above relies on observation that the sequences are in order where the _CPs come before others like in the sample. If this is not in fact the case, use the following. It stores the first instance of each sequence which is overwritten if a _CP sequence is found:

$ awk 'BEGIN{FS="\n";RS=""}{if(!($2,$3) in seen||$1~/^[^ ]+_CP /)seen[$2,$3]=$0}END{for(i in seen)print (++j>1?ORS:"") seen[i]}' file

Or in pretty-print:

$ awk ' BEGIN { FS="\n" RS="" } { if(!($2,$3) in seen||$1~/^[^ ]+_CP /) seen[$2,$3]=$0 } END { for(i in seen) print (++j>1?ORS:"") seen[i] }' file

The output order is awk default ie. appears random.

Update If @kvantour's BOTH comments are valid in this case, use this awk:

$ awk ' BEGIN { FS="\n" RS="" } { for(i=2;i1?ORS:"") seen[i] }' file

Output now:

In a fasta file, identical sequences are not necessarily split at the same position. So it is paramount to merge the sequences before comparing. Furthermore, sequences can have upper case or lower case, but are in the end case insensitive:

The following awk will do exactly that:

$ awk 'BEGIN{RS="";ORS="\n\n"; FS="\n"} {seq="";for(i=2;i

【本文地址】

公司简介

联系我们