Wednesday, 7 August 2013

Extracting sequences from a fasta file by ID number in header

I have a FASTA file containing multiple sequences, with headers that look like this:
>1016BSA34080.1
MTHSVRIITVTVNFLQHRFFIDYMSEIGLLDGEIEQMVSALQEQVHIVARARTLPEMKNLERDTHVIVKT
LKKQLTAFHSEVKKIADSTQRSRYEGKHQTYEAKVKDLEKELRTQIDPPPKSVSEKHMEDLMGEGGPDGS
GFKTTDQVLRAGIRIQNDA
>1038BSA81955.1
MQQQQARRRMEEPTAAAATASSTTSFAAQPLLSRSVAPQAASSPQASARLAESAGFRSAAVFGSAQAAVG
GRGRGGFGAPPGRGGFGAPPAAGFGAAPAFGAPPTLQAFSAAPAPGGFGAPPAPQGFGAPRAAGFGAPPA
PQAFSAVAPASSTAIPLDVTTYLGDTFGSAPTRGPP
The four-digit number at the start of each header is a unique ID for the
sequence.
Could you help me write a Python script to extract sequences by this
four-digit ID, given a text file with one ID per line?
I tried modifying this script, which I found on this website (Extract
sequences from a FASTA file based on entries in a separate file), to suit
my purpose, but in vain:
f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')
AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1
skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)
f1.close()
f2.close()
f3.close()
I'm new to Python, so any help would be greatly appreciated! Thanks, Divya
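
The script above most likely fails because these headers contain no '|' character, so split('|') returns the whole header line and accessorID ends up as something like "1016BSA34080.1" rather than "1016". A minimal sketch of one possible fix follows. It assumes every header starts with '>' followed immediately by the four-digit ID, as in the two examples shown. The function names read_ids, filter_fasta and extract_by_id are made up for illustration; the file names are the ones used in the script above.

```python
def read_ids(path):
    """Return the set of IDs listed one per line in a text file."""
    with open(path) as handle:
        # strip() removes the newline and stray spaces; blank lines are skipped
        return {line.strip() for line in handle if line.strip()}


def filter_fasta(lines, wanted_ids, id_length=4):
    """Yield header and sequence lines of records whose ID is wanted.

    Assumption: the ID is the first id_length characters after '>'.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, based on the leading ID in the header
            keep = line[1:1 + id_length] in wanted_ids
        if keep:
            yield line


def extract_by_id(fasta_path, ids_path, out_path):
    """Write the records of fasta_path whose IDs appear in ids_path."""
    wanted = read_ids(ids_path)
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        out.writelines(filter_fasta(fasta, wanted))
```

With the files named as in the question, the call would be extract_by_id('fasta.txt', 'accessionids.txt', 'fasta_parsed.txt'). If the IDs are not always exactly four digits, the slice in filter_fasta would need to be replaced with something that reads the leading run of digits instead.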
