FASTA is the most widely used sequence format. It has a very simple structure, a one-line header followed by lines of sequence data:
- The header starts with a ">" symbol.
- The first word on this line is the name of the sequence; the rest of the line is its description.
- The remaining lines contain the sequence itself in IUB/IUPAC single-letter codes.
- Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence.
- FASTA files containing multiple sequences follow the same rules, with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs.
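The rules above can be sketched as a small reader. This is a minimal illustration, not a reference implementation: the function name `parse_fasta`, the `GAP_CHARS` constant, and the two example records are made up for this sketch, and dropping gap symbols follows the rule stated above (some programs keep them instead).

```python
from typing import Iterable, Iterator, Tuple

# Characters dropped from sequence lines: spaces, tabs, and the gap
# symbols named above (dash, underscore, period). Assumption of this
# sketch; alignment tools often preserve gaps.
GAP_CHARS = " \t-_."
_STRIP_TABLE = str.maketrans("", "", GAP_CHARS)


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, description, sequence) for each record in `lines`."""
    name = None
    description = ""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, description, "".join(chunks)
            parts = line[1:].strip().split(None, 1)
            name = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence line: remove spaces and gap symbols.
            chunks.append(line.translate(_STRIP_TABLE))
    if name is not None:
        yield name, description, "".join(chunks)


# Hypothetical two-record input, invented for illustration.
example = """>seq1 first example
ACGT-ACGT

>seq2
TT GG..CC
"""
records = list(parse_fasta(example.splitlines()))
```

Here `records` holds one tuple per sequence, so a file with several sequences needs no special handling. Passing an open file object instead of `example.splitlines()` works the same way, since the function only iterates over lines.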
The description line often depends on the database you downloaded the sequence from:
- GenBank: gi|ginumber|gb|accession|locus
- SwissProt: sp|accession|entry name
- PIR: pir||entr
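As a rough sketch of how these pipe-delimited patterns might be taken apart, the function below splits the first word of a header on "|" and labels the fields. The name `parse_description_line` and the dictionary keys are hypothetical, only the GenBank and SwissProt patterns are handled, and anything else is returned unsplit. The SwissProt header in the usage lines is invented for illustration.

```python
from typing import Dict


def parse_description_line(header: str) -> Dict[str, str]:
    """Split a FASTA header into identifier fields and a description."""
    text = header.lstrip(">").strip()
    # The first word is the identifier; the rest is free-text description.
    ident, _, description = text.partition(" ")
    fields = ident.split("|")
    result = {"description": description.strip()}
    if fields[0] == "gi" and len(fields) >= 4:
        # GenBank: gi|ginumber|gb|accession|locus (locus may be empty)
        result["database"] = "genbank"
        result["gi"] = fields[1]
        result["accession"] = fields[3]
        result["locus"] = fields[4] if len(fields) > 4 else ""
    elif fields[0] == "sp" and len(fields) >= 3:
        # SwissProt: sp|accession|entry name
        result["database"] = "swissprot"
        result["accession"] = fields[1]
        result["entry_name"] = fields[2]
    else:
        # Unrecognised pattern: keep the identifier as a plain name.
        result["database"] = "unknown"
        result["name"] = ident
    return result


genbank = parse_description_line(">gi|726297|gb|AAA64213.1| obesity protein")
# Made-up SwissProt-style header, not a real database entry.
swissprot = parse_description_line(">sp|X00000|EXAMPLE_ENTRY hypothetical protein")
```

For the GenBank header, the trailing "|" leaves the locus field empty, which is why the sketch returns an empty string there rather than failing.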
>gi|726297|gb|AAA64213.1| obesity protein
MCWRPLCRFLWLWSYLSYVQAVPIQKVQDDTKTLIKTIVTRINDISHTQSVSAKQRVTGLDFIPGLHPIL
SLSKMDQTLAVYQQVLTSLPSQNVLQIANDLENLRDLLHLLAFSKSCSLPQTSGLQKPESLDGVLEASLY
STEVVALSRLQGSLQDILQQLDVSPEC