Bielefeld University Center of Biotechnology Institute of Bioinformatics BiBiServ
   
The FASTA Data Format
FASTA is the most widely used sequence format. It has a very simple structure consisting of a one-line header followed by lines of sequence data:
  • The header starts with a ">" symbol.
  • The first word on this line is the name of the sequence; the rest of the line is the description of the sequence.
  • The remaining lines contain the sequence itself in IUB/IUPAC single-letter codes.
  • Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence.
  • FASTA files containing multiple sequences follow the same layout, with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs.
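
The rules above can be sketched as a small reader. This is a minimal illustration in Python, not a reference implementation: the function name `parse_fasta` and the constant `REMOVE_GAPS` are hypothetical, and the set of characters stripped from sequence lines (space, tab, dash, underscore, period) is an assumption taken directly from the list above.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Assumption: the gap symbols to drop are exactly those named in the text,
# plus tabs (treated here as a kind of space).
REMOVE_GAPS = str.maketrans("", "", " \t-_.")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, description, sequence) for each record in the input."""
    name: Optional[str] = None
    description = ""
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                # A new header closes the previous record.
                yield name, description, "".join(chunks)
            parts = line[1:].strip().split(None, 1)
            name = parts[0] if parts else ""  # first word is the name
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence line: remove spaces and gap symbols.
            chunks.append(line.translate(REMOVE_GAPS))
    if name is not None:
        yield name, description, "".join(chunks)


example = """>seq1 first test sequence
ACGT-ACGT

AC GT
>seq2
MKV..LL
"""
records = list(parse_fasta(example.splitlines()))
```

Sequence lines that appear before any header are silently skipped in this sketch; a stricter reader might raise an error instead.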

The description line often depends on the database you downloaded the sequence from:

  • GenBank: gi|ginumber|gb|accession|locus
  • SwissProt: sp|accession|entry name
  • PIR: pir||entr

This is an example of a FASTA-formatted file downloaded from GenBank:
>gi|726297|gb|AAA64213.1| obesity protein 
MCWRPLCRFLWLWSYLSYVQAVPIQKVQDDTKTLIKTIVTRINDISHTQSVSAKQRVTGLDFIPGLHPIL
SLSKMDQTLAVYQQVLTSLPSQNVLQIANDLENLRDLLHLLAFSKSCSLPQTSGLQKPESLDGVLEASLY
STEVVALSRLQGSLQDILQQLDVSPEC
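
The header of this example can be taken apart along the GenBank pattern gi|ginumber|gb|accession|locus given earlier. The following Python sketch is only an illustration: the helper names `split_header` and `parse_genbank_name` are hypothetical, and it assumes the fields always appear in that order, separated by "|". An empty locus field, as in the example, is returned as an empty string.

```python
from typing import Dict, Tuple


def split_header(header_line: str) -> Tuple[str, str]:
    """Split a FASTA header into (name, description)."""
    if not header_line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = header_line[1:].strip().split(None, 1)
    name = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return name, description


def parse_genbank_name(name: str) -> Dict[str, str]:
    """Split a name of the form gi|ginumber|gb|accession|locus."""
    fields = name.split("|")
    # Assumption: only the GenBank layout described in the text is handled.
    if len(fields) < 4 or fields[0] != "gi" or fields[2] != "gb":
        raise ValueError("not a gi|...|gb|... style name: " + name)
    return {
        "gi": fields[1],
        "accession": fields[3],
        "locus": fields[4] if len(fields) > 4 else "",
    }


name, description = split_header(">gi|726297|gb|AAA64213.1| obesity protein ")
ids = parse_genbank_name(name)
print(name)         # gi|726297|gb|AAA64213.1|
print(description)  # obesity protein
print(ids)          # {'gi': '726297', 'accession': 'AAA64213.1', 'locus': ''}
```

SwissProt and PIR names use different field layouts, so they would need their own small parsers rather than this one.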