Sequence Analysis with Distributed Resources - Data Formats

		The FASTA Data Format
		FASTA is the most widely used sequence format. It has a very simple structure: a one-line header followed by lines of sequence data. The header starts with a ">" symbol. The first word on this line is the name of the sequence; the rest of the line is the description of the sequence. The remaining lines contain the sequence itself in IUB/IUPAC single-letter codes. Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence. FASTA files containing multiple sequences follow the same layout, with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs.

		The layout of the header line often depends on the database the sequence was downloaded from:
		GenBank: gi|ginumber|gb|accession|locus
		SwissProt: sp|accession|entry name
		PIR: pir||entr
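The header conventions described above can be sketched in a few lines of Python. This is only an illustration: the function names `parse_header` and `split_genbank_name` are hypothetical, and the GenBank field layout is assumed to be exactly the gi|ginumber|gb|accession|locus pattern given in the text.

```python
def parse_header(line):
    """Split a FASTA header line into (name, description).

    The name is the first word after '>'; the description is the rest.
    """
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    parts = line[1:].strip().split(None, 1)
    name = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return name, description


def split_genbank_name(name):
    """Split a GenBank-style name (assumed gi|ginumber|gb|accession|locus)."""
    fields = name.split("|")
    if len(fields) < 4 or fields[0] != "gi" or fields[2] != "gb":
        raise ValueError("not a gi|...|gb|... style name: " + name)
    # The locus field may be empty or missing altogether.
    locus = fields[4] if len(fields) > 4 else ""
    return {"gi": fields[1], "accession": fields[3], "locus": locus}


name, description = parse_header(">gi|726297|gb|AAA64213.1| obesity protein")
print(name)
print(description)
print(split_genbank_name(name))
```

Headers from other databases (SwissProt, PIR) use different prefixes, so a real tool would need one such helper per convention or a more general dispatcher.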
This is an example of a FASTA-formatted file downloaded from GenBank:

>gi|726297|gb|AAA64213.1| obesity protein
MCWRPLCRFLWLWSYLSYVQAVPIQKVQDDTKTLIKTIVTRINDISHTQSVSAKQRVTGLDFIPGLHPIL
SLSKMDQTLAVYQQVLTSLPSQNVLQIANDLENLRDLLHLLAFSKSCSLPQTSGLQKPESLDGVLEASLY
STEVVALSRLQGSLQDILQQLDVSPEC
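A minimal reader for files like the one above might look as follows. It is a sketch under the rules stated in this section (blank lines ignored; spaces, dashes, underscores and periods dropped from the sequence; records listed one after another), not a replacement for an established parser. The names `read_fasta` and `GAP_CHARS` are made up for this example.

```python
# Characters dropped from sequence lines, following the description above.
GAP_CHARS = " \t-_."
_STRIP_GAPS = str.maketrans("", "", GAP_CHARS)


def read_fasta(lines):
    """Yield (name, description, sequence) for each record in `lines`.

    `lines` is any iterable of strings, e.g. an open file or a list.
    """
    name, description, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            if name is not None:
                # A new header closes the previous record.
                yield name, description, "".join(chunks)
            parts = line[1:].split(None, 1)
            name = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line.translate(_STRIP_GAPS))
    if name is not None:
        yield name, description, "".join(chunks)


EXAMPLE = """\
>gi|726297|gb|AAA64213.1| obesity protein
MCWRPLCRFLWLWSYLSYVQAVPIQKVQDDTKTLIKTIVTRINDISHTQSVSAKQRVTGLDFIPGLHPIL
SLSKMDQTLAVYQQVLTSLPSQNVLQIANDLENLRDLLHLLAFSKSCSLPQTSGLQKPESLDGVLEASLY
STEVVALSRLQGSLQDILQQLDVSPEC
"""

for name, description, sequence in read_fasta(EXAMPLE.splitlines()):
    print(name, description, len(sequence))
```

Writing the reader as a generator keeps memory use low for files with many sequences, since only one record is held at a time. Sequence lines that appear before any header are silently skipped here; a stricter reader might raise an error instead.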