v2026.1714 entries · CC-BY 4.0
CASRAI

Editorial · CASRAI · Research data infrastructure

Amino Acids: Notation, Protein Data and How Sequences Are Recorded

Amino acids are the chemical building blocks of proteins. This guide explains the 20 standard amino acids, their one- and three-letter notation, and how protein sequence and structure data are recorded and shared through UniProt and the Protein Data Bank.

By CASRAI Editorial Board
Published 18 Jun 2026 · 4 minute read

Amino acids are small organic molecules that join together in chains to build proteins, and the 20 standard amino acids form the common alphabet used to write and share protein sequence data. Each amino acid carries a standard one-letter and three-letter abbreviation, giving researchers an unambiguous notation for recording sequences in databases, publications and data-exchange formats.

From a data-infrastructure perspective, amino acids matter less as chemistry and more as a controlled vocabulary: a fixed set of symbols that lets sequence information move reliably between laboratories, repositories and software tools without loss of meaning.

The 20 standard amino acids and their notation

Proteins are built from 20 standard amino acids, each of which has a residue name, a three-letter code and a single-letter code. The single-letter codes are the backbone of compact sequence notation, allowing a protein of several hundred residues to be written as one continuous string of letters.

| Amino acid | Three-letter | One-letter |
|------------|--------------|------------|
| Alanine    | Ala          | A          |
| Glycine    | Gly          | G          |
| Leucine    | Leu          | L          |
| Serine     | Ser          | S          |
| Tryptophan | Trp          | W          |
| Tyrosine   | Tyr          | Y          |

The table above shows a selection. The full set of 20 comprises alanine, arginine, asparagine, aspartate, cysteine, glutamate, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine and valine. Standardised codes mean that a sequence recorded in one system is read identically in another, which is the foundation of interoperable protein data. Consistent notation of this kind is exactly the type of controlled term documented in the CASRAI dictionary.
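As a minimal sketch of how this controlled vocabulary might be held in software, the lookup below maps the three-letter codes of the 20 standard amino acids to their one-letter codes. The function names `to_one_letter` and `to_three_letter` are illustrative choices, not part of any particular library.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Reverse lookup, built from the table above so the two cannot drift apart.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(residues):
    """Join a list of three-letter codes into a one-letter sequence string."""
    # capitalize() accepts "ALA", "ala" or "Ala"; unknown codes raise KeyError.
    return "".join(THREE_TO_ONE[code.capitalize()] for code in residues)


def to_three_letter(sequence, sep="-"):
    """Expand a one-letter sequence string into separated three-letter codes."""
    return sep.join(ONE_TO_THREE[letter] for letter in sequence.upper())


print(to_one_letter(["Ala", "Gly", "Trp"]))  # AGW
print(to_three_letter("AGW"))                # Ala-Gly-Trp
```

Raising `KeyError` on an unrecognised code is a deliberate choice in this sketch: a controlled vocabulary is only useful if out-of-vocabulary symbols are surfaced rather than silently dropped.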

From sequence to structure: how protein data is recorded

Protein data exists at two complementary levels. Sequence data describes the linear order of amino-acid residues, while structure data describes the three-dimensional arrangement of atoms once the chain has folded. Both layers need stable identifiers and agreed formats so that records remain findable and reusable over time.

Sequence records are commonly written in FASTA format, a plain-text convention in which a header line beginning with ">" carries an identifier and the following lines hold the one-letter residue string. Structure records use atomic-coordinate formats, such as the legacy PDB file format and its successor PDBx/mmCIF, which capture the position of each atom rather than only the residue order.
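The FASTA convention is simple enough to read with a few lines of code. The sketch below assumes well-formed input and uses made-up headers and sequences purely for illustration; the function name `parse_fasta` is hypothetical, and production work would normally rely on an established parsing library.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is the header.
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them for joining later.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical records, invented for this example only.
example = """>example_1 hypothetical demo record
AGLSW
YAG
>example_2 another hypothetical record
GGSL
"""

for name, sequence in parse_fasta(example).items():
    print(name, sequence)
```

Joining the wrapped lines matters because FASTA files often break long sequences across several lines, and the line breaks carry no meaning.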

UniProt and the Protein Data Bank

Two long-standing resources anchor protein data sharing. UniProt is a comprehensive, curated repository of protein sequence and functional information, assigning persistent accession identifiers to protein entries. The Protein Data Bank (PDB) is the established archive for experimentally determined three-dimensional structures of proteins and other biological macromolecules.

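| Resource                | Primary content                             | Identifier role                      |
|-------------------------|---------------------------------------------|--------------------------------------|
| UniProt                 | Protein sequences and functional annotation | Stable accession per protein entry   |
| Protein Data Bank (PDB) | 3D structural coordinates                   | Stable entry identifier per structure |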

Both resources illustrate good practice for research data infrastructure: persistent identifiers, structured metadata and open access to underlying records. Linking a sequence accession to a structure entry creates a navigable web of evidence, much as persistent identifiers connect outputs across the wider scholarly record described in our overview of persistent identifiers in 2026.
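To make the linking idea concrete, the sketch below checks identifier formats and stores a sequence accession alongside its structure entries. It rests on two assumptions: the accession pattern follows the regular expression published in UniProt's help pages, and the PDB pattern covers only the classic four-character form (a digit from 1 to 9 followed by three alphanumerics). The class `ProteinCrossReference` is hypothetical, and the identifiers in the usage lines illustrate format only; the pairing is not a claim about the databases' actual cross-references.

```python
import re
from dataclasses import dataclass, field

# Assumption: accession pattern as published in UniProt's help documentation.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumption: classic four-character PDB identifiers only.
PDB_ID = re.compile(r"[1-9][A-Z0-9]{3}")


def is_uniprot_accession(value):
    """True if the string has the shape of a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(value) is not None


def is_pdb_id(value):
    """True if the string has the shape of a classic PDB entry identifier."""
    return PDB_ID.fullmatch(value.upper()) is not None


@dataclass
class ProteinCrossReference:
    """Hypothetical record linking one sequence accession to structure entries."""
    accession: str
    pdb_ids: list = field(default_factory=list)

    def add_structure(self, pdb_id):
        if not is_pdb_id(pdb_id):
            raise ValueError(f"not a PDB-style identifier: {pdb_id!r}")
        self.pdb_ids.append(pdb_id.upper())  # store in a single canonical case


# Identifiers used for their format only.
xref = ProteinCrossReference("P69905")
xref.add_structure("1a3n")
print(xref)
```

A format check of this kind confirms only that an identifier is well formed, not that the entry exists; resolving it against the repository is a separate step.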

Why standard notation supports reproducibility

Because the amino-acid alphabet is fixed and the abbreviations are standardised, protein data aligns naturally with the FAIR principles: findable, accessible, interoperable and reusable. A sequence written in standard one-letter notation can be searched, aligned and compared across repositories without manual reconciliation, and a structure deposited with rich metadata can be revisited by independent researchers. This connects protein data to the broader agenda covered in our reproducibility news, and to the related question of how genomic data is shared responsibly, discussed in genomic data-sharing standards explained. For practical guidance on citing and describing such resources, see our guidance for authors.
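One way to see why a fixed alphabet removes manual reconciliation is to check a sequence against it before comparison. The sketch below normalises whitespace and case, then reports any symbol outside the 20 standard one-letter codes. The names `normalise` and `non_standard_positions` are illustrative; real records may also contain ambiguity or non-standard symbols such as X, B, Z, U or O, which this check would simply flag rather than interpret.

```python
# One-letter codes of the 20 standard amino acids.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def normalise(sequence):
    """Remove all whitespace and upper-case the sequence."""
    return "".join(sequence.split()).upper()


def non_standard_positions(sequence):
    """Return (1-based position, symbol) pairs for symbols outside the alphabet."""
    cleaned = normalise(sequence)
    return [
        (position, symbol)
        for position, symbol in enumerate(cleaned, start=1)
        if symbol not in STANDARD_RESIDUES
    ]


print(normalise("ag l\nsw"))            # AGLSW
print(non_standard_positions("AGXLB"))  # [(3, 'X'), (5, 'B')]
```

Two records that normalise to the same string can be treated as the same sequence regardless of how each system wrapped or cased it.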

Frequently asked questions

How many standard amino acids are there?

There are 20 standard amino acids that serve as the common building blocks of proteins. Each has an agreed three-letter and one-letter abbreviation, forming a fixed alphabet for recording and sharing sequence data.

What is the difference between one-letter and three-letter amino-acid codes?

Three-letter codes such as Ala or Gly are readable abbreviations often used in text and structural records, while one-letter codes such as A or G create compact sequence strings ideal for databases and alignment software. Both refer to the same residues.

What do UniProt and the PDB store?

UniProt stores curated protein sequence and functional information with stable accession identifiers, while the Protein Data Bank stores experimentally determined three-dimensional structures with their own persistent entry identifiers. Together they cover the sequence and structure layers of protein data.

How do amino-acid standards support FAIR data?

A fixed notation and well-described repository records make protein data findable, accessible, interoperable and reusable. Standard codes remove ambiguity, so sequences and structures can be exchanged and compared across systems without loss of meaning.
