Editorial · CASRAI · Research data infrastructure

Amino Acids: Notation, Protein Data and How Sequences Are Recorded

Amino acids are the chemical building blocks of proteins. This guide explains the 20 standard amino acids, their one- and three-letter notation, and how protein sequence and structure data are recorded and shared through UniProt and the Protein Data Bank.

By CASRAI Editorial Board

Published 18 Jun 2026 · 4-minute read

Amino acids are small organic molecules that join together in chains to build proteins, and the 20 standard amino acids form the common alphabet used to write and share protein sequence data. Each amino acid carries a standard one-letter and three-letter abbreviation, giving researchers an unambiguous notation for recording sequences in databases, publications and data-exchange formats.

From a data-infrastructure perspective, amino acids matter less as chemistry and more as a controlled vocabulary: a fixed set of symbols that lets sequence information move reliably between laboratories, repositories and software tools without loss of meaning.

The 20 standard amino acids and their notation

Proteins are built from 20 standard amino acids, each of which has a residue name, a three-letter code and a single-letter code. The single-letter codes are the backbone of compact sequence notation, allowing a protein of several hundred residues to be written as one continuous string of letters.

Amino acid	Three-letter	One-letter
Alanine	Ala	A
Glycine	Gly	G
Leucine	Leu	L
Serine	Ser	S
Tryptophan	Trp	W
Tyrosine	Tyr	Y

The table above shows six examples. The full set of 20 comprises alanine, arginine, asparagine, aspartate, cysteine, glutamate, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine and valine. Standardised codes mean that a sequence recorded in one system is read identically in another, which is the foundation of interoperable protein data. Consistent notation of this kind is exactly the type of controlled term documented in the CASRAI dictionary.
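To make the idea of a controlled vocabulary concrete, the following is a minimal sketch of the three-letter to one-letter mapping as a lookup table. The codes are the standard abbreviations; the names `THREE_TO_ONE` and `to_one_letter` are our own illustrative choices and do not come from any particular library.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert an iterable of three-letter codes into a one-letter string.

    Input case is normalised ("ALA" and "ala" are both read as "Ala").
    Anything outside the 20 standard codes raises ValueError rather than
    being silently guessed.
    """
    letters = []
    for code in residues:
        key = code.strip().capitalize()
        if key not in THREE_TO_ONE:
            raise ValueError(f"Not a standard three-letter code: {code!r}")
        letters.append(THREE_TO_ONE[key])
    return "".join(letters)


if __name__ == "__main__":
    # The six residues from the table above, in table order.
    print(to_one_letter(["Ala", "Gly", "Leu", "Ser", "Trp", "Tyr"]))  # AGLSWY
```

Rejecting unknown codes, rather than substituting a placeholder, keeps the conversion lossless: every output letter corresponds to exactly one standard residue.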

From sequence to structure: how protein data is recorded

Protein data exists at two complementary levels. Sequence data describes the linear order of amino-acid residues, while structure data describes the three-dimensional arrangement of atoms once the chain has folded. Both layers need stable identifiers and agreed formats so that records remain findable and reusable over time.

Sequence records are commonly written in FASTA format, a plain-text convention in which a header line beginning with ">" carries an identifier and the following lines hold the one-letter residue string. Structure records use atomic-coordinate formats, such as the legacy PDB file format and its successor PDBx/mmCIF, which capture the position of each atom rather than only the residue order.
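The FASTA convention is simple enough to read with a few lines of code. The sketch below assumes the common layout in which the header starts with ">" and the first whitespace-delimited token is the identifier. The function names and the example record are hypothetical, and production work would normally rely on an established parsing library instead.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence).

    A header line starts with ">"; every following non-blank line up to
    the next header is treated as part of that record's sequence.
    """
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append(_finish(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before the first header")
    if header is not None:
        records.append(_finish(header, chunks))
    return records


def _finish(header, chunks):
    """Split a header into identifier and description, and join the sequence lines."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description, "".join(chunks).upper()


if __name__ == "__main__":
    # Hypothetical record: the identifier and sequence are made up.
    example = ">example_1 hypothetical peptide\nAGLSWY\nAGLS\n"
    for identifier, description, sequence in parse_fasta(example):
        print(identifier, description, sequence)
```

Sequence lines are concatenated because FASTA files usually wrap long sequences across several lines; the line breaks carry no meaning.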

UniProt and the Protein Data Bank

Two long-standing resources anchor protein data sharing. UniProt is a comprehensive, curated repository of protein sequence and functional information, assigning persistent accession identifiers to protein entries. The Protein Data Bank (PDB) is the established archive for experimentally determined three-dimensional structures of proteins and other biological macromolecules.

Resource	Primary content	Identifier role
UniProt	Protein sequences and functional annotation	Stable accession per protein entry
Protein Data Bank (PDB)	3D structural coordinates	Stable entry identifier per structure

Both resources illustrate good practice for research data infrastructure: persistent identifiers, structured metadata and open access to underlying records. Linking a sequence accession to a structure entry creates a navigable web of evidence, much as persistent identifiers connect outputs across the wider scholarly record described in our overview of persistent identifiers in 2026.
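As an illustration of how such links can be followed, here is a minimal sketch of cross-referenced records held in memory. The class names, field names and identifiers are hypothetical placeholders, not the actual UniProt or PDB data models or real accessions.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """A sequence entry that cross-references zero or more structure entries."""
    accession: str
    sequence: str
    structure_ids: list = field(default_factory=list)


@dataclass
class StructureRecord:
    """A structure entry, reduced here to its identifier and a free-text title."""
    entry_id: str
    title: str


def structures_for(accession, sequences, structures):
    """Follow cross-references from a sequence accession to structure records.

    Cross-references that point at entries missing from `structures`
    are skipped rather than raising.
    """
    record = sequences[accession]
    return [structures[sid] for sid in record.structure_ids if sid in structures]


if __name__ == "__main__":
    # Placeholder identifiers, chosen so they cannot be mistaken for real ones.
    sequences = {
        "SEQ-EXAMPLE-1": SequenceRecord(
            "SEQ-EXAMPLE-1", "AGLSWY", ["STRUCT-EXAMPLE-1", "STRUCT-MISSING"]
        ),
    }
    structures = {
        "STRUCT-EXAMPLE-1": StructureRecord("STRUCT-EXAMPLE-1", "hypothetical structure"),
    }
    for hit in structures_for("SEQ-EXAMPLE-1", sequences, structures):
        print(hit.entry_id, hit.title)
```

The link only works because both sides use stable identifiers: if either identifier changed between releases, the lookup would silently return nothing.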

Why standard notation supports reproducibility

Because the amino-acid alphabet is fixed and the abbreviations are standardised, protein data aligns naturally with the FAIR principles: findable, accessible, interoperable and reusable. A sequence written in standard one-letter notation can be searched, aligned and compared across repositories without manual reconciliation, and a structure deposited with rich metadata can be revisited by independent researchers. This connects protein data to the broader agenda covered in our reproducibility news, and to the related question of how genomic data is shared responsibly, covered in genomic data-sharing standards explained. For practical guidance on citing and describing such resources, see our guidance for authors.
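A fixed alphabet also makes sequences machine-checkable before they are deposited or exchanged. The sketch below flags any symbol outside the 20 standard one-letter codes. The function name is our own, and the strictness is an assumption: real repositories also accept a small number of additional symbols, for example X for an unknown residue, so a production check would need a wider alphabet.

```python
# One-letter codes for the 20 standard amino acids.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def nonstandard_positions(sequence):
    """Return (1-based position, symbol) pairs for every symbol that is
    not one of the 20 standard one-letter codes. Input is upper-cased first."""
    return [
        (position, symbol)
        for position, symbol in enumerate(sequence.upper(), start=1)
        if symbol not in STANDARD_RESIDUES
    ]


if __name__ == "__main__":
    print(nonstandard_positions("AGLSWY"))   # []
    print(nonstandard_positions("AGLXSWB"))  # [(4, 'X'), (7, 'B')]
```

Reporting positions, rather than returning a bare true or false, lets a curator see exactly where a record departs from the standard alphabet.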

Frequently asked questions

How many standard amino acids are there?

There are 20 standard amino acids that serve as the common building blocks of proteins. Each has an agreed three-letter and one-letter abbreviation, forming a fixed alphabet for recording and sharing sequence data.

What is the difference between one-letter and three-letter amino-acid codes?

Three-letter codes such as Ala or Gly are readable abbreviations often used in text and structural records, while one-letter codes such as A or G create compact sequence strings ideal for databases and alignment software. Both refer to the same residues.

What do UniProt and the PDB store?

UniProt stores curated protein sequence and functional information with stable accession identifiers, while the Protein Data Bank stores experimentally determined three-dimensional structures with their own persistent entry identifiers. Together they cover the sequence and structure layers of protein data.

How do amino-acid standards support FAIR data?

A fixed notation and well-described repository records make protein data findable, accessible, interoperable and reusable. Standard codes remove ambiguity, so sequences and structures can be exchanged and compared across systems without loss of meaning.
