Introduction

The Protein Data Bank (PDB) is an archive of experimentally determined three-dimensional structures of biological macromolecules that serves a global community of researchers, educators, and students. The data contained in the archive include atomic coordinates, crystallographic structure factors and NMR experimental data. Aside from coordinates, each deposition also includes the names of molecules, primary and secondary structure information, sequence database references where appropriate, ligand and biological assembly information, details about data collection and structure solution, and bibliographic citations.

This comprehensive guide describes the "PDB format" used by the members of the worldwide Protein Data Bank (wwPDB; Berman, H.M., Henrick, K. and Nakamura, H. Announcing the worldwide Protein Data Bank. Nat Struct Biol 10, 980 (2003)). Questions should be sent to [email protected]

Information about file formats and data dictionaries can be found at http://wwpdb.org.

Version History:

Version 2.3: The format in which structures were released from 1998 to July 2007.

Version 3.0: Major update from Version 2.3; incorporates all of the revisions used by the wwPDB to integrate uniformity and remediation data into a single set of archival data files including IUPAC nomenclature. See http://www.wwpdb.org/docs.html for more details.

Version 3.1: Minor addenda to Version 3.0, introducing a small number of changes and extensions supporting the annotation practices adopted by the wwPDB beginning in August 2007, including chain ID standardization and biological assembly.

Version 3.15: Minor addenda to Version 3.1, introducing a small number of changes and extensions supporting the annotation practices adopted by the wwPDB beginning in October 2008, including DBREF, taxonomy and citation information.

Version 3.20: Current version, minor addenda to Version 3.1, introducing a small number of changes and extensions supporting the annotation practices adopted by the wwPDB beginning in December 2008 including DBREF, taxonomy and citation information.
September 15, 2008, initial version 3.20.
November 15, 2008, add examples for Refmac template and coordinate with alternate conformation.
December 24, 2008, update REMARK 3 templates/examples, add Norine database in DBREF, update REMARK 500 on chiral center.
February 12, 2009, update example in REMARK 210 and record format in NUMMDL.
July 6, 2009, update description for REVDAT, DBREF2, MASTER and extend number of columns for AUTHOR, JRNL, CAVEAT, KEYWDS, etc.
December 22, 2009, update CAVEAT and REMARK 265.
April 21, 2010, update REMARK 5 and add BUSTER-TNT template in REMARK 3.
December 06, 2010, update maximum number of atoms for model. Update REMARK 3 with B value type for Refmac template.
March 30, 2011, correct description and examples for FORMUL and CONECT records. Change template in REMARK 630.

Basic Notions of the Format Description

Character Set

Only non-control ASCII characters, as well as the space and end-of-line indicator, appear in a PDB coordinate entry file. Namely:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
` - = [ ] \ ; ' , . / ~ ! @ # $ %  ^ & * ( ) _ + { } | : " < > ?

The use of punctuation characters in the place of alphanumeric characters is discouraged.
The space and the end-of-line indicator are also permitted. The end-of-line indicator is a system-specific character sequence; some systems may use a carriage return followed by a line feed, others only a line-feed character.
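The character rule above can be checked mechanically. The following is a minimal sketch, assuming that the permitted set corresponds to ASCII codes 32 through 126 (the letters, digits and punctuation listed above, plus the space); the function name invalid_characters is hypothetical and not part of the format.

```python
def invalid_characters(line):
    """Return the characters in a line that fall outside the permitted set."""
    # The end-of-line indicator (CR LF or LF) is allowed, so strip it first.
    body = line.rstrip("\r\n")
    # Assumption: the permitted characters are ASCII codes 32 (space) to 126 (~).
    return [ch for ch in body if not 32 <= ord(ch) <= 126]


if __name__ == "__main__":
    print(invalid_characters("HEADER    HYDROLASE\r\n"))
    print(invalid_characters("TITLE     \u03b1-HELIX\tTEST"))
```

The second call reports the Greek letter and the tab character, which illustrates why Greek letters are spelled out in entries.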

Special Characters

Greek letters are spelled out, e.g., alpha, beta, gamma.
Bullets are represented as (DOT).
Right arrow is represented as -->.
Left arrow is represented as <--.
If "=" is surrounded by at least one space on each side, then it is assumed to be an equal sign, e.g., 2 + 4 = 6.

Commas, colons, and semi-colons are used as list delimiters in records that have one of the following data types:

List
SList
Specification List
Specification

If a comma, colon, or semi-colon is used in any context other than as a delimiting character, then the character must be escaped, i.e., immediately preceded by a backslash, "\".

Example - Use of “\” character:

COMPND         MOL_ID: 1;
COMPND      2  MOLECULE: GLUTATHIONE SYNTHETASE;
COMPND      3  CHAIN:  A;
COMPND      4  SYNONYM: GAMMA-L-GLUTAMYL-L-CYSTEINE\:GLYCINE LIGASE
COMPND      5  (ADP-FORMING);
COMPND      6  EC:  6.3.2.3;
COMPND      7  ENGINEERED: YES

COMPND         MOL_ID: 1;
COMPND      2  MOLECULE: S-ADENOSYLMETHIONINE SYNTHETASE;
COMPND      3  CHAIN:  A, B;
COMPND      4  SYNONYM: MAT, ATP\:L-METHIONINE S-ADENOSYLTRANSFERASE;
COMPND      5  EC:  2.5.1.6;
COMPND      6  ENGINEERED: YES;
COMPND      7  BIOLOGICAL_UNIT: TETRAMER;
COMPND      8  OTHER_DETAILS: TETRAGONAL MODIFICATION
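
The escaping rule can be sketched in code. The example below is an illustration only, assuming the continued COMPND lines have already been joined into one string; the helper names split_unescaped, unescape and parse_specification_list are hypothetical. It splits a Specification List on semi-colons that are not preceded by a backslash, splits each Specification at its first unescaped colon, and then removes the backslashes.

```python
import re


def split_unescaped(text, delimiter, maxsplit=0):
    """Split text on a delimiter that is not immediately preceded by a backslash."""
    return re.split(r"(?<!\\)" + re.escape(delimiter), text, maxsplit=maxsplit)


def unescape(text):
    """Remove the backslash in front of an escaped comma, colon or semi-colon."""
    return re.sub(r"\\([,:;])", r"\1", text)


def parse_specification_list(text):
    """Return (token, value) pairs from a semi-colon separated Specification List."""
    pairs = []
    for spec in split_unescaped(text, ";"):
        if not spec.strip():
            continue  # skip the empty piece after a trailing semi-colon
        parts = split_unescaped(spec, ":", maxsplit=1)
        token = parts[0].strip()
        value = unescape(parts[1].strip()) if len(parts) > 1 else ""
        pairs.append((token, value))
    return pairs


if __name__ == "__main__":
    joined = ("MOL_ID: 1; MOLECULE: S-ADENOSYLMETHIONINE SYNTHETASE; "
              "CHAIN:  A, B; "
              "SYNONYM: MAT, ATP\\:L-METHIONINE S-ADENOSYLTRANSFERASE; "
              "EC:  2.5.1.6;")
    for token, value in parse_specification_list(joined):
        print(token, "=>", value)
```

In this sketch the escaped colon in the SYNONYM value survives the split and is reported without its backslash.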

Record Format

Every PDB file is presented in a number of lines. Each line in the PDB entry file consists of 80 columns. The last character in each PDB entry should be an end-of-line indicator.

Each line in the PDB file is self-identifying. The first six columns of every line contain a record name, which is left-justified and separated by a blank. The record name must be an exact match to one of the stated record names in this format guide.

The PDB file may also be viewed as a collection of record types. Each record type consists of one or more lines.

Each record type is further divided into fields.

Each record type is detailed in this document. The description of each record type includes the following sections:

Overview
Record Format
Details
Verification/Validation/Value Authority Control
Relationship to Other Record Types
Examples
Known Problems

For records that are fully described in fixed column format, columns not assigned to fields must be left blank.
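
The line-level rules in this section (80 columns per line, record name in the first six columns) can be sketched as a small check. This is a minimal illustration, not a validator: the names record_name, check_line and KNOWN_NAMES are hypothetical, and KNOWN_NAMES holds only a few of the record names defined in this guide.

```python
# Assumption: a small illustrative subset of the record names in this guide.
KNOWN_NAMES = {"HEADER", "TITLE", "COMPND", "ATOM", "HETATM", "TER", "END"}


def record_name(line):
    """Return the record name held in columns 1-6, without trailing blanks."""
    return line[:6].rstrip()


def check_line(line, known_names=KNOWN_NAMES):
    """Return a list of problems found in one line of an entry."""
    problems = []
    body = line.rstrip("\r\n")  # ignore the end-of-line indicator
    if len(body) != 80:
        problems.append(f"expected 80 columns, found {len(body)}")
    name = record_name(body)
    if name not in known_names:
        problems.append(f"unknown record name {name!r}")
    return problems


if __name__ == "__main__":
    print(check_line("END".ljust(80) + "\n"))
    print(check_line("FOO".ljust(70)))
```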

Types of Records

It is possible to group records into categories based upon how often the record type appears in an entry.

One time, single line: There are records that may only appear one time and without continuations in a file. Listed alphabetically, these are:

RECORD TYPE             DESCRIPTION
------------------------------------------------------------------------------------
CRYST1                  Unit cell parameters, space group, and Z.
END                     Last record in the file.
HEADER                  First line of the entry, contains PDB ID code,
                        classification, and date of deposition.
NUMMDL                  Number of models.
MASTER                  Control record for bookkeeping.
ORIGXn                  Transformation from orthogonal coordinates to the
                        submitted coordinates (n = 1, 2, or 3).
SCALEn                  Transformation from orthogonal coordinates to fractional
                        crystallographic coordinates (n = 1, 2, or 3).

It is an error for a duplicate of any of these records to appear in an entry.

One time, multiple lines: There are records that conceptually exist only once in an entry, but the information content may exceed the number of columns available. These records are therefore continued on subsequent lines. Listed alphabetically, these are:

RECORD TYPE             DESCRIPTION
-----------------------------------------------------------------------------------
AUTHOR                  List of contributors.
CAVEAT                  Severe error indicator.
COMPND                  Description of macromolecular contents of the entry.
EXPDTA                  Experimental technique used for the structure determination.
MDLTYP                  Contains additional annotation pertinent to the coordinates
                        presented in the entry.
KEYWDS                  List of keywords describing the macromolecule.
OBSLTE                  Statement that the entry has been removed from distribution
                        and list of the ID code(s) which replaced it.
SOURCE                  Biological source of macromolecules in the entry.
SPLIT                   List of PDB entries that compose a larger macromolecular
                        complex.
SPRSDE                  List of entries obsoleted from public release and replaced by
                        current entry.
TITLE                   Description of the experiment represented in the entry.

The second and subsequent lines contain a continuation field, which is a right-justified integer. This number increments by one for each additional line of the record, and is followed by a blank character.
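
The numbering rule for continuation lines can be illustrated with a short check. This is a sketch under a stated assumption: the continuation field is taken to occupy columns 9-10, which is not stated in this section and should be confirmed against the individual record descriptions. The function name check_continuations is hypothetical.

```python
def check_continuations(lines, first_col=9, last_col=10):
    """Return True if the continuation fields read blank, 2, 3, ... in order.

    Assumption: the continuation field occupies columns first_col..last_col
    (1-based, inclusive); the default of 9-10 is not taken from this section.
    """
    for index, line in enumerate(lines):
        field = line[first_col - 1:last_col].strip()
        expected = "" if index == 0 else str(index + 1)
        if field != expected:
            return False
    return True


if __name__ == "__main__":
    # Columns 1-8 hold the record name and padding, columns 9-10 the counter.
    lines = [
        "TITLE   " + "  " + " STRUCTURE OF A",
        "TITLE   " + " 2" + " HYPOTHETICAL PROTEIN",
    ]
    print(check_continuations(lines))
```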

Multiple times, one line: Most record types appear multiple times, often in groups where the information is not logically concatenated but is presented in the form of a list. Many of these record types have a custom serialization that may be used not only to order the records, but also to connect to other record types. Listed alphabetically, these are:

RECORD TYPE             DESCRIPTION
-----------------------------------------------------------------------------------
ANISOU                  Anisotropic temperature factors.
ATOM                    Atomic coordinate records for standard groups.
CISPEP                  Identification of peptide residues in cis conformation.
CONECT                  Connectivity records.
DBREF                   Reference to the entry in the sequence database(s).
HELIX                   Identification of helical substructures.
HET                     Identification of non-standard groups (heterogens).
HETATM                  Atomic coordinate records for heterogens.
LINK                    Identification of inter-residue bonds.
MODRES                  Identification of modifications to standard residues.
MTRIXn                  Transformations expressing non-crystallographic symmetry
                        (n = 1, 2, or 3). There may be multiple sets of these records.
REVDAT                  Revision date and related information.
SEQADV                  Identification of conflicts between PDB and the named
                        sequence database.
SHEET                   Identification of sheet substructures.
SSBOND                  Identification of disulfide bonds.

Multiple times, multiple lines: There are records that conceptually exist multiple times in an entry, but the information content may exceed the number of columns available. These records are therefore continued on subsequent lines. Listed alphabetically, these are:

RECORD TYPE             DESCRIPTION
-------------------------------------------------------------------------------
FORMUL                  Chemical formula of non-standard groups.
HETNAM                  Compound name of the heterogens.
HETSYN                  Synonymous compound names for heterogens.
SEQRES                  Primary sequence of backbone residues.
SITE                    Identification of groups comprising important entity sites.

The second and subsequent lines contain a continuation field which is a right-justified integer.
This number increments by one for each additional line of the record, and is followed by a blank character.

Grouping: There are three record types used to group other records.
Listed alphabetically, these are:

RECORD TYPE             DESCRIPTION
------------------------------------------------------------------------------------
ENDMDL                  End-of-model record for multiple structures in a single
                        coordinate entry.
MODEL                   Specification of model number for multiple structures in a
                        single coordinate entry.
TER                     Chain terminator.

The MODEL/ENDMDL records surround groups of ATOM, HETATM, ANISOU, and TER records. TER records indicate the end of a chain.
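
The grouping role of MODEL and ENDMDL can be sketched as follows. This is a minimal illustration that assumes the lines of an entry are already available as a list of strings; the function name split_models is hypothetical.

```python
COORDINATE_RECORDS = {"ATOM", "HETATM", "ANISOU", "TER"}


def split_models(lines):
    """Group coordinate lines by the MODEL/ENDMDL pair that surrounds them."""
    models = []
    current = None  # None means we are outside a MODEL/ENDMDL pair
    for line in lines:
        name = line[:6].rstrip()
        if name == "MODEL":
            current = []
        elif name == "ENDMDL":
            if current is not None:
                models.append(current)
            current = None
        elif current is not None and name in COORDINATE_RECORDS:
            current.append(line)
    return models


if __name__ == "__main__":
    entry = [
        "MODEL        1",
        "ATOM      1  N   MET A   1",
        "TER       2      MET A   1",
        "ENDMDL",
        "MODEL        2",
        "HETATM    3  O   HOH A 101",
        "ENDMDL",
    ]
    for number, model in enumerate(split_models(entry), start=1):
        print(number, len(model))
```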

Other: The remaining record types have a detailed inner structure.
Listed alphabetically, these are:

RECORD TYPE             DESCRIPTION
-----------------------------------------------------------------------------------
JRNL                    Literature citation that defines the coordinate set.
REMARK                  General remarks; they can be structured or free form.

PDB Format Change Policy

The wwPDB will use the following protocol in making changes to the way PDB coordinate entries are represented and archived. The purpose of the policy is to allow ample time for everyone to understand these changes and to assess their impact on existing programs. PDB format modifications are necessary to address the changing needs of PDB users as well as the changing nature of the data that is archived.

Comments and suggestions will be solicited from the community on specific problems and data representation issues as they arise.

Proposed format changes will be disseminated through [email protected] and wwpdb.org.

A 60-day discussion period will follow the announcement of proposed changes. Comments and suggestions must be received within this time period. Major changes that are not upwardly compatible will be allotted up to twice the standard amount of discussion time.

The wwPDB will then work in consultation with the wwPDB Advisory Committee and the equivalent partner Scientific Advisory Committees to evaluate and reconcile all suggestions. The final decision will be officially announced via [email protected] and wwpdb.org.

Implementation will follow official announcement of the format change. Major changes will not appear in PDB files earlier than 60 days after the announcement, allowing sufficient time to modify files and programs.

Order of Records

All records in a PDB coordinate entry must appear in a defined order. Mandatory record types are present in all entries. When mandatory data are not provided, the record name must appear in the entry with a NULL indicator. Optional items become mandatory when certain conditions exist. Record order and existence are described in the following table:

RECORD TYPE             EXISTENCE           CONDITIONS IF OPTIONAL
--------------------------------------------------------------------------------------
HEADER                  Mandatory
OBSLTE                  Optional            Mandatory in entries that have been
                                            replaced by a newer entry.
TITLE                   Mandatory
SPLIT                   Optional            Mandatory when large macromolecular
                                            complexes are split into multiple PDB
                                            entries.
CAVEAT                  Optional            Mandatory when there are outstanding errors
                                            such as chirality.
COMPND                  Mandatory
SOURCE                  Mandatory
KEYWDS                  Mandatory
EXPDTA                  Mandatory
NUMMDL                  Optional            Mandatory for NMR ensemble entries.
MDLTYP                  Optional            Mandatory for NMR minimized average
                                            structures or when the entire polymer
                                            chain contains C alpha or P atoms only.
AUTHOR                  Mandatory
REVDAT                  Mandatory
SPRSDE                  Optional            Mandatory for a replacement entry.
JRNL                    Optional            Mandatory if a publication describes
                                            the experiment.
REMARK 0                Optional            Mandatory for a re-refined structure.
REMARK 1                Optional
REMARK 2                Mandatory
REMARK 3                Mandatory
REMARK N                Optional            Mandatory under certain conditions.
DBREF                   Optional            Mandatory for all polymers.
DBREF1/DBREF2           Optional            Mandatory when certain sequence database
                                            accession and/or sequence numbering
                                            does not fit preceding DBREF format.
SEQADV                  Optional            Mandatory if sequence conflict exists.
SEQRES                  Mandatory           Mandatory if ATOM records exist.
MODRES                  Optional            Mandatory if modified group exists in the
                                            coordinates.
HET                     Optional            Mandatory if a non-standard group other
                                            than water appears in the coordinates.
HETNAM                  Optional            Mandatory if a non-standard group other
                                            than water appears in the coordinates.
HETSYN                  Optional
FORMUL                  Optional            Mandatory if a non-standard group or
                                            water appears in the coordinates.
HELIX                   Optional
SHEET                   Optional
SSBOND                  Optional            Mandatory if a disulfide bond is present.
LINK                    Optional            Mandatory if non-standard residues appear
                                            in a polymer.
CISPEP                  Optional
SITE                    Optional
CRYST1                  Mandatory
ORIGX1 ORIGX2 ORIGX3    Mandatory
SCALE1 SCALE2 SCALE3    Mandatory
MTRIX1 MTRIX2 MTRIX3    Optional            Mandatory if the complete asymmetric unit
                                            must be generated from the given coordinates
                                            using non-crystallographic symmetry.
MODEL                   Optional            Mandatory if more than one model
                                            is present in the entry.
ATOM                    Optional            Mandatory if standard residues exist.
ANISOU                  Optional
TER                     Optional            Mandatory if ATOM records exist.
HETATM                  Optional            Mandatory if non-standard group exists.
ENDMDL                  Optional            Mandatory if MODEL appears.
CONECT                  Optional            Mandatory if non-standard group appears
                                            and if LINK or SSBOND records exist.
MASTER                  Mandatory
END                     Mandatory
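
The ordering and existence rules in the table can be sketched as a simple check over the sequence of record names in an entry. This is a simplified illustration with several assumptions: all REMARK records are treated as one type (REMARK 2 and REMARK 3 are not checked individually), record types that interleave in practice share one position, and the conditional rules in the third column are not evaluated. The names ORDER, RANK, MANDATORY and check_order are hypothetical.

```python
# Record names in table order. Names that share a tuple may interleave
# (for example ATOM, ANISOU, TER and HETATM within the coordinate section).
ORDER = [
    ("HEADER",), ("OBSLTE",), ("TITLE",), ("SPLIT",), ("CAVEAT",),
    ("COMPND",), ("SOURCE",), ("KEYWDS",), ("EXPDTA",), ("NUMMDL",),
    ("MDLTYP",), ("AUTHOR",), ("REVDAT",), ("SPRSDE",), ("JRNL",),
    ("REMARK",), ("DBREF", "DBREF1", "DBREF2"), ("SEQADV",), ("SEQRES",),
    ("MODRES",), ("HET",), ("HETNAM",), ("HETSYN",), ("FORMUL",),
    ("HELIX",), ("SHEET",), ("SSBOND",), ("LINK",), ("CISPEP",), ("SITE",),
    ("CRYST1",), ("ORIGX1", "ORIGX2", "ORIGX3"),
    ("SCALE1", "SCALE2", "SCALE3"), ("MTRIX1", "MTRIX2", "MTRIX3"),
    ("MODEL", "ATOM", "ANISOU", "TER", "HETATM", "ENDMDL"),
    ("CONECT",), ("MASTER",), ("END",),
]
RANK = {name: rank for rank, group in enumerate(ORDER) for name in group}

# Unconditionally mandatory records only; conditional rules are not checked.
MANDATORY = [
    "HEADER", "TITLE", "COMPND", "SOURCE", "KEYWDS", "EXPDTA", "AUTHOR",
    "REVDAT", "REMARK", "CRYST1", "ORIGX1", "ORIGX2", "ORIGX3",
    "SCALE1", "SCALE2", "SCALE3", "MASTER", "END",
]


def check_order(names):
    """Return a list of ordering and existence problems for a list of record names."""
    problems = []
    last_rank = -1
    for name in names:
        rank = RANK.get(name)
        if rank is None:
            problems.append(f"unknown record {name}")
            continue
        if rank < last_rank:
            problems.append(f"{name} appears out of order")
        last_rank = max(last_rank, rank)
    present = set(names)
    for name in MANDATORY:
        if name not in present:
            problems.append(f"missing mandatory record {name}")
    return problems


if __name__ == "__main__":
    print(check_order(["TITLE", "HEADER", "END"]))
```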

Sections of an Entry

The following table lists the various sections of a PDB entry (version 3.2) and the record types within each section:

SECTION                 DESCRIPTION                       RECORD TYPE
-------------------------------------------------------------------------------------
Title                   Summary descriptive remarks       HEADER, OBSLTE, TITLE, SPLIT,
                                                          CAVEAT, COMPND, SOURCE, KEYWDS,
                                                          EXPDTA, NUMMDL, MDLTYP, AUTHOR,
                                                          REVDAT, SPRSDE, JRNL
Remark                  Various comments about entry      REMARKs 0-999
Annotations             in more depth than standard
                        records
Primary structure       Peptide and/or nucleotide         DBREF, SEQADV, SEQRES, MODRES
                        sequence and the
                        relationship between the PDB
                        sequence and that found in
                        the sequence database(s)
Heterogen               Description of non-standard       HET, HETNAM, HETSYN, FORMUL
                        groups
Secondary structure     Description of secondary          HELIX, SHEET
                        structure
Connectivity            Chemical connectivity             SSBOND, LINK, CISPEP
annotation
Miscellaneous           Features within the               SITE
features                macromolecule
Crystallographic        Description of the                CRYST1
                        crystallographic cell
Coordinate              Coordinate transformation         ORIGXn, SCALEn, MTRIXn
transformation          operators
Coordinate              Atomic coordinate data            MODEL, ATOM, ANISOU,
                                                          TER, HETATM, ENDMDL
Connectivity            Chemical connectivity             CONECT
Bookkeeping             Summary information,              MASTER, END
                        end-of-file marker

Field Formats and Data Types

Each record type is presented in a table which contains the division of the records into fields by column number, defined data type, field name or a quoted string which must appear in the field, and field definition. Any column not specified must be left blank.

Each field contains an identified data type that can be validated by a program. These are:

DATA TYPE               DESCRIPTION
----------------------------------------------------------------------------------
AChar                   An alphabetic character (A-Z, a-z).
Atom                    Atom name.
Character               Any non-control character in the ASCII character set or a
                        space.
Continuation            A two-character field that is either blank (for the first
                        record of a set) or contains a two digit number
                        right-justified and blank-filled which counts continuation
                        records starting with 2. The continuation number must be
                        followed by a blank.
Date                    A 9 character string in the form DD-MMM-YY where DD is the
                        day of the month, zero-filled on the left (e.g., 04); MMM is
                        the common English 3-letter abbreviation of the month; and
                        YY is the last two digits of the year. This must represent
                        a valid date.
IDcode                  A PDB identification code which consists of 4 characters,
                        the first of which is a digit in the range 0 - 9; the
                        remaining 3 are alpha-numeric, and letters are upper case
                        only. Entries with a 0 as the first character do not
                        contain coordinate data.
Integer                 Right-justified blank-filled integer value.
Token                   A sequence of non-space characters followed by a colon and a
                        space.
List                    A String that is composed of text separated with commas.
LString                 A literal string of characters. All spacing is significant
                        and must be preserved.
LString(n)              An LString with exactly n characters.
Real(n,m)               Real (floating point) number in the FORTRAN format Fn.m.
Record name             The name of the record: 6 characters, left-justified and
                        blank-filled.
Residue name            One of the standard amino acid or nucleic acids, as listed
                        below, or the non-standard group designation as defined in
                        the HET dictionary. Field is right-justified.
SList                   A String that is composed of text separated with semi-colons.
Specification           A String composed of a token and its associated value
                        separated by a colon.
Specification List      A sequence of Specifications, separated by semi-colons.
String                  A sequence of characters. These characters may have
                        arbitrary spacing, but should be interpreted as directed
                        below.
String(n)               A String with exactly n characters.
SymOP                   An integer field of from 4 to 6 digits, right-justified, of
                        the form nnnMMM where nnn is the symmetry operator number and
                        MMM is the translation vector.
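
Several of these data types can be validated with a few lines of code. The sketch below covers Date, IDcode, SymOP and Real(n,m) only, and all function names are hypothetical. One assumption is not part of the format: because the Date type stores only two year digits, the century is chosen with a pivot (70-99 map to 19xx, 00-69 to 20xx).

```python
import re
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def parse_date(field):
    """Parse a DD-MMM-YY Date field; raise ValueError if it is not a valid date."""
    match = re.fullmatch(r"(\d{2})-([A-Z]{3})-(\d{2})", field)
    if match is None or match.group(2) not in MONTHS:
        raise ValueError(f"not a DD-MMM-YY date: {field!r}")
    day = int(match.group(1))
    month = MONTHS.index(match.group(2)) + 1
    yy = int(match.group(3))
    # Assumption: century pivot, since the field holds only two year digits.
    year = 1900 + yy if yy >= 70 else 2000 + yy
    return date(year, month, day)  # rejects impossible dates such as 31-FEB


def is_idcode(field):
    """True if field is a digit followed by three digits or upper-case letters."""
    return re.fullmatch(r"[0-9][0-9A-Z]{3}", field) is not None


def parse_symop(field):
    """Split a right-justified nnnMMM SymOP field into (operator, translation)."""
    digits = field.strip()
    if re.fullmatch(r"\d{4,6}", digits) is None:
        raise ValueError(f"not a SymOP field: {field!r}")
    return int(digits[:-3]), digits[-3:]


def format_real(value, n, m):
    """Format a number as FORTRAN Fn.m: width n, m digits after the point."""
    return f"{value:{n}.{m}f}"


if __name__ == "__main__":
    print(parse_date("04-JUL-07"), is_idcode("1ABC"), parse_symop("  1555"))
    print(repr(format_real(11.5, 8, 3)))
```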

To interpret a String, concatenate the contents of all continued fields together, collapse all sequences of multiple blanks to a single blank, and remove any leading and trailing blanks. This permits very long strings to be properly reconstructed.
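
The interpretation rule for a String can be sketched directly. This is a minimal illustration that assumes the String fields have already been extracted from each continued line; the function name interpret_string is hypothetical. Note that str.split() collapses any whitespace, not only blanks, which is equivalent here because control characters do not appear in an entry.

```python
def interpret_string(fields):
    """Join continued String fields and normalize their spacing."""
    joined = "".join(fields)        # concatenate all continued fields
    return " ".join(joined.split())  # collapse blank runs, trim both ends


if __name__ == "__main__":
    fields = ["CRYSTAL  STRUCTURE OF", " A   HYPOTHETICAL PROTEIN "]
    print(interpret_string(fields))
```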