Statistics

 

Protein-peptide complexes

clusterT2_75_website.png

Figure 1. Protein-peptide dataset. PepX contains 505 unique protein-peptide interface clusters from 1431 PDB entries, representing the diversity of structural information on protein-peptide complexes available in the PDB. Of all protein-peptide complexes available from the PDB, 47% fall within only 10 classes, containing complexes with peptides bound to MHC, thrombins, α-ligand binding domains, SH3 domains, PDZ domains and others.

lengthLigand.png

Figure 2. Distribution of ligand size in the database as the percentage of complexes for each ligand length. The shortest ligand considered is 5 amino acids long; the longest consists of 35 residues. About 70% of all peptides lie within the [5-15] residue range.
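A length distribution of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy lengths are hypothetical and are not PepX data or the code used to produce the figure.

```python
from collections import Counter

def length_percentages(lengths):
    """Return {length: percentage of complexes} for a list of peptide lengths."""
    counts = Counter(lengths)
    total = len(lengths)
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}

def share_in_range(lengths, lo, hi):
    """Percentage of entries whose length lies in [lo, hi], bounds inclusive."""
    inside = sum(1 for n in lengths if lo <= n <= hi)
    return 100.0 * inside / len(lengths)

# Toy lengths for illustration only (not taken from PepX).
toy_lengths = [5, 5, 8, 9, 12, 15, 16, 20, 28, 35]
print(length_percentages(toy_lengths))
print(share_in_range(toy_lengths, 5, 15))
```

The same two helpers apply unchanged to receptor lengths; only the input list and the range bounds differ.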

lengthReceptor.png

Figure 3. Distribution of receptor size in the database as the percentage of complexes for each receptor length. The largest protein in the complexes contains 2552 amino acid residues; the shortest considered is 35 residues long. Most proteins are smaller than 600 residues, with a peak in the [300-400] range.

cdhit_fullPepx.png

Figure 4. Receptor sequence redundancy within the PepX database for all complexes (blue) and the centroid set (red). The receptor sequences in the PepX database were clustered with the cd-hit algorithm at various sequence identity thresholds, from 100% (removing only identical sequences) down to 40%. Although sequence redundancy within the database is large, it does not always reflect redundancy in binding modes. For instance, removing only identical sequences (100%) results in a loss of more than 60% of all complexes and more than 20% of the centroids, showing that some receptors with identical sequences bind peptides in different structural modes.
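As a rough sketch of how such a threshold scan could be set up, the Python snippet below builds one cd-hit command line per identity threshold. The exact thresholds, options and file names used for PepX are not stated here, so the threshold list, `receptors.fasta` and the `pepx` output prefix are assumptions. The `-i`, `-o`, `-c` and `-n` options are standard cd-hit options, and the word sizes follow the ranges recommended in the cd-hit user guide. The commands are only constructed, not executed.

```python
def word_size(identity):
    """Word length (-n) recommended by the cd-hit guide for a given threshold."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2  # valid for thresholds between 0.4 and 0.5

def cdhit_command(fasta, out_prefix, identity):
    """Build the argument list for one cd-hit run (nothing is executed)."""
    percent = int(round(identity * 100))
    return [
        "cd-hit",
        "-i", fasta,                      # input FASTA (hypothetical file name)
        "-o", f"{out_prefix}_{percent}",  # output prefix (hypothetical)
        "-c", f"{identity:.2f}",          # sequence identity threshold
        "-n", str(word_size(identity)),   # word length matching the threshold
    ]

# Assumed threshold scan from 100% down to 40% identity.
thresholds = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
for t in thresholds:
    print(" ".join(cdhit_command("receptors.fasta", "pepx", t)))
```

Each printed line can be run in a shell where cd-hit is installed. Counting the sequences in each output file then gives the number of non-redundant receptors remaining at that threshold.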

Clusters

 

distrClusterSize.png

Figure 5. Distribution of the number of elements in the PepX clusters for various thresholds of structural similarity (1, 2 and 3 Å) and binding site alignment (50% (A), 75% (B) and 95% (C)). For all settings, clusters containing only one complex are the most common, ranging from 63% of all clusters (S1A, 50% and 3Å) to 87% of all clusters (S1C, 95% and 1Å).
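The cluster-size distribution can be computed from a list of per-complex cluster labels, as in the minimal Python sketch below. The function name and the toy labels are hypothetical and do not come from PepX.

```python
from collections import Counter

def cluster_size_distribution(cluster_ids):
    """Map cluster size to the percentage of clusters having that size.

    cluster_ids holds one cluster label per complex.
    """
    sizes = Counter(cluster_ids)          # cluster label -> number of complexes
    size_hist = Counter(sizes.values())   # cluster size -> number of clusters
    n_clusters = len(sizes)
    return {s: 100.0 * k / n_clusters for s, k in sorted(size_hist.items())}

# Toy assignment of 8 complexes to 5 clusters (illustration only).
toy_labels = ["c1", "c1", "c1", "c2", "c3", "c4", "c4", "c5"]
dist = cluster_size_distribution(toy_labels)
print(dist)
print("singleton clusters (%):", dist.get(1, 0.0))
```

The entry for size 1 is the singleton percentage that the caption reports for each combination of thresholds.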

Annotations

General

annotations.png

Figure 6. General annotation statistics. Percentage of receptors in the PepX database represented by different annotations: SCOP, CATH, Pfam and UniProt. Coverage is highest for UniProt (>80%), followed by the structural classifications CATH (ca. 70%) and SCOP (ca. 55%), and finally protein family annotation by Pfam (ca. 50%).

SCOP

scop_hier.png

Figure 7. Population of the SCOP hierarchy with protein-peptide complexes. Although most SCOP classes are represented by receptors in the database, protein-peptide complexes do not represent the full range of SCOP folds, superfamilies and families.

distr_scop.png

Figure 8. Distribution of structures in the different SCOP classes for the PepX database (A) and the full SCOP database (B). Whereas the α, β,  α/β and α+β classes are of similar size in the full SCOP database, the all-β and α+β proteins are overrepresented in PepX. 

 

CATH

cath_hier.png

 

Figure 9. Protein-peptide complexes in the CATH hierarchy. Every CATH class is represented by complexes, and architectures are also well represented (50%). In contrast, at lower CATH levels, fewer than 10% of both topologies and superfamilies hold at least one protein-peptide complex.

distr_cath.png

Figure 10. Distribution of structures in the different CATH classes. In accordance with the SCOP classification, classes with mainly β-structures are largely overrepresented. Alpha and beta structures are underrepresented (35% in PepX versus 52% in full CATH). The same trend is seen in SCOP when the α/β and α+β classes are merged, although the difference is smaller (43% in PepX versus 49% in full SCOP).