xgrayce · com . . .

Bioinformatics: improving statistical prediction of PDZ domain ligands

[nNOS-syntrophin PDZ complex]

PDZ domains [ cf. Fanning, A. S. and Anderson, J. M., Cur. Opin. Cell Biol. 11, 432-430 (1999) ] are short stretches of protein thought to ``bolt'' together large protein ``scaffolds'' in cells. An interesting problem is trying to design an algorithm which quickly and reliably predicts whether a given ligand (sequence of amino acids) will bind (stick) to a particular PDZ domain. Good solutions to this problem would enhance the ability of biologists to deduce the natural binding partners of identified PDZ domains, and allow better design of artificial binding partners for chemotherapy.

The image at right is a ribbon model of the nNOS (green) syntrophin A (gold) PDZ heterodimer.


One approach would be to build an atomic-scale model of each system of interest and calculate from first principles the strength of attraction between the protein and a candidate ligand. That's not what we do here.
   Instead we explore the pure ``bioinformatics'' approach, where one tries to extract empirical rules for binding out of experimental data on biochemical library screens.
   An analogy: imagine all possible amino acid sequences are hands of cards dealt out in a game the rules of which we do not know. The library screens tell us which hands are ``winners'' (bind to PDZ domains) and which are ``losers.'' The task is to use this information to deduce the rules of the game, i.e. learn the rules that distinguish winning hands from losing.

The experimental database we use here includes observation culled from the literature on binding between 9 different PDZ domains (nNOS, INAD, and 7 domains on hINADL) and 216 polypeptide ligands. The proteins are multiply aligned with ClustalX so that the biochemical role of position n on each protein is presumably similar. Positions on the candidate amino acid sequences are numbered from ``0'' (zero) at the COOH terminus.

Position dependent scoring matrices

The simplest approach begins with calculating the frequencies of each type of amino acid at each position in a sequence observed to bind to a PDZ domain:

From this information we can devise a scoring matrix for testing unknown ligands. That is, we define a score S like so:

[equation]

where j numbers the amino acids in the peptide, and Pj(aj) is the experimental probability that a PDZ ligand has amino acid aj at position j. Why do we use the logarithm? From the reversible work theorem of statistical mechanics we know the probability of observing events (in this case binding) is proportional to the exponential of the reversible change in free energy required to bring them about. Hence conversely, the log of probabilities is directly proportional to (some) change in free energy. By choosing the logarithmic form for S we are implicitly adding together a free energy of binding associated with each ligand position.

Now we test. Our goal is to have the score S of amino acid sequences known to bind to PDZ domains be as different as possible from all other possible sequences. Here's a bar graph of the probability P(S) of observing a score S for (in red) the ligands that actually bind to PDZ domains, and (in blue) all other tripeptide amino acid sequences. Note the ordinate axis is logarithmic, and that the blue bar at far left goes way above the top of the graph. This bar represents the bulk of trial sequences, and corresponds to an ``absolutely reject'' score of minus infinity. It is drawn arbitrarily at S = -105 (instead of minus infinity) just so it fits on the graph.

The equation above, defining a position-dependent scoring matrix, is the standard approach, and it looks pretty good. If we pick a cutoff score Smin of about -80 to identify ligands, this graph suggests we can expect false positives (identifying a sequence as binding when it is not) and false negatives (failing to identify a ligand as binding) roughly 5% of the time.

But we can do better

These screening experiments include hidden treasure. Instead of only the probability of finding amino acid Aj at position j, let us examine the experimental probability of observing a pair of amino acids (Aj, Ak) at ligand position j and ligand position k.

What does this information tell us? Amino acids can be correlated or uncorrelated, meaning that the probability of observing Aj at j can depend (correlated) or not (uncorrelated) on whether Ak is at k. Correlation in this sense is the same idea expressed in genetics by the phrase ``linkage disequilibrium.''

If amino acids are correlated, then the probability of observing a given pair (Aj, Ak) will not be well-represented by the product of the separate probabilities of observing Aj and Ak. That is, we can expect the position-dependent scoring matrix used above to be inaccurate for certain amino acid combinations. We can overcome this by building a scoring matrix of higher dimensionality, which directly uses the pair probabilities. This will give us better selectivity in a case where, for example, Aj frequently appeared at j, and Ak frequently appeared at k, but the two never occured together in the same system.

Higher-dimensional scoring matrices

We can define a new, higher-dimensionality score S in three different ways, depending on whether we want to use information on pair probabilities within the ligand, between PDZ domain and ligand, or both. Here are the three possible definitions of S, and the corresponding plots of P(S) for all tripeptide ligands:

[equation]
where j and k number the amino acids in the peptide, and Pjk(aj,ak) is the experimental probability that a PDZ ligand has amino acid aj at position j and amino acid ak at position k.

[equation]
where J numbers the amino acids in the PDZ domain, and j numbers the amino acids in the peptide. PJj(aJ,aj) is the experimental probability that a PDZ protein bound to a ligand has amino acid aJ at position J in the PDZ domain, and amino acid aj at position j in the ligand. This is the method used by Brannetti et al. [ J. Mol. Biol. 298, p. 313 (2000) ] in the ingenious ``SPOT algorithm,'' except that instead of including all positions J on the protein, they only included those that a 3-D structure alignment suggested were within an arbitrary small distance of ligand atoms.

[equation]
where J numbers the amino acids in the PDZ domain, and both j and k number amino acids in the peptide. PJj(aJ,aj) and Pjk(aj,ak) are defined as above. Now we are using the most possible pair probability information.

Conclusion

In each graph a score S of -105.0 is arbitrarily assigned to any peptide for which a probability on the right-hand side of the equation for S equals zero. These ``completely rejected'' peptides account for the largest bar in the graph of P(S) for the nonbinding peptides, at the far left, rising above the top of each graph.

What we see is that the higher-dimensionality scoring matrices, which add information about pair probabilities, brutally weed out nonbinding amino acid sequences by greatly increasing the likelihood that one or more of the probabilities in the definition of S will be zero, causing nonbinding peptides to be ``absolutely rejected.'' For the improved method, we don't even need to pick a cutoff score Smin, because practically any sequence that doesn't bind is absolutely rejected. There is still a chance of a false negative or positive, but the probability of those undesirable outcomes has fallen by a factor of 100.

back from whence you came