Foresight Update 6
A publication of the Foresight Institute
Protein Engineering: An introduction to a newly recognized field
Adapted from: Protein Engineering Literature Scan #3
by James B. Lewis
With links added to 1997 WWW sources of information
Protein engineering is a new field; separate meetings, journals,
and books devoted to the topic have only appeared during the last
two to three years. One of the earliest uses of the term was in
an article contained in a special issue of Science
(February 11, 1983) that was devoted to the new field of
biotechnology: "Protein Engineering" by Kevin M. Ulmer
(Science 219:666-671). This article is an
overview that describes how several advances in different fields
have made it possible to attempt to modify many properties of
proteins by combining information on three-dimensional structure
and classical protein chemistry with new methods of genetic
engineering and molecular graphics. Ulmer concludes that this
developing technology will be used to further academic
understanding of how proteins work, and to produce altered
proteins for improved commercial products. He also envisions
"...paving the way for designing novel enzymes from
first principles. Protein engineering thus represents the
first major step toward a more general capability for
molecular engineering which would allow us to structure
matter atom by atom."
--Kevin M. Ulmer
The basic set of ideas works something like this. You use
computer graphics to display the three-dimensional structure of
the protein you are studying, experimentally determined if
possible or, failing that, the structure of a protein
sufficiently closely related to serve as a model. You then
combine calculation and guesswork to decide what modifications in
structure might bring about a desired change in the properties of
the protein.
You then produce this new protein in one of two ways. If it is a
very small protein, often called a peptide, you can synthesize it
chemically by using the Merrifield solid phase technique. This
has the additional advantage that you can use chemical building
blocks in addition to the ones used in biological systems. If the
protein is not very small, you can use the new techniques of
biotechnology to make a gene that will encode an altered protein.
One part of this technology employs solid-phase
oligodeoxynucleotide synthesis, a variant of the Merrifield
technique, to synthesize a small piece of DNA encoding the
altered portion of the protein. This small piece is then used to
mutate the natural gene into the gene for the desired protein,
using the gene-splicing methods of recombinant DNA. The altered gene is
introduced into a vector-host system of the type used in
biotechnology to abundantly produce proteins that are not
"well-expressed" (produced) in nature. This protein can
then be purified and studied to determine how well your
predictions worked. This cycle of "guess-experiment-guess
again" was not possible prior to the advent of these
techniques.
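As a hedged illustration of the oligonucleotide step (not taken
from the article), the Python sketch below builds a mutagenic
oligo: given a gene sequence, the codon to change, and the
desired amino acid, it splices in a replacement codon and returns
the altered region plus flanking bases that match the natural
gene. The toy gene, the partial codon table, and the flank length
are all invented for the example.

    # Illustrative sketch: designing a mutagenic oligonucleotide that
    # changes one codon of a gene. The codon table is deliberately
    # partial; a real one covers all 20 amino acids.

    PREFERRED_CODON = {  # one arbitrary codon per amino acid (partial)
        "A": "GCT", "D": "GAT", "E": "GAA", "G": "GGT",
        "K": "AAA", "L": "CTG", "S": "TCT", "V": "GTT",
    }

    def mutagenic_oligo(gene, codon_index, new_aa, flank=9):
        """Return an oligo spanning the altered codon plus flanking bases.

        gene        -- coding-strand DNA sequence, 5' to 3'
        codon_index -- zero-based index of the codon to replace
        new_aa      -- one-letter code of the desired amino acid
        flank       -- matching bases kept on each side of the codon
        """
        start = 3 * codon_index
        new_codon = PREFERRED_CODON[new_aa]
        mutated = gene[:start] + new_codon + gene[start + 3:]
        return mutated[max(0, start - flank):start + 3 + flank]

    gene = "ATGGCTAAAGGTGAAGATCTGTCTGTTTAA"   # toy 10-codon gene
    print(mutagenic_oligo(gene, 5, "V"))      # codon 6: GAT (Asp) -> GTT (Val)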
A major part of the intellectual effort of protein engineering
is devoted to solving the "protein folding" problem.
Enzymology, a branch of classical biochemistry, leads to the idea
that a protein's function follows from its three-dimensional, or
"tertiary," structure, and further, that its
three-dimensional structure follows from the linear sequence of
amino acid residues that comprises its "primary"
structure. The linear sequence can be deduced from the DNA
sequence of the gene that encodes the protein. The relationship
between DNA and protein sequences is the genetic code, which was
cracked during the late 1950s and early 1960s. Trying to
understand how the primary structure determines the tertiary
structure of the protein is, however, very much an unsolved
problem at this time. It has often been referred to as "the
second half of the genetic code." It is unclear just how
much of the folding problem will have to be solved to permit the
design of novel proteins. Eric Drexler (in Engines of Creation
and a 1981 PNAS
article) has pointed out that natural proteins may embody
obscure sequence-structure relationships for evolutionary
reasons, and that it may thus be possible to develop a much
simpler sequence-structure code for designed proteins. Whatever
the size of the problem, solving some version of the protein
folding problem will be a key step to using protein engineering
to design first generation assemblers.
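The "first half" of the genetic code, going from DNA sequence to
protein sequence, is simple enough to state in a few lines of
code; here is a minimal Python sketch with a deliberately
abbreviated codon table (the full code maps all 64 codons). It is
the reverse problem, going from sequence to three-dimensional
structure, that remains unsolved.

    # A minimal sketch of translation: DNA coding sequence to protein
    # sequence, three bases per codon. Abbreviated codon table.

    CODON_TO_AA = {
        "ATG": "M", "GCT": "A", "AAA": "K", "GGT": "G",
        "GAA": "E", "GAT": "D", "CTG": "L", "TAA": "*",   # * = stop
    }

    def translate(dna):
        protein = []
        for i in range(0, len(dna) - 2, 3):
            aa = CODON_TO_AA[dna[i:i + 3]]
            if aa == "*":                  # stop codon ends the protein
                break
            protein.append(aa)
        return "".join(protein)

    print(translate("ATGGCTAAAGGTGAAGATCTGTAA"))   # -> "MAKGEDL"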
Webmaster's Note: An excellent introduction to protein structure
is available on the Internet, at the primary site at Birkbeck
College in England and at the American mirror at Brookhaven
National Laboratory.
A number of more recent overviews of protein engineering are
available. The Preface by Dale L. Oxender and C. Fred Fox, and
the Introduction by Carl O. Pabo to the book Protein
Engineering (Oxender and Fox, ed., Alan R. Liss, Inc., New
York) are brief statements of the origins of the field. A
somewhat more technical introduction is afforded by the article
"Protein Engineering" by R. J. Leatherbarrow and A. R.
Fersht (1986) in the inaugural issue of Protein Engineering
1:7-16. The latter article considers various techniques
used to produce desired mutations in the genes encoding proteins,
discusses several proteins that are being intensively studied
using these techniques, and summarizes the results of some of
these studies. An overview specifically targeted to using
chemical synthesis for small proteins instead of genetic
engineering techniques is "Protein Engineering by Chemical
Means?" by R. E. Offord (1987), Protein Engineering
1:151-157.
To appreciate what is involved in protein engineering requires an
acquaintance with a number of fields. These include classical
biochemistry, especially of proteins; protein structure
determination, including new computer graphic methods to
represent protein structure and to calculate the effects of
different perturbations of the sequence upon the structure;
solid-phase methods for the chemical synthesis of proteins; and the
new methods of genetic engineering, including both basic
molecular biology and the techniques of biotechnology. It is the
goal of this primer to provide a summary of some of this material
and a guide to further in-depth studies.
Modeling Molecules
"Dynamic market emerging for molecular modeling" by
Mark Ratner. January 1989. BioTechnology 7:43-44,47.
The following is a summary of the above article:
Hardware and software to model, analyze, and simulate novel
molecular structures are still in a very formative stage. The
modeling process begins with the coordinates of the 3-dimensional
structure (solved by X-ray crystallography) obtained (for
proteins) from the Brookhaven
National Laboratory's Protein Data Bank (PDB). Only about 300
protein structures have been determined, so in many cases,
modeling must be attempted using a known structure that has a
similar sequence to the unknown structure.
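A hedged sketch of what that first step looks like in practice:
PDB entries store coordinates in fixed-column ATOM records, and a
few lines of Python suffice to read them. The filename is a
placeholder.

    # Reading 3-D atomic coordinates from a PDB-format file.

    def read_coordinates(path):
        atoms = []
        with open(path) as f:
            for line in f:
                if line.startswith("ATOM"):
                    atoms.append({
                        "name": line[12:16].strip(),     # atom name, e.g. "CA"
                        "residue": line[17:20].strip(),  # residue, e.g. "GLY"
                        "x": float(line[30:38]),
                        "y": float(line[38:46]),
                        "z": float(line[46:54]),
                    })
        return atoms

    atoms = read_coordinates("myoglobin.pdb")   # placeholder filename
    print(len(atoms), "atoms read; first:", atoms[0])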
The molecular graphics program then converts the 3-dimensional
coordinates into a picture of the molecule, which can be
manipulated on the computer monitor to see specific bonds and
other features of the structure, just as a physical model could
be handled to observe various features from all angles.
Webmaster's Note: One molecular graphics visualization tool
available over the Internet is RasMol, developed by Roger Sayle.
RasMol is available for UNIX, VMS, Macintosh, and Microsoft
Windows (OS/2 and Windows NT), and can be obtained by ftp.
Excellent sources of information about RasMol are available
online.
The next step is the
use of molecular mechanics programs (based on classical Newtonian
mechanics) that calculate forces among the various atoms and
minimize the overall energy of the conformation to calculate the
preferred actual structure of the protein. Advanced programs use
molecular dynamics to refine the structural calculations. The
most effective programs need to use a supercomputer for these
calculations. Consideration of nonbonded atomic interactions can
require 125 million floating-point operations for a single energy
evaluation.
Ab initio calculations to solve the (quantum mechanical)
Schrödinger equation can only deal with 10-20 atoms per molecule
because of computational limitations.
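To make the scale of these calculations concrete, here is a toy
Python version of a molecular mechanics energy function, with
harmonic "springs" for bonds and a Lennard-Jones term for every
nonbonded atom pair. The parameter values are arbitrary
stand-ins for the per-atom-type and per-bond-type parameters a
real force field would assign.

    import math

    def bond_energy(r, r0=1.5, k=300.0):
        """Harmonic bond stretch: k * (r - r0)**2."""
        return k * (r - r0) ** 2

    def lennard_jones(r, epsilon=0.1, sigma=3.4):
        """Attraction/repulsion between a nonbonded atom pair."""
        s6 = (sigma / r) ** 6
        return 4.0 * epsilon * (s6 * s6 - s6)

    def total_energy(coords, bonds):
        """coords: list of (x, y, z) tuples; bonds: set of (i, j), i < j."""
        def dist(i, j):
            return math.dist(coords[i], coords[j])
        e = sum(bond_energy(dist(i, j)) for i, j in bonds)
        # Every nonbonded pair contributes, so a single energy
        # evaluation is an O(n**2) sum -- the source of the enormous
        # operation counts quoted above.
        n = len(coords)
        e += sum(lennard_jones(dist(i, j))
                 for i in range(n) for j in range(i + 1, n)
                 if (i, j) not in bonds)
        return e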
The market for molecular modeling packages appears to be in flux,
with too many packages and too few users at the moment. Software
suppliers presently include Polygen (Waltham, MA), Biosym
Technologies (San Diego, CA), and Tripos Associates (St. Louis,
MO). The Tripos package will soon include techniques to model by
homology--comparing structural motifs that occur frequently in
nature. This "knowledge based" approach includes an
analysis of the vast protein sequence (not 3-D structure)
database to find a useful set of related proteins whose
structures can be compared to model the unknown structure.
Manufacturers of hardware include SiliconGraphics (Mountain View,
CA) and Evans & Sutherland (Salt Lake City, UT).
Molecular modeling may "provide the forum for chemists,
physicists, computer scientists, genetic engineers, and protein
purifiers to come together."
Molecular Dynamics
In a sense, molecular dynamics is the most fundamental aspect
of the study of proteins (or any other molecules) from the
perspective of nanotechnology. This area deals with how each of
the constitutive atoms of molecules, large or small, moves and
thus provides a time-evolving structural basis for considering
the properties of the molecules. If we wish to make molecular
machines, we have to understand how the parts move so that we can
make the machines function appropriately. I (JBL, 7/17/88) have
little knowledge of the subject, so I give here a few references
as places to get started.
"The Dynamics of Proteins" by Martin
Karplus and J. Andrew
McCammon. April 1986. Scientific American
254:42-51.
The following is a summary of the above article:
"The molecules essential to life are never at rest; they
would be unable to function if they were rigid. The internal
motions that underlie their workings are best explored in
computer simulations." This introduction begins by pointing
out the limitations of trying to understand in detail how
proteins function by knowing only the static structure of the
crystal, determined by X-ray crystallography (or occasionally in
solution by NMR), which represents only the time-averaged
structure of the protein.
Better understanding of how proteins function is provided by
theoretical studies, based on experimental structural
information, that lead to computer simulations of how the protein
actually moves. "The most direct approach to protein
dynamics is to treat each atom in the protein as a particle
responding to forces in the way prescribed by Newtonian physics,
in accord with Newton's equations of motion." Remember that
average-sized proteins contain 5000 or more atoms. Chemical bonds
can be treated like springs, and many weaker forces between
non-bonded atoms must be considered so that the force on each
atom depends upon the positions of every other atom in the
protein. The X-ray crystal structure gives the necessary
information to begin the simulation of atomic movements. However,
because the X-ray structure is an average, it is a very
unrealistic picture of the state of any particular molecule at
any particular time, so a complex set of calculations must first
be performed to arrive at an "equilibrated" structure.
This equilibrated structure is used as the starting point for
molecular dynamics simulations of how the molecule will behave.
These calculations use steps of the order of a femtosecond. The
best simulations follow the protein for as long as a nanosecond
(a million such steps), requiring hundreds of hours of
supercomputer time. The combination of many small local motions
of individual amino acid residues and their constituent atoms can
produce more global displacements of different parts of the
protein. What sorts of movements are important over what time
scales is discussed in general terms. The particular example of
myoglobin, the oxygen-binding protein in muscle, is discussed.
The striking point is made that "If the atoms in myoglobin
were fixed in the positions found in the X-ray-crystallographic
structure, myoglobin would be useless: the time required for an
oxygen molecule to bind to the heme group or to get out again
when needed would be much longer than a whale's lifetime"
(or the lifetime of the universe, for that matter). Simulations
showed instead how fluctuations in the positions of specific
atoms allowed the oxygen to diffuse through the structure in a
reasonable amount of time.
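The integration loop at the heart of such simulations can be
sketched compactly. Below is a minimal velocity Verlet integrator
in Python, advancing Newton's equations of motion in femtosecond
steps (a nanosecond of simulated time is a million passes through
this loop). The force function is a stand-in for the gradient of
a full force field; everything here is illustrative rather than a
production molecular dynamics code.

    def velocity_verlet(positions, velocities, masses, forces_fn,
                        dt=1.0e-15, n_steps=1000):
        """Advance Newton's equations of motion n_steps of size dt."""
        forces = forces_fn(positions)
        for _ in range(n_steps):
            for i, m in enumerate(masses):
                # half-step velocity update, then full-step position update
                velocities[i] = [v + 0.5 * dt * f / m
                                 for v, f in zip(velocities[i], forces[i])]
                positions[i] = [x + dt * v
                                for x, v in zip(positions[i], velocities[i])]
            forces = forces_fn(positions)   # forces at the new positions
            for i, m in enumerate(masses):  # second half-step on velocity
                velocities[i] = [v + 0.5 * dt * f / m
                                 for v, f in zip(velocities[i], forces[i])]
        return positions, velocities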
Also discussed is how critical parts of enzymatic reactions
usually occur over millisecond time scales, a million times as
long as can be handled with present computers. Specialized
approximations can sometimes be used and are discussed for a few
cases. These illustrate "the important role of small, high
frequency fluctuations in facilitating some larger and more
collective motions of proteins." Karplus predicts that
eventually these techniques will lead to the ability to calculate
the rates of enzymatic reactions and the binding of small
molecules to larger ones, thus providing better ways to modify
proteins for industrial purposes.
"Molecular dynamics simulations of proteins" by
Martin Karplus. October 1987. Physics Today pp.
68-72.
The following is a summary of the above article:
This review is a bit more technical, and considers in more depth
the interplay of calculation and experiment in providing
meaningful results. For example, the role of NMR in studying
internal motions of proteins is discussed. Conversely, the
application of molecular dynamics methods to NMR data is quite
useful in deriving three-dimensional protein structures from the
data. This process is referred to as "restrained
dynamics." The take-home lesson is the same as for the above
review, with the list of expected future practical developments
expanded to include the design of inhibitors to cure diseases.
A real understanding of molecular dynamics, of course, cannot be
gotten from brief review articles. A good textbook is probably Dynamics
of Proteins and Nucleic Acids by J. Andrew McCammon and
Stephen C. Harvey. Cambridge University Press, New York, 1987.
xii, 234 pp., illus. $39.50.
I say probably because I haven't seen it yet (let alone read it),
but I saw two very favorable reviews: one (titled "Good
Vibrations") by B. Robson in BioEssays, Volume
8, No. 2, p. 93 (February 1988)--admittedly a periodical
published by the same publisher as the book--and the other
(titled "Biomolecular Processes," in the more prosaic Science
fashion) by R. M. Levy in Science, 8 July, 1988, 241:234-235.
Both reviews agree that it is a very well-organized book and an
excellent place to begin to try to understand the field. Despite
starting from basics, the book is said to provide the background
needed to read the current literature of the field. The book is
about the time-dependent motions of these vital molecules. These
motions range from small-amplitude atomic vibrations that occur
in 0.1 picoseconds to large-scale allosteric transitions that
occur in milliseconds to several seconds. The theoretical and
computational methods are clearly described, with most emphasis
on the nanosecond scale since computational limitations make
detailed calculations on longer scales impractical, but these
slower processes are discussed in general terms. I take it from
what the reviewers say that molecular dynamics approaches can now
attempt to predict the three-dimensional structure of small
peptides, but since a large protein takes on the order of a
second to fold, and current simulations are limited to about a
nanosecond scale, we have a factor of a billion to go in
predicting three-dimensional structure for large proteins.
Knowledge-Based Structure Prediction
"Knowledge-based prediction of protein structures and
the design of novel molecules" by T. L.
Blundell, B. L. Sibanda, M. J. E. Sternberg, J. M. Thornton.
1987. Nature 326:347-352.
Abstract: "Prediction of the tertiary structures of
proteins may be carried out using a knowledge-based approach.
This depends on identification of analogies in secondary
structures, motifs, domains or ligand interactions between a
protein to be modeled and those of known three-dimensional
structures. Such techniques are of value in prediction of
receptor structures to aid in the design of drugs, herbicides or
pesticides, antigens in vaccine design, and novel molecules in
protein engineering."
The following is a summary of the above article:
After discussing the expected utility of structural knowledge in
applications from drug design to biological microchips, and
noting the fact that sequence information has increased much more
rapidly than 3-D structural information, this paper then
discusses the various steps involved in prediction of 3-D
structure from sequence:
Sequence Alignment
The first step is to compare the sequence of the protein whose
3-D structure is to be predicted with the known sequences of
other proteins available in the sequence database. Several
algorithms are available to do this. If the new sequence is
>25% similar to a sequence in the database, the match is
easily distinguished above the background of randomized
sequences. It is stated that an alignment score >6 standard
deviations above random alignment will give reliable prediction
of the secondary structures of most residues.
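One classic instance of such an alignment algorithm is
Needleman-Wunsch global alignment by dynamic programming. The
hedged Python sketch below computes only the optimal alignment
score, with a simple match/mismatch/gap scheme standing in for
the substitution matrices real alignment programs use.

    def align_score(a, b, match=1, mismatch=-1, gap=-2):
        """Return the optimal global alignment score of sequences a and b."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
        for i in range(1, rows):
            for j in range(1, cols):
                diag = match if a[i - 1] == b[j - 1] else mismatch
                score[i][j] = max(score[i - 1][j - 1] + diag,  # align pair
                                  score[i - 1][j] + gap,       # gap in b
                                  score[i][j - 1] + gap)       # gap in a
        return score[rows - 1][cols - 1]

    print(align_score("MAKGEDL", "MAKGQDL"))   # 6 matches, 1 mismatch -> 5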
The Tertiary Structures of Homologous Proteins
In cases where the 3-D structures of homologous proteins are
known, structure is conserved in evolution more than is primary
protein sequence. Often changes are concentrated in the surface
loops of the protein. This observation provides the rationale for
using the known structure of the homologous protein to predict
the unknown structure.
Modeling by Homology
The aligned sequences are used to predict where one should
create insertions, deletions, and replacements in the known
structure. This is done using computer graphics; a widely used
program is called FRODO. Initial models are then refined by
energy minimization programs on the computer to avoid steric
clashes. References are given to research that has used this
approach.
Modeling Using Multiple Structures
Since only about 100 out of the 300 3-D structures in the
Brookhaven databank are nonhomologous, there is often more than
one structure available to use as a basis for modeling. Several
approaches for simultaneously using different model structures to
predict the unknown structure are discussed.
Insertions & Deletions in Loop Regions
Loops are the most difficult regions to construct because the
majority of significant differences occur in these regions.
Databases and examples for loop construction are discussed in
some detail. Particular attention is given to beta-hairpin loops
(loops between two adjacent antiparallel beta strands). Ab initio
calculations using molecular dynamics are recommended when no
structure sufficiently similar for use in modeling can be found.
Energy Minimization and Molecular Dynamics
"Where the proteins have sequence homology of 50% or
more, the models predicted by the methods described here will be
probably correct to better than 1 Angstrom although individual
side chains may be more in error." Some improvement in
accuracy can be had by using such energy minimization programs as
AMBER or CHARMM. Since energy minimization as it is now done
finds only a local minimum, it is only expected to be useful if
the errors in the starting structure are less than an Angstrom
[Note: This seems a quite stringent requirement to meet-JBL].
How Correct are the Models?
Several cases where modeled proteins have been subsequently
studied by X-ray are discussed, with the results shown to have
been mixed. It is suggested that the distributions of the
hydrophobic side-chains and the nature of the solvent-accessible
surfaces are the most sensitive indicators of the reasonableness
of the model. [It should be noted that this whole modeling
procedure, although useful in some situations, is still very
inexact and requires a great deal of experience and knowledge to
interpret.-JBL]
Future Developments
Two challenges are discussed: (1) To extend the method to
cases where there is no obvious sequence homology, but there is
reason to suspect that the structure is a member of a known
family of structural motifs, and (2) To design novel molecules.
"Knowledge-based prediction of protein structure"
by J. M. Thornton of Birkbeck College, England; from the Miami
meeting.
Dr. Thornton notes that the delicate balance between properly
folded and alternate structures of a protein has been impossible
to predict so far from energy minimization, so that people have
tried to use empirical predictions based on the 300 or so protein
structures that have been experimentally determined. These have
been of limited value. Even simple predictions of secondary
structure only, rather than complete tertiary structures, are
only about 60% accurate. By considering in more detail
characteristics of a particular type of secondary structure
(beta-beta hairpin turns), certain sequence features associated
with specific varieties of this structure were identified that
improved prediction a bit (to over 70%). This is progress, but
the empirical approach to protein sequence-structure
relationships has a long way to go before we can use it to help
design first generation assemblers. A general review of this
process of knowledge-based prediction of protein structure, i.e.,
modeling the structure of an unknown protein based on the known
structure of a protein of similar sequence, was published last
year: "Knowledge-based prediction of protein structures and
the design of novel molecules" by T. L. Blundell, B. L.
Sibanda, M. J. E. Sternberg, J. M. Thornton. 1987. Nature
326:347-352. [NOTE: This paper is abstracted above.]
"Protein structure: the shape of things to
come?"--A "News and Views" editorial by Janet M.
Thornton in the 1 September 1988 issue of Nature
335:10-11.
The following is a summary of the above article:
She observes that attempts to predict structure from
sequence have been shifting from calculations using energy
functions to the "more pragmatic" structure recognition
by pattern matching, as exemplified by a paper by Rooman and
Wodak in the same issue (see below). "The good news is that
short sequence patterns which reliably define secondary structure
do exist. The bad news is that the prediction accuracy ... is
still only about 60 per cent." Apparently the main problem
is the relative scarcity of structural data. The 60% accuracy is
especially discouraging since the original attempts by Chou and
Fasman and by Garnier, both in 1978, achieved this level of
accuracy using only the 20 protein structures that were known
then (vs. >300 today), and used only the helix- or
sheet-forming properties of individual residues rather than those
of short sequences. Results from several years ago are quoted
showing that only 20% of identical pentapeptides in unrelated
proteins of known structure adopt the same secondary structure. The paper
below does an automated and systematic search of the structure
database, and identifies some peptides that are very predictive,
although most peptide sequences are not very predictive.
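As a toy illustration of the propensity idea behind the Chou and
Fasman approach, the Python sketch below assigns each residue an
empirical helix propensity and calls a stretch helical when a
six-residue window averages above 1. The numbers and window rule
are illustrative only, not the published Chou-Fasman parameters,
and the real method involves further nucleation and extension
rules.

    HELIX_PROPENSITY = {  # illustrative values, partial table
        "A": 1.42, "E": 1.51, "L": 1.21, "M": 1.45,
        "G": 0.57, "P": 0.57, "S": 0.77, "K": 1.16,
    }

    def predict_helix(seq, window=6):
        """Mark residue i 'H' if some window covering it averages > 1."""
        calls = ["-"] * len(seq)
        for i in range(len(seq) - window + 1):
            segment = seq[i:i + window]
            avg = sum(HELIX_PROPENSITY.get(r, 1.0) for r in segment) / window
            if avg > 1.0:
                for j in range(i, i + window):
                    calls[j] = "H"
        return "".join(calls)

    print(predict_helix("AEALMKEAGPSSGA"))   # -> "HHHHHHHHHH----"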
The reason that most patterns are not predictive is apparently
that most occur only a few times in the database so that patterns
cannot be adequately recognized. It is suggested that sequence
patterns should occur about 15 times for accurate prediction,
while most 3-residue sequences occur < 3 times in the current
database. Rooman and Wodak speculate that a database of 1500
structures will be needed for adequate prediction of secondary
structure, which, optimistically, could take 20 years to produce.
Thornton suggests that prediction might be improved by (1)
incorporating recent sequence interpretation techniques designed
to recognize very distantly related proteins so that the known
structure of one could be used to model the other, and (2) using
what is known in some cases about elements of super-secondary
structure--motifs of clustered secondary structure elements
associated with particular classes of proteins. She also mentions
a recent article in which neural networks were trained to
recognize secondary structure from sequences, and got predictions
that were 64% accurate.
Jim Lewis is a molecular biologist at Oncogen in Seattle.
He is also the leader of the PATH HyperCard Project, a project of
the Seattle Nanotechnology Study Group, which is working on a
HyperCard stack on nanotechnology. The full text of Dr. Lewis's
summary from which this adaptation was made is available from the
Foresight Institute; send a stamped, self-addressed envelope with
65 cents postage.
From Foresight Update 6, originally
published 1 August 1989.
Foresight thanks Dave Kilbridge for converting Update 6 to
html for this web page.