Structural Genomics: a Slice of the
Proteomics Pie
An ambitious effort to speed protein analysis and to
discern structural principles focuses, in part, on microbial proteins
Alisa Zapp Machalek
If bacterial genetics helped lead to sequencing the
human genome, the enthusiasm that first focused narrowly on human
genomics broadened and led to a surge in microbial genomics. Now, a good
deal of like-minded enthusiasm for large-scale analytic ventures is
shifting to "proteomics": in simple terms, a wholesale
attempt to understand just about everything about proteins. Within this
gargantuan and in some ways amorphous undertaking, structural genomics
represents a sector in which microbes provide a ready means to address
questions about proteins that eventually can be applied more broadly to
those found within a wide range of more complex organisms.
Figure 1
Structural genomics entails a synergistic intertwining
of structural biology, genomics, and computational modeling. Its
practitioners are committed to cranking out, at industrial speed,
thousands of carefully selected structures for proteins from which most
others can be predicted computationally. Why determine protein
structures on such a scale? Genes provide recipes for proteins, which do
most of the real cellular work. To understand how that work proceeds, it
is essential to move beyond the recipes by analyzing the
three-dimensional structures of many proteins. Such structures, usually
displayed as computer-generated images, reveal important structural
features of proteins, such as the relative locations of component atoms
that determine the surface landscape, inner architecture,
electrochemical properties, and, for enzymes, the active site.
Detailed protein structures can teach biological lessons
in three-dimensional space. For instance, by comparing the structures of
two proteins that do the same job in different organisms, researchers
can discover which amino acids or surface features are critical to the
function of the proteins. Such studies also often suggest evolutionary
relationships between proteins and, by extension, between organisms from
which they derive. And, if the proteins are associated with a disease or
come from a pathogenic microbe, their structures often become candidates
for structure-based drug design efforts.
Structural genomics also could help to crack the
long-sought protein-folding code, so far an important but elusive goal
within biophysics. If at all coherent, this code could more readily
reveal how genetic sequences determine the final, folded shape of
corresponding proteins. Because deciphering gene sequences usually is
much easier than solving protein structures, access to such a code would
greatly accelerate efforts to predict protein structures and determine
how they serve biological processes.
Structural Genomics Efforts Are International in
Scope, with Local Flair
Currently, there is both public and private-sector
funding for structural genomics projects, with programs under way in the
United States, several countries within the European Union, Japan,
China, Canada, and Israel. Much of the private-sector focus among
pharmaceutical and biotech companies conducting structural genomics
research is on drug discovery projects.
Officials at the National Institute of General Medical
Sciences (NIGMS), a component of the National Institutes of Health in
Bethesda, Md., launched a structural genomics program in September 2000.
Under its Protein Structure Initiative (PSI), NIGMS expects to spend
almost $200 million during the initial five-year pilot phase, with
tentative plans for five additional years of larger-scale, high-speed
structural genomics, according to John Norvell, who directs the PSI. The
knowledge that will be generated is expected to be put to "a
variety of practical applications in structure-based drug design,
biotechnology, agriculture, and development of new medical
devices," he says.
Protein Data Bank
NIGMS currently supports seven structural genomics pilot
research centers (see box, p. 442), including one cofunded by the
National Institute of Allergy and Infectious Diseases, and additional
centers are expected to be funded shortly. The pilot centers include
hundreds of researchers in several countries, who will work on
developing new techniques to streamline and accelerate every step in
structural genomics, from choosing which protein structures to solve,
to cloning and purifying the proteins, determining the structures, and
depositing the data into the Protein Data Bank (PDB), an online database
of macromolecular structures.
Traditional approaches to determining protein structures
often prove difficult and time-consuming. Currently, solving the
structure of a moderately sized, soluble protein takes at least many
weeks, sometimes several years, and costs an average of $100,000 per
molecule. Complex, multisubunit proteins and membrane proteins,
which constitute up to a third of all proteins, can be significantly
more recalcitrant to such analysis. Within the pilot period, each NIGMS-funded
center should ramp up its operations to solve 100 to 200 structures a
year and significantly reduce the cost per structure, says Norvell.
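The throughput and cost figures quoted above can be set side by side with a little arithmetic. The sketch below is only a back-of-the-envelope illustration. It assumes that all seven pilot centers reach the stated 100 to 200 structures a year, and that the "almost $200 million" is spread evenly over the five pilot years, neither of which the figures above guarantee. The variable and function names are ours.

```python
# Figures quoted in the article; the even yearly split is our assumption.
CENTERS = 7                         # NIGMS pilot centers
STRUCTURES_PER_CENTER = (100, 200)  # target yearly output per center (low, high)
TRADITIONAL_COST = 100_000          # average dollars per structure today
PSI_BUDGET = 200_000_000            # "almost $200 million", used as an upper bound
PILOT_YEARS = 5


def yearly_output(centers=CENTERS, per_center=STRUCTURES_PER_CENTER):
    """Return (low, high) structures per year across all centers."""
    low, high = per_center
    return centers * low, centers * high


def yearly_cost_at_traditional_price(output, cost=TRADITIONAL_COST):
    """Return (low, high) yearly cost if each structure cost the current average."""
    return tuple(n * cost for n in output)


output = yearly_output()
cost = yearly_cost_at_traditional_price(output)
budget_per_year = PSI_BUDGET / PILOT_YEARS  # assumes an even split across years

print(f"Structures per year: {output[0]} to {output[1]}")
print(f"Cost at today's price: ${cost[0]:,} to ${cost[1]:,} per year")
print(f"Budget per year (upper bound): ${budget_per_year:,.0f}")
```

Under these assumptions, full-rate output at the traditional price would cost $70 million to $140 million a year, against at most about $40 million a year available. That gap is one way to see why a lower cost per structure is part of the goal.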
Fruits of these Research Efforts Represent an
Important Public Resource
Figure 2
"Structural genomics is a shortcut to determining
all the protein shapes that exist in nature," says Norvell of NIGMS.
This overall approach relies on a belief in nature's economy: that
countless different proteins found throughout nature fold into a limited
number of shapes and that all, or nearly all, natural protein structures
can be described based on these shapes.
"One long-term goal [of the NIGMS project] is to
develop a library of all these shapes," says Norvell. This library,
whose data will be freely available to the scientific community, will
integrate structural and genetic information and any available
biochemical information for each protein entry. Key to this strategy is
grouping proteins into "families" with similar structures
inferred from their amino acid sequences. Then, based on the known
structure of at least one protein in a family and using a computational
technique called homology modeling, researchers can make an incisive
guess about the shapes of other proteins in the family.
If things go as planned, researchers will be able to use
the library to calculate the approximate structure of any unknown
protein. All they'll have to do is "plug in" the protein's
gene sequence and "look up" its predicted structure, along
with any associated information about its likely function. Sometimes,
however, different amino acid sequences form the same three-dimensional
structure. On the flip side, some proteins, such as prions, consisting
of the same sequence, can fold into distinct conformations.
But these examples are outliers. Most structural
biologists believe that they can reasonably model a protein based on the
structure of another protein whose sequence is at least 30% identical.
And as structural information accumulates, these modeling programs are
bound to improve. To build the public library of proteins, NIGMS expects
that its structural genomics centers will determine the structures of
one or two representative proteins from each of thousands of different
structural families. Over 10 years, this effort should yield around
10,000 unique protein structures.
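The 30% rule of thumb can be made concrete with a short sketch. The code below is a minimal illustration, not the method used by any of the centers. It assumes the two sequences have already been aligned by some other tool (with "-" marking gaps), and it counts identity only over columns where both sequences have a residue, which is one of several conventions in use. The function names and the example sequences are hypothetical.

```python
IDENTITY_THRESHOLD = 30.0  # percent, the rule of thumb quoted in the article


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def is_modeling_candidate(aligned_target: str, aligned_template: str,
                          threshold: float = IDENTITY_THRESHOLD) -> bool:
    """True if the template meets the identity threshold for the target."""
    return percent_identity(aligned_target, aligned_template) >= threshold


# Made-up aligned fragments, for illustration only.
target = "MKT-AYIAKQR"
template = "MKSLAYLAGQR"
print(percent_identity(target, template))
print(is_modeling_candidate(target, template))
```

In practice the alignment step matters as much as the count itself, and real modeling pipelines weigh more than a single identity figure; the threshold here simply mirrors the number cited above.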
Although the PDB now boasts about 15,000 structures,
fewer than 4,000 of these represent unique proteins. The others consist
of minor, but important, variations. Even more telling is that the
database contains representatives of only about 1,500 of the estimated
tens of thousands of families. The PSI goal of solving 10,000 unique
protein structures from almost as many families would more than triple
the number of unique structures available and would significantly
broaden the coverage of structural families. "Our goal is to have
complete coverage of protein structure space," Norvell says. One
catch at this early stage is that there are many different ways to group
proteins into families. However, the five-year pilot period of the NIGMS
effort should provide time to determine whether any categorizing method
is better than the others.
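The "more than triple" claim follows directly from the numbers quoted above, as this small sketch shows. It treats "fewer than 4,000" as an upper bound on the current count of unique structures, so the resulting ratio is a lower bound; the variable names are ours.

```python
# Figures quoted in the article.
CURRENT_UNIQUE = 4_000  # "fewer than 4,000" unique proteins in the PDB (upper bound)
PSI_TARGET = 10_000     # unique structures the PSI aims to add over 10 years

total_unique = CURRENT_UNIQUE + PSI_TARGET
growth_factor = total_unique / CURRENT_UNIQUE  # a lower bound on the true factor

print(f"Unique structures after the PSI: about {total_unique:,}")
print(f"Growth factor: at least {growth_factor:.1f}x")
```

Because the current count is below 4,000, the real multiple would be somewhat higher than the figure computed here.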
Another focus of the project is to identify new
"folds." Proteins with the same fold have similar overall
shapes but no detectable sequence similarity. Such proteins have the
same types of structural components connected in the same order.
Examples include zinc fingers and specific α-β barrels and
4-helix bundles. Studying folds could reveal the physical and chemical
principles that determine how proteins form their three-dimensional
structures. Scientists estimate there are only a few thousand folds, and
only 700 of these are represented in the PDB.
Andrzej Joachimiak, a crystallographer at Argonne
National Laboratory in Argonne, Ill., who leads the Midwest Center for
Structural Genomics, envisions that once researchers have discovered
many of the folds found among natural proteins, they'll be able to
synthesize new ones with useful properties. "Once we know the
structures of about 20,000 different proteins, I think we'll be able to
generate new folds," he says. "Some people say we're
restricted to 2,000 folds. But why? If you remove this restriction, what
can we produce?"
One potential way to generate new folds is to use
nonnatural amino acids. "We might be able to make much better
enzymes if we start using nonnatural amino acids," such as those
that contain phosphate, fluoride, or other halogens, Joachimiak says.
"We may be able to do different reactions that require higher
energies." And, once scientists can better predict the fold of a
protein from its sequence, they'll be able to calculate the behavior of
an artificial protein without having to synthesize and test it, he adds.
"It will open a totally new era in protein chemistry and
catalysis."
Protein Analysis Moving from Hand-Made to
Mass-Produced
To speed protein analysis, participating researchers are
developing new procedures or robots to accomplish tasks that
traditionally required painstaking efforts by individual researchers, often
lots of them. The goal is rather like advancing from hand-crafted
automobiles to assembly line production, providing plenty of cars for
the general population and, in turn, changing the entire transportation
system. A key early hurdle for researchers working in structural
genomics is to identify and eliminate important analytic bottlenecks.
Figure 3
"Probably the [tightest] bottleneck right now is
the purification of proteins," says Thomas Terwilliger, a
crystallographer at Los Alamos National Laboratory who leads the
tuberculosis (TB) Structural Genomics Consortium. He spoke during the
Second International Structural Genomics Meeting, held in April 2001 at
Airlie Conference Center in Virginia near Washington, D.C. (see box,
above). To overcome some of the common difficulties encountered in
isolating proteins, members of the group are using a technique developed
by Geoffrey Waldo, a protein chemist at Los Alamos and a member of the
TB Consortium.
Waldo's technique involves marking well-behaved proteins
with the green fluorescent protein from jellyfish and using robotics to
generate thousands of slightly different versions of the protein and
then identifying which of those versions is best suited for
crystallographic analysis. "The success of these efforts and those
of many other researchers around the world is one of the big reasons why
we think that large-scale structure determination will be
possible," Terwilliger says.
Bi-Cheng Wang, a crystallographer at the University of
Georgia in Athens, Ga., who leads the Southeast Collaboratory for
Structural Genomics, also strongly emphasizes technology development.
"We want to create a recipe for the best way to collect data,"
Wang says. He developed a technique he calls "direct
crystallography" to help resolve phase ambiguities, an important
and perhaps the greatest problem researchers face when analyzing protein
crystallographic data. "I suggest we can remove the phase ambiguity
by computing," he says. The typically used but more laborious
procedure for doing so requires collecting additional data from fresh
crystals of a protein that are doped with selenium or a heavy metal.
"[The traditional method] is like a 10-course
Chinese dinner," Wang says. "It's very good, but it requires a
lot of time. The direct method is like going to McDonald's: you can get
satisfied in much less time. Once people see that you can achieve
results very quickly, the culture will change."
Participants Given Latitude when Choosing Target
Proteins
Leaders of the structural genomics centers are free to
choose their target proteins, but NIGMS requires that they use a
genome-directed approach in doing so. The TB Structural Genomics
Consortium is the only NIGMS-supported center focused on advancing
efforts to treat a major public health threat. "If scientists and
pharmaceutical companies had access to the structures of hundreds of TB
proteins, they might be able to use them to design new anti-TB drugs
more effectively," Terwilliger says, pointing out that the
consortium has already identified about 1,000 TB proteins that seem
especially well suited as anti-TB drug targets.
Several of the other centers focus on microbes as model
systems to understand basic structure-function relationships and to
develop structural genomics approaches. For instance, members of the
Southeast Collaboratory for Structural Genomics plan to determine the
structure of every protein in Pyrococcus furiosus, a
hyperthermophilic anaerobe first isolated from the hot underwater slopes
of a volcano in Italy. By setting their sights on the entire proteome of
P. furiosus, rather than tackling only the easy-to-determine
protein structures, the group will "explore the full breadth of
obstacles to high-throughput structure determination," says its
leader Wang. P. furiosus has a relatively small genome containing
about 2,200 open reading frames (ORFs). "We feel it is an excellent
model prokaryote system for us to use to learn about protein folding,
since it may contain all possible protein folds in its small but
complete genome," he says.
Sung-Hou Kim, who leads the Berkeley Structural Genomics
Center, adopted a similar strategy in choosing to study the
"minimal genomes" of Mycoplasma genitalium and Mycoplasma
pneumoniae. Using a different approach, Joachimiak and his
collaborators at the Midwest Center for Structural Genomics are
comparing homologous proteins in dozens of microbes, including
pathogenic species such as Streptococcus pneumoniae, Haemophilus
influenzae, and Neisseria gonorrhoeae.
"We're interested in what makes an organism
unique," Joachimiak says. Specifically, his group is attempting to
tease out metabolic pathways that are characteristic of an individual
pathogen or class of organisms. "This is of particular importance
for drug design," he says, "because if there are critical
pathways, and we can inhibit any enzyme in one of these pathways, we can
inhibit that bug."
Even in its first flowering, structural genomics is
returning to its roots: microbial genetics. Its proponents predict that
the field will spur advances in areas ranging from protein chemistry to
structure-based drug design and computational modeling. Although it is
too early to foresee its eventual impact, structural genomics promises
to deliver a bounty of data and, very likely, to open up a whole new
chapter in biomedical research.