ASM News

Structural Genomics: a Slice of the Proteomics Pie

An ambitious effort to speed protein analysis and to discern structural principles focuses, in part, on microbial proteins

Alisa Zapp Machalek

If bacterial genetics helped lead to sequencing the human genome, the enthusiasm that first focused narrowly on human genomics broadened and led to a surge in microbial genomics. Now, a good deal of like-minded enthusiasm for large-scale analytic ventures is shifting to "proteomics"—in simple terms, a wholesale attempt to understand just about everything about proteins. Within this gargantuan and in some ways amorphous undertaking, structural genomics represents a sector in which microbes provide a ready means to address questions about proteins that eventually can be applied more broadly to those found within a wide range of more complex organisms.

Figure 1

Structural genomics entails a synergistic intertwining of structural biology, genomics, and computational modeling. Its practitioners are committed to cranking out, at industrial speed, thousands of carefully selected structures for proteins from which most others can be predicted computationally. Why determine protein structures on such a scale? Genes provide recipes for proteins, which do most of the real cellular work. To understand how that work proceeds, it is essential to move beyond the recipes by analyzing the three-dimensional structures of many proteins. Such structures, usually displayed as computer-generated images, reveal important structural features of proteins, such as the relative locations of component atoms that determine the surface landscape, inner architecture, electrochemical properties, and, for enzymes, the active site.

Detailed protein structures can teach biological lessons in three-dimensional space. For instance, by comparing the structures of two proteins that do the same job in different organisms, researchers can discover which amino acids or surface features are critical to the function of the proteins. Such studies also often suggest evolutionary relationships between proteins and, by extension, between organisms from which they derive. And, if the proteins are associated with a disease or come from a pathogenic microbe, their structures often become candidates for structure-based drug design efforts.

Structural genomics also could help to crack the long-sought protein-folding code—so far an important but elusive goal within biophysics. If at all coherent, this code could more readily reveal how genetic sequences determine the final, folded shape of corresponding proteins. Because deciphering gene sequences usually is much easier than solving protein structures, access to such a code would greatly accelerate efforts to predict protein structures and determine how they serve biological processes.

Structural Genomics Efforts Are International in Scope, with Local Flair

Currently, there is both public and private-sector funding for structural genomics projects, with programs under way in the United States, several countries within the European Union, Japan, China, Canada, and Israel. Much of the private-sector focus among pharmaceutical and biotech companies conducting structural genomics research is on drug discovery projects.

Officials at the National Institute of General Medical Sciences (NIGMS), a component of the National Institutes of Health in Bethesda, Md., launched a structural genomics program in September 2000. Under its Protein Structure Initiative (PSI), NIGMS expects to spend almost $200 million during the initial five-year pilot phase, with tentative plans for five additional years of larger-scale, high-speed structural genomics, according to John Norvell, who directs the PSI. The knowledge that will be generated is expected to be put to "a variety of practical applications in structure-based drug design, biotechnology, agriculture, and development of new medical devices," he says.

Protein Data Bank

NIGMS currently supports seven structural genomics pilot research centers (see box, p. 442), including one cofunded by the National Institute of Allergy and Infectious Diseases, and additional centers are expected to be funded shortly. The pilot centers include hundreds of researchers in several countries, who will work on developing new techniques to streamline and accelerate every step in structural genomics—from choosing which protein structures to solve, to cloning and purifying the proteins, determining the structures, and depositing the data into the Protein Data Bank (PDB), an online database of macromolecular structures.

Traditional approaches to determining protein structures often prove difficult and time-consuming. Currently, solving the structure of a moderately sized, soluble protein takes many weeks at a minimum and sometimes several years, at an average cost of $100,000 per molecule. Complex, multisubunit proteins and membrane proteins, which constitute up to a third of all proteins, can be significantly more recalcitrant to such analysis. Within the pilot period, each NIGMS-funded center should ramp up its operations to solve 100 to 200 structures a year and significantly reduce the cost per structure, says Norvell.

Fruits of these Research Efforts Represent an Important Public Resource

Figure 2

"Structural genomics is a shortcut to determining all the protein shapes that exist in nature," says Norvell of NIGMS. This overall approach relies on a belief in nature's economy—that countless different proteins found throughout nature fold into a limited number of shapes and that all, or nearly all, natural protein structures can be described based on these shapes.

"One long-term goal [of the NIGMS project] is to develop a library of all these shapes," says Norvell. This library, whose data will be freely available to the scientific community, will integrate structural and genetic information and any available biochemical information for each protein entry. Key to this strategy is grouping proteins into "families" with similar structures inferred from their amino acid sequences. Then, based on the known structure of at least one protein in a family and using a computational technique called homology modeling, researchers can make an incisive guess about the shapes of other proteins in the family.

If things go as planned, researchers will be able to use the library to calculate the approximate structure of any unknown protein. All they'll have to do is "plug in" the protein's gene sequence and "look up" its predicted structure—along with any associated information about its likely function. Sometimes, however, different amino acid sequences form the same three-dimensional structure. On the flip side, some proteins, such as prions, consisting of the same sequence, can fold into distinct conformations.

But these examples are outliers. Most structural biologists believe that they can reasonably model a protein based on the structure of another protein whose sequence is at least 30% identical. And as structural information accumulates, these modeling programs are bound to improve. To build the public library of proteins, NIGMS expects that its structural genomics centers will determine the structures of one or two representative proteins from each of thousands of different structural families. Over 10 years, this effort should yield around 10,000 unique protein structures.
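The 30% rule of thumb can be made concrete with a short sketch. It assumes the two sequences have already been aligned (gaps written as "-") and counts identity only over columns where both sequences have a residue, which is one common convention among several. The function names and the example sequences are hypothetical, and real homology modeling involves far more than this single check.

```python
# Minimal sketch: percent identity between two pre-aligned protein sequences.
# Assumption: gaps are written as "-" and gapped columns are excluded
# from the comparison (one convention among several in use).

MODELING_THRESHOLD = 30.0  # percent identity cited in the article


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Return the percentage of ungapped aligned positions that match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def is_modeling_candidate(aligned_target: str, aligned_template: str,
                          threshold: float = MODELING_THRESHOLD) -> bool:
    """True if the template meets the identity threshold for the target."""
    return percent_identity(aligned_target, aligned_template) >= threshold


if __name__ == "__main__":
    # Hypothetical toy sequences, not real proteins.
    print(percent_identity("ACDE", "ACDF"))
    print(is_modeling_candidate("AAAA", "ACDE"))
```

A template that clears the threshold is only a starting point; the quality of the resulting model still depends on the alignment and on the modeling program used.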

Although the PDB now boasts about 15,000 structures, fewer than 4,000 of these represent unique proteins. The others consist of minor—but important—variations. Even more telling is that the database contains representatives of only about 1,500 of the estimated tens of thousands of families. The PSI goal of solving 10,000 unique protein structures from almost as many families would more than triple the number of unique structures available and would significantly broaden the coverage of structural families. "Our goal is to have complete coverage of protein structure space," Norvell says. One catch at this early stage is that there are many different ways to group proteins into families. However, the five-year pilot period of the NIGMS effort should provide time to determine whether any categorizing method is better than the others.
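The arithmetic behind "more than triple" can be checked in a few lines. The sketch below uses only the figures quoted above, treating "fewer than 4,000" as an upper bound of 4,000. The total number of families is given only as "tens of thousands," so the coverage function takes that total as an argument, and the value of 20,000 used in the example is purely a hypothetical placeholder.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Assumption: "fewer than 4,000" unique proteins is treated as 4,000,
# which makes the computed growth factor a lower bound.

PDB_UNIQUE = 4_000        # unique proteins in the PDB (upper bound)
PSI_TARGET = 10_000       # unique structures the PSI aims to add
FAMILIES_COVERED = 1_500  # families with a representative in the PDB


def growth_factor(current_unique: int, added_unique: int) -> float:
    """Ratio of unique structures after the additions to those before."""
    return (current_unique + added_unique) / current_unique


def family_coverage(covered: int, total_families: int) -> float:
    """Fraction of families with at least one known structure."""
    return covered / total_families


if __name__ == "__main__":
    print(growth_factor(PDB_UNIQUE, PSI_TARGET))
    # 20_000 is a hypothetical total; the article says only
    # "tens of thousands" of families.
    print(family_coverage(FAMILIES_COVERED, 20_000))
```

Because the true count of unique proteins is below 4,000, the actual growth factor would be somewhat higher than the value computed here.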

Another focus of the project is to identify new "folds." Proteins with the same fold have similar overall shapes but no detectable sequence similarity. Such proteins have the same types of structural components connected in the same order. Examples include zinc fingers and specific α/β barrels and 4-helix bundles. Studying folds could reveal the physical and chemical principles that determine how proteins form their three-dimensional structures. Scientists estimate there are only a few thousand folds, and only 700 of these are represented in the PDB.

Andrzej Joachimiak, a crystallographer at Argonne National Laboratory in Argonne, Ill., who leads the Midwest Center for Structural Genomics, envisions that once researchers have discovered many of the folds found among natural proteins, they'll be able to synthesize new ones with useful properties. "Once we know the structures of about 20,000 different proteins, I think we'll be able to generate new folds," he says. "Some people say we're restricted to 2,000 folds. But why? If you remove this restriction, what can we produce?"

One potential way to generate new folds is to use nonnatural amino acids. "We might be able to make much better enzymes if we start using nonnatural amino acids," such as those that contain phosphate, fluoride, or other halogens, Joachimiak says. "We may be able to do different reactions that require higher energies." And, once scientists can better predict the fold of a protein from its sequence, they'll be able to calculate the behavior of an artificial protein without having to synthesize and test it, he adds. "It will open a totally new era in protein chemistry and catalysis."

Protein Analysis Moving from Hand-Made to Mass-Produced

To speed protein analysis, participating researchers are developing new procedures or robots to accomplish tasks that traditionally required painstaking efforts by individual researchers—often, lots of them. The goal is rather like advancing from hand-crafted automobiles to assembly line production, providing plenty of cars for the general population and, in turn, changing the entire transportation system. A key early hurdle for researchers working in structural genomics is to identify and eliminate important analytic bottlenecks.

Figure 3

"Probably the [tightest] bottleneck right now is the purification of proteins," says Thomas Terwilliger, a crystallographer at Los Alamos National Laboratory who leads the tuberculosis (TB) Structural Genomics Consortium. He spoke during the Second International Structural Genomics Meeting, held in April 2001 at Airlie Conference Center in Virginia near Washington, D.C. (see box, above). To overcome some of these common difficulties encountered in isolating proteins, members of the group are using a technique developed by Geoffrey Waldo, a protein chemist at Los Alamos and a member of the TB Consortium.

Waldo's technique involves marking well-behaved proteins with the green fluorescent protein from jellyfish and using robotics to generate thousands of slightly different versions of the protein and then identifying which of those versions is best suited for crystallographic analysis. "The success of these efforts and those of many other researchers around the world is one of the big reasons why we think that large-scale structure determination will be possible," Terwilliger says.

Bi-Cheng Wang, a crystallographer at the University of Georgia in Athens, Ga., who leads the Southeast Collaboratory for Structural Genomics, also strongly emphasizes technology development. "We want to create a recipe for the best way to collect data," Wang says. He developed a technique he calls "direct crystallography" to help resolve phase ambiguities—an important and perhaps the greatest problem researchers face when analyzing protein crystallographic data. "I suggest we can remove the phase ambiguity by computing," he says. The typically used but more laborious procedure for doing so requires collecting additional data from fresh crystals of a protein that are doped with selenium or a heavy metal.

"[The traditional method] is like a 10-course Chinese dinner," Wang says. "It's very good, but it requires a lot of time. The direct method is like going to McDonald's—you can get satisfied in much less time. Once people see that you can achieve results very quickly, the culture will change."

Participants Given Latitude when Choosing Target Proteins

Leaders of the structural genomics centers are free to choose their target proteins, but NIGMS requires that they use a genome-directed approach in doing so. The TB Structural Genomics Consortium is the only NIGMS-supported center focused on advancing efforts to treat a major public health threat. "If scientists and pharmaceutical companies had access to the structures of hundreds of TB proteins, they might be able to use them to design new anti-TB drugs more effectively," Terwilliger says, pointing out that the consortium has already identified about 1,000 TB proteins that seem especially well suited as anti-TB drug targets.

Several of the other centers focus on microbes as model systems to understand basic structure-function relationships and to develop structural genomics approaches. For instance, members of the Southeast Collaboratory for Structural Genomics plan to determine the structure of every protein in Pyrococcus furiosus, a hyperthermophilic anaerobe first isolated from the hot underwater slopes of a volcano in Italy. By setting their sights on the entire proteome of P. furiosus, rather than tackling only the easy-to-determine protein structures, the group will "explore the full breadth of obstacles to high-throughput structure determination," says its leader Wang. P. furiosus has a relatively small genome containing about 2,200 open reading frames (ORFs). "We feel it is an excellent model prokaryote system for us to use to learn about protein folding, since it may contain all possible protein folds in its small but complete genome," he says.

Sung-Hou Kim, who leads the Berkeley Structural Genomics Center, adopted a similar strategy in choosing to study the "minimal genomes" of Mycoplasma genitalium and Mycoplasma pneumoniae. Using a different approach, Joachimiak and his collaborators at the Midwest Center for Structural Genomics are comparing homologous proteins in dozens of microbes, including pathogenic species such as Streptococcus pneumoniae, Haemophilus influenzae, and Neisseria gonorrhoeae.

"We're interested in what makes an organism unique," Joachimiak says. Specifically, his group is attempting to tease out metabolic pathways that are characteristic of an individual pathogen or class of organisms. "This is of particular importance for drug design," he says, "because if there are critical pathways, and we can inhibit any enzyme in one of these pathways, we can inhibit that bug."

Even in its first flowering, structural genomics is returning to its roots in microbial genetics. Its proponents predict that the field will spur advances in areas ranging from protein chemistry to structure-based drug design and computational modeling. Although it is too early to foresee its eventual impact, structural genomics promises to deliver a bounty of data and, very likely, to open up a whole new chapter in biomedical research.

Last Modified: September 14, 2001
Copyright © 2001 American Society for Microbiology. All rights reserved.