ASM News


    Peter D. Karp is Director of the Bioinformatics Research Group, SRI International, Menlo Park, Calif. 


Many Microbial Genomes in Genbank Do Not Conform to the Genbank Standard

Some genome sequencing groups do not follow the Genbank standard when submitting data, and Genbank staff are not enforcing the standard.

Peter D. Karp

Many of us take for granted that every day we plug our computers into the Internet to surf the World Wide Web and to exchange e-mail and data with hundreds of other computer users, using hardware and software developed by dozens of different manufacturers. The Internet is based on dozens of networking communication standards that define conventions for data exchange. For example, one set of standards defines the manner in which a stream of data (such as a Web page) is broken into smaller data packets that can be routed through different Internet links on their way to their final destination, where they are reassembled into the original Web page. Another standard defines the Internet e-mail protocol, which in essence requires the sending e-mail program to communicate first the e-mail address of the sender, then the addresses of the recipients, and then the message body. The exchange of Web pages is governed by another standard, and the HTML language for encoding Web pages is defined in yet another.

The Internet works because thousands of hardware and software manufacturers have labored diligently to define these standards, and to follow them precisely. For example, while I was a computer-science graduate student in the 1980s, I spent some of my spare time (much to the chagrin of my advisors) developing an e-mail system called Pony Express (Karp and Kashtan, 1988) for the VAX/VMS computer system. Pony Express could speak the multiple e-mail protocols that existed in those early days when the Internet was competing with other networking standards, and could serve as an "e-mail switching station" that relayed an e-mail message from one networking world to another. If Pony Express failed to follow the e-mail-exchange protocols exactly, e-mail messages would not be accepted at their destination, and I would receive irate complaints from other e-mail software developers that my software was not following the rules. Later I understood those complaints when I spent many hours discovering that certain crashes in Pony Express were caused by receipt of e-mails from other people's software that was not following the rules.

Nonconformant Entries in Genbank

Fine attention to detail is required for successful exchange of data, and interoperation of software, just as it is required for successful wet-lab experimentation. Unfortunately, my recent inspection of complete Genbank records for microbes and higher organisms reveals that genome sequencing groups are not following the Genbank standard faithfully, and the Genbank staff are not enforcing the standard that they have defined. These deviations from the standard are significant and widespread, and virtually every sequencing group has found its own unique way to violate the standard. Figure 1 summarizes excerpts from microbial Genbank entries that violate the Genbank standard.

Consider how violation of the Genbank standard will impede software that attempts to compute with Genbank entries to perform comparative analyses of microbial genomes. Imagine an investigator who wants to study the occurrence of the biosynthetic pathway for tryptophan across all sequenced bacteria. This investigator might ask a software program to search all bacterial genomes for the gene product "tryptophan synthase." The software will expect to find the name of the gene product in the Genbank /product qualifier. If a genome-sequencing center fails to record the name of a predicted gene product in the Genbank entry, or fails to put it in the /product qualifier, or adds text that prevents exact matching of the enzyme name (e.g., "putative tryptophan synthase"), the program will not find the enzyme in that Genbank record, and an erroneous scientific conclusion about the distribution of this enzyme will result. Yes, a scientist can manually search other parts of the Genbank record for the enzyme name, or for its corresponding genes, to circumvent this problem, but imagine performing manual searches across a hundred genomes, for all the enzymes in every amino-acid biosynthesis pathway. The standards are meant to eliminate the need for such arduous searches.
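The failure described above can be made concrete with a minimal Python sketch. The three feature-table excerpts are hypothetical and heavily simplified (they are not taken from real Genbank records), and the helper names `products` and `has_product` are invented for this illustration. The sketch also assumes that each qualifier value fits on one line, which real entries do not guarantee.

```python
import re

# Hypothetical, simplified feature-table excerpts (NOT real Genbank records).
GENOME_A = '''
     CDS             1..1200
                     /gene="trpB"
                     /product="tryptophan synthase"
'''
GENOME_B = '''
     CDS             300..1500
                     /gene="trpB"
                     /product="putative tryptophan synthase"
'''
GENOME_C = '''
     CDS             50..1250
                     /gene="trpB"
                     /note="tryptophan synthase"
'''

# Matches a single-line /product qualifier and captures its quoted value.
PRODUCT_RE = re.compile(r'/product="([^"]*)"')


def products(entry_text):
    """Return every /product qualifier value found in the text."""
    return PRODUCT_RE.findall(entry_text)


def has_product(entry_text, name):
    """Exact, case-insensitive match against /product values only."""
    wanted = name.lower()
    return any(value.strip().lower() == wanted for value in products(entry_text))


genomes = {"A": GENOME_A, "B": GENOME_B, "C": GENOME_C}
hits = {label: has_product(text, "tryptophan synthase")
        for label, text in genomes.items()}
print(hits)  # {'A': True, 'B': False, 'C': False}
```

Only the first entry is found. The second is missed because of the added word "putative", and the third because the name was placed in /note rather than /product, even though all three hypothetical genomes carry the same gene.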

I suggest the following approaches to decrease the number of nonstandard entries in Genbank:

  • The maintainers of the Genbank/EMBL/DDBJ databases should implement programs that perform basic syntactic checking of newly submitted entries. Furthermore, each Genbank entry for a complete genome should be subjected to fifteen minutes of manual checking by Genbank staff because many of these problems can easily be detected in that amount of time.
  • Nonconformant entries should not be accepted in the database, period. Although this approach might delay submission of new entries to the databases for a few weeks while submitters fix the software that they use to generate Genbank entries, it is a small price to pay for interoperability in the long run.
  • Authors who have submitted nonconformant Genbank entries in the past should revise those entries to conform to the standard.
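As a rough illustration of the "basic syntactic checking" proposed in the first item, the following Python sketch flags three kinds of problems in feature-table qualifier lines: an unrecognized qualifier name, a text value that is not enclosed in double quotes, and an /EC_number value that does not have four dot-separated fields. The set of qualifier names is an assumed, deliberately incomplete subset chosen for this example, the function name `check_qualifiers` and the sample lines are hypothetical, and the sketch assumes one qualifier per line. It is not the checking procedure used by any database.

```python
import re

# Assumed, incomplete subset of qualifier names, for illustration only;
# the authoritative list is in the feature table definition.
KNOWN_QUALIFIERS = {"gene", "product", "note", "EC_number", "function",
                    "db_xref", "codon_start", "translation"}
# Qualifiers in the subset whose values are expected in double quotes.
QUOTED = {"gene", "product", "note", "EC_number", "function",
          "db_xref", "translation"}

QUALIFIER_RE = re.compile(r'^\s*/([A-Za-z_]+)(?:=(.*))?$')
# Four dot-separated fields; the last three may be a number or "-".
EC_RE = re.compile(r'^\d+(\.(\d+|-)){3}$')


def check_qualifiers(lines):
    """Return (line_number, message) pairs for suspicious qualifier lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        match = QUALIFIER_RE.match(line)
        if not match:
            continue  # not a qualifier line (e.g. a feature key line)
        name, value = match.group(1), match.group(2)
        if name not in KNOWN_QUALIFIERS:
            problems.append((number, f"unknown qualifier /{name}"))
            continue
        if name in QUOTED and (value is None or not value.startswith('"')):
            problems.append((number, f"/{name} value is not quoted"))
            continue
        if name == "EC_number" and not EC_RE.match(value.strip('"')):
            problems.append((number, "malformed /EC_number value"))
    return problems


# Hypothetical sample lines, not taken from a real entry.
SAMPLE = [
    '     CDS             1..1200',
    '                     /gene="trpB"',
    '                     /product=tryptophan synthase',
    '                     /EC_number="4.2.1"',
    '                     /enzyme="tryptophan synthase"',
]

problems = check_qualifiers(SAMPLE)
for number, message in problems:
    print(f"line {number}: {message}")
```

On the sample, lines 3, 4, and 5 are reported. A production checker would also need to handle values that wrap across lines and to validate feature keys and locations, which this sketch leaves out.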

ACKNOWLEDGMENT

This work was sponsored by grant 1-R01-RR07861-01 from the National Institutes of Health.

REFERENCES

Karp, P. D., and D. L. Kashtan. 1988. The Pony Express Network Mail Delivery System and the MM-32 Mail Manager. Presented at the Digital Equipment Computer Users Society Conference, Los Angeles.

Last Modified: October 12, 2001
Copyright © 2001 American Society for Microbiology. All rights reserved.