Many Microbial Genomes in Genbank Do Not Conform to
the Genbank Standard
Some genome sequencing groups do not follow the Genbank standard
when submitting data, and Genbank staff are not enforcing the standard
Peter D. Karp
Many of us take it for granted that every day we plug our
computers into the Internet to surf the World Wide Web and to exchange
e-mail and data with hundreds of other computer users, using hardware and
software developed by dozens of different manufacturers. The Internet is
based on dozens of networking communication standards that
define conventions for data exchange. For example, one set of standards
defines the manner in which a stream of data (such as a Web page) is
broken into smaller data packets that can be routed through different
Internet links on their way to their final destination, where they are
assembled back into the original Web page. Another standard defines the
Internet e-mail protocol, which in essence requires the sending e-mail
program to first communicate the e-mail address of the sender, and then
the addresses of the recipients, and then the message body. The exchange
of Web pages is governed by another standard, and the HTML language for
encoding Web pages is defined in yet another standard.
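The packet example can be made concrete with a small sketch. It is not an implementation of any real Internet protocol: the function names, the packet size, and the use of a bare sequence number are assumptions chosen for illustration. It shows only why both ends must agree on a convention. The receiver can rebuild the page solely because the sender numbered the packets in the way the receiver expects.

```python
import random

def split_into_packets(data: bytes, size: int) -> list[tuple[int, bytes]]:
    """Break data into (sequence_number, payload) packets of at most `size` bytes."""
    return [(seq, data[i:i + size])
            for seq, i in enumerate(range(0, len(data), size))]

def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the original data from packets that may arrive in any order."""
    # Sorting the tuples orders them by sequence number, which is unique.
    return b"".join(payload for _, payload in sorted(packets))

page = b"<html><body>Hello, Genbank!</body></html>"  # hypothetical Web page
packets = split_into_packets(page, size=8)
random.shuffle(packets)   # simulate packets taking different routes
restored = reassemble(packets)
print(restored == page)
```

If the sender omitted the sequence numbers, or numbered the packets differently from what the receiver assumes, the reassembled page would be scrambled.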
The Internet works because thousands of hardware and software
manufacturers have labored diligently to define these standards, and to
follow them precisely. For example, while I was a computer-science
graduate student in the 1980s, I spent some of my spare time (much to
the chagrin of my advisors) developing an e-mail system called Pony
Express (Karp and Kashtan, 1988) for the VAX/VMS computer system. Pony
Express could speak the multiple e-mail protocols that existed in those
early days when the Internet was competing with other networking
standards, and could serve as an "e-mail switching station"
that relayed an e-mail message from one networking world to another. If
Pony Express failed to follow the e-mail-exchange protocols exactly,
e-mail messages would not be accepted at their destination, and I would
receive irate complaints from other e-mail software developers that my
software was not following the rules. Later I understood those
complaints, when I spent many hours discovering that certain crashes in
Pony Express were caused by e-mail messages received from other people's
software that did not follow the rules.
Nonconformant Entries in Genbank
Fine attention to detail is required for successful exchange of data
and interoperation of software, just as it is required for successful
wet-lab experimentation. Unfortunately, my recent inspection of complete
Genbank records for microbes and higher organisms reveals that genome
sequencing groups are not following the Genbank standard faithfully, and
the Genbank staff are not enforcing the standard that they have defined.
These deviations from the standard are significant and widespread, and
virtually every sequencing group has found its own unique way to violate
the standard. Figure 1 summarizes excerpts from microbial Genbank
entries that violate the Genbank standard.
Consider how violation of the Genbank standard will impede software
that attempts to compute with Genbank entries to perform comparative
analyses of microbial genomes. Imagine an investigator who wants to
study the occurrence of the biosynthetic pathway for tryptophan across
all sequenced bacteria. This investigator might ask a software program
to search all bacterial genomes for the gene product "tryptophan
synthase." The software will expect to find the name of the gene
product in the Genbank /product qualifier. If a genome-sequencing
center fails to record the name of a predicted gene product in the
Genbank entry, or fails to put it in the /product qualifier, or
adds additional text that prevents exact matching of the enzyme name
(e.g., "putative tryptophan synthase"), the program will not
find the enzyme in that Genbank record, and an erroneous scientific
conclusion about the distribution of this enzyme will result. Yes, a
scientist can manually search other parts of the Genbank record for the
enzyme name, or for its corresponding genes, to circumvent this problem,
but imagine performing manual searches across a hundred genomes, for all
the enzymes in every amino-acid biosynthesis pathway. The standards are
meant to eliminate the need for such arduous searches.
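The failure modes just described can be illustrated with a minimal sketch. The three record excerpts are invented for illustration and are not taken from real Genbank entries, and the regular-expression parsing is a simplification of the feature-table format rather than a full Genbank parser. The helper names `products` and `has_product` are likewise hypothetical.

```python
import re

# Matches a /product qualifier and captures its quoted value,
# including values that wrap across lines.
PRODUCT_RE = re.compile(r'/product="([^"]*)"')

def products(record_text: str) -> list[str]:
    """Return all /product values, with line-wrap whitespace collapsed."""
    return [" ".join(value.split()) for value in PRODUCT_RE.findall(record_text)]

def has_product(record_text: str, enzyme: str) -> bool:
    """Exact, case-insensitive match of an enzyme name against /product values."""
    return enzyme.lower() in (p.lower() for p in products(record_text))

# Hypothetical excerpts from three genome entries.
records = {
    "conformant": '''
     CDS             1..807
                     /gene="trpA"
                     /product="tryptophan synthase"
''',
    "extra text": '''
     CDS             1..807
                     /gene="trpA"
                     /product="putative tryptophan synthase"
''',
    "wrong qualifier": '''
     CDS             1..807
                     /gene="trpA"
                     /note="tryptophan synthase"
''',
}

for label, text in records.items():
    print(label, has_product(text, "tryptophan synthase"))
```

Only the first excerpt is found. The other two describe the same enzyme, yet a program that relies on the /product qualifier reports it as absent from those genomes.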
I suggest the following approaches to decrease the number of
nonstandard entries in Genbank:
- The maintainers of the Genbank/EMBL/DDBJ databases should
implement programs that perform basic syntactic checking of newly
submitted entries. Furthermore, each Genbank entry for a complete
genome should be subjected to fifteen minutes of manual checking by
Genbank staff because many of these problems can easily be detected
in that amount of time.
- Nonconformant entries should not be accepted in the database,
period. Although this approach might delay submission of new entries
to the databases for a few weeks while submitters fix the software
that they use to generate Genbank entries, that delay is a small
price to pay for interoperability in the long run.
- Authors who have submitted nonconformant Genbank entries in the
past should revise those entries to conform to the standard.
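The first suggestion, basic syntactic checking of submitted entries, need not be elaborate to be useful. The sketch below is one hypothetical check, not a description of any software that the database maintainers use: it flags every CDS feature that lacks a /product qualifier. It assumes the usual flat-file layout in which feature keys are indented five spaces and qualifiers begin with a slash on more deeply indented lines. It ignores complications such as wrapped qualifier values that happen to begin with a slash. The function name and the sample feature table are invented for illustration.

```python
import re

FEATURE_RE = re.compile(r"^ {5}(\S+)\s+(\S+)")   # feature key and location
QUALIFIER_RE = re.compile(r"^\s+/(\w+)")         # qualifier name, e.g. product

def check_cds_products(feature_table: str) -> list[str]:
    """Return one message per CDS feature that has no /product qualifier."""
    problems = []
    current = None  # (key, location, set of qualifier names seen so far)

    def finish(feature):
        if feature and feature[0] == "CDS" and "product" not in feature[2]:
            problems.append(f"CDS at {feature[1]}: missing /product qualifier")

    for line in feature_table.splitlines():
        feature = FEATURE_RE.match(line)
        if feature:
            finish(current)  # close out the previous feature
            current = (feature.group(1), feature.group(2), set())
            continue
        qualifier = QUALIFIER_RE.match(line)
        if qualifier and current:
            current[2].add(qualifier.group(1))
    finish(current)
    return problems

KEY = " " * 5     # indentation of feature keys
QUAL = " " * 21   # indentation of qualifiers

# Hypothetical feature table with one conformant and one nonconformant CDS.
sample = "\n".join([
    KEY + "CDS" + " " * 13 + "1..807",
    QUAL + '/gene="trpA"',
    QUAL + '/product="tryptophan synthase"',
    KEY + "CDS" + " " * 13 + "900..1500",
    QUAL + '/gene="trpB"',
    QUAL + '/note="tryptophan synthase beta chain"',
])

for message in check_cds_products(sample):
    print(message)
```

A check of this kind runs in well under a second per genome, so it could be applied to every submission before a curator spends any time on manual inspection.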
ACKNOWLEDGMENT
This work was sponsored by grant 1-R01-RR07861-01 from
the National Institutes of Health.
REFERENCES
Karp, P. D., and D. L. Kashtan.
1988. The Pony Express Network Mail Delivery System and the MM-32 Mail
Manager. Presented at the Digital Equipment Computer Users Society
Conference, Los Angeles.