The Complete E. coli Genome Sequence

The reference genomic DNA sequence currently represented in EcoGene is that of E. coli K-12 strain MG1655, completely sequenced by Blattner et al., 1997. Hayashi et al., 2006, resolved 243 MG1655/W3110 genome sequence discrepancies. These corrections, together with additional corrections made at the University of Wisconsin (UW), were incorporated into the U00096.2 genome sequence update. The only differences now remaining between the two sequenced E. coli K-12 genomes, strains MG1655 and W3110, are eight point mutations, the rrnE/rrnD inversion, and 13 strain-specific IS element insertion points. The largest single correction was the restoration of a 374 bp deletion that occurred after bp 3192853 in U00096.1. A fifth QUAD repeat sRNA gene, rygE, was missing in U00096.1 and has now been added to EcoGene.

Additional information about this and other E. coli sequencing projects is available from the E. coli Genome Center at the University of Wisconsin, Madison, Wisconsin. The DNA sequence and base pair coordinates are those of version M56 taken from the complete sequence Genbank entry U00096.2. This 4,639,675 bp sequence is used in EcoGene; the genomic address coordinates have been updated from U00096.1 as of EcoGene19.
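As a rough illustration of why coordinates changed between the two versions, the sketch below shifts a U00096.1 position by the one correction whose location and size are given above: the 374 bp restored after bp 3192853. The function name `lift_coordinate` and the `INSERTIONS` table are hypothetical and are not part of EcoGene or Genbank. Because the other sequence corrections are not modeled, the result is only an approximation and should not be used as a genome-wide coordinate conversion.

```python
# Hypothetical liftover table: (last unaffected U00096.1 position, bp restored after it).
# Only this single entry comes from the text; all other corrections made in
# U00096.2 are NOT modeled, so results are approximate at best.
INSERTIONS = [(3192853, 374)]


def lift_coordinate(pos_v1, insertions=INSERTIONS):
    """Map a 1-based U00096.1 position to an approximate U00096.2 position."""
    if pos_v1 < 1:
        raise ValueError("genome coordinates are 1-based")
    # Add the length of every restored segment that lies upstream of pos_v1.
    offset = sum(length for after_pos, length in insertions if pos_v1 > after_pos)
    return pos_v1 + offset


if __name__ == "__main__":
    for pos in (1000, 3192853, 3192854):
        print(pos, "->", lift_coordinate(pos))
```

Positions at or before bp 3192853 are returned unchanged by this sketch, while later positions are shifted by 374 bp. A real conversion would need the full list of corrections between the two records.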

A recent high-throughput resequencing of W3110 confirmed most of the previously reported differences from MG1655, and also found additional sequence differences attributed to further mutations arising in W3110 (Herring and Palsson, 2007).

The CDS intervals and protein sequences in the ftp version of Genbank U00096.2 were previously updated as part of the EcoGene project (Rudd, K.E., 2000). These updates consist largely of new functions and gene names taken from a daily literature survey, the identification of probable pseudogenes, and more than 700 translation start site revisions (ALT_INIT) contributed by EcoGene, over 100 of which are based on N-terminal sequencing data in the Verified Set (an EcoGene data validation and literature scanning project). These EcoGene-originated revisions, along with many other types of contributions from the participants in an NIH-sponsored workshop group of E. coli database biocurators, were consolidated into a consensus snapshot annotation (Riley, M. et al., 2006). Note that the EcoGene-derived translation start site updates are in Genbank U00096.2 and are acknowledged in U00096.2 Reference 7 as "Rudd,K.E., A manual approach to accurate translation start site annotation: an E. coli K-12 case study, unpublished". A few revisions were made to the start sites during the Riley Workshop.

This sample demonstration at Woods Hole (specifically by K. Rudd to G. Plunkett, H. Mori and T. Horiuchi) of EcoGene manual start site reannotation led directly to the acceptance of the entire EcoGene-derived start site revision dataset into U00096.2. It has been a longstanding goal of EcoGene to get the results of almost 10 years of EcoGene manual reannotation efforts into the U00096 Genbank records, and this has now been accomplished, for the benefit of the E. coli community. We hope that these improved start site annotations will now be used to redesign the various genome-wide resources built upon the U00096.1 annotations. Recently (April 2007), EcoGene became the source database for Genbank U00096, which will be updated monthly.

Although the Riley Workshop group decided not to continue updating the snapshot records as a group, its members continue to collaborate and share information in an effort to keep genome annotation consistent and unified among the participating databases. EcoGene is dedicated to a daily annotation updating process, and the current EcoGene website reflects that unified annotation progress. The gene descriptions in EcoGene were inherited from the CGSC database, and their EcoGene-specific updating is ongoing and nearly complete. The EcoGene updates of gene names, functions and phenotypes are based on reading and interpreting original and review E. coli publications. All of the Riley Workshop participants, including EcoGene, became aware of some minor errors and omissions in their datasets as a result of the Woods Hole "snapshot" data reconciliation project, and we are very grateful for that opportunity to improve EcoGene.

Use of any EcoGene sequence dataset should be acknowledged by citing Genbank record U00096.2 and Blattner et al., 1997, in addition to EcoGene (this website, www.ecogene.org) and the citation Rudd, K.E., 2000.