The Impact of Electronic Publishing on the Academic Community
Session 6: Access to scientific data repositories
Electronic databases and the scientific record
The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, U.K.
©Graham Cameron, 1997.
In the discussion of electronic publishing much attention typically focuses on the activities of conventional publishers and the move towards electronic mechanisms in the publishing process. A different viewpoint comes from the role of electronic archives of scientific information, which were originally developed as databases but have become part of the basic scientific record. We now see something of a convergence of two different kinds of activity. Publication is becoming more database-like: in the electronic era the structure of published information and the tool-set to exploit it call for more and more sophisticated information technology. At the same time databases are starting to become, in some senses, more like publications. As the shared repositories of scientific information become part of the underpinning infrastructure of science, they play an increasing role in the scientific record. I want to discuss here the role of the databases in the revolution of the scientific record in the electronic era.
The origin of scientific databases
Scientific databases have various origins, but typically they were never actually conceived as databases. They turned into databases almost by accident. In the first instance people used computers to compute on scientific data, to help them do their sums. Soon, of course, repetitive tasks led to reusable programs, and those programs processed data in 'standard formats'. Scientists working in particular domains developed growing collections of information in the so-called 'standard format' of their programs and began to write more and more software to process data in these formats. Soon the information that was stored this way began to be referred to as a database.
Another thrust came from the kind of experiments enabled by electronic and technological developments. Experimentation which collects huge amounts of data has now become possible, and automatic data-capture tools in many areas of science have resulted in enormous repositories of data which would have been inconceivable without electronic tools. Systems were built to capture and store those data, again typically in some kind of 'standard format', and the resulting pool of information soon came to be referred to as a database. Many of the resulting so-called databases would cause any database designer to throw up their hands in horror, for they were never designed; they were simply an ad hoc method of capturing a lot of scientific information.
The European Bioinformatics Institute
The European Bioinformatics Institute (EBI) grew out of such a data collection effort. Many of you will know that, for a couple of decades, it has been possible to determine DNA sequences --- the exact sequence of the base-pairs of the genetic information in the cells of organisms. Indeed, in recent years methodological advances have turned this into a very large-scale data collection effort.
As long ago as 1980 the European Molecular Biology Laboratory (EMBL) established its Nucleotide Sequence Data Library, with the goal of collecting all such information. This information had hitherto been published in the pages of scientific journals, and the goal of the Data Library was to build a scientific database to incorporate, organize and distribute it.
Time has moved on since then, and biology has become information-intensive in many different areas. Current and emerging methods for determining the sequences of bio-macromolecules (DNA, RNA and protein), their molecular structures and their biological functions have all generated huge amounts of information. In response to this, in December 1992, the EMBL decided to establish the EBI, and this decision was gradually implemented through to 1995, including relocating the operation to the U.K. The EBI has an institutional mission to provide public domain information services for molecular biological and biotechnological research. It incorporates and extends the mandate of the original EMBL Data Library. It is worth noting that the EBI is in fact one of five locations at which EMBL operates, with its headquarters being in Heidelberg, Germany.
The domain of the EBI is largely macromolecular information: information about the large molecules that are important in biological functions. The original motivation for the establishment of the group in Heidelberg, and indeed still the single biggest project, is the Nucleotide Sequence Database. The EBI is now deeply involved in the SWISS-PROT Protein Sequence Database and, in collaboration with a United States group, a protein structure database, as well as various aspects of documenting protein function. The core services of the EBI hinge on databases in these areas, which are made available in various modalities, with World Wide Web access, of course, being the preferred method of using the databases. The core collections are doubling in size in less than two years and the growth is, if anything, accelerating. The databases acquire a new sequence once every minute, 24 hours a day, 365 days a year. They represent information from 20 000 or so organisms, with about 40% of the data being human information.
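The growth figures quoted above can be turned into rough annual numbers. The following is only a back-of-envelope sketch: the function names are illustrative, and the two-year doubling time is used as an upper bound because the text says only that doubling takes less than two years.

```python
import math

MINUTES_PER_DAY = 24 * 60
DAYS_PER_YEAR = 365


def sequences_per_year(per_minute: float = 1.0) -> int:
    """New sequences per year at a steady acquisition rate."""
    return round(per_minute * MINUTES_PER_DAY * DAYS_PER_YEAR)


def annual_growth_factor(doubling_time_years: float) -> float:
    """Factor by which a collection grows each year under steady exponential growth."""
    return 2 ** (1 / doubling_time_years)


def years_to_grow(factor: float, doubling_time_years: float) -> float:
    """Years needed to grow by the given factor at a fixed doubling time."""
    return doubling_time_years * math.log2(factor)


if __name__ == "__main__":
    print(sequences_per_year())                # one sequence a minute, all year
    print(annual_growth_factor(2.0))           # lower bound on yearly growth
    print(years_to_grow(1000, 2.0))            # time to become 1000 times larger
```

At one sequence a minute the collection gains 525 600 sequences a year, and a doubling time of under two years corresponds to growth of more than about 41% a year.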
These activities are conducted in a long-established international collaboration with the groups in the U.S.A. and Japan which ensures that data entering the collection in Japan, the U.S.A. or Europe are exchanged on a daily basis. The user community is broad and diverse with tens of thousands of accesses to the databases each day from areas such as basic biological research, biotechnology, pharmaceutical research, medicine and agriculture, both in the academic and the commercial sectors.
Much of the usage is analogous to the use of bibliographic databases; users simply look up what has been sequenced and what is in the database. What is unique to these databases, or at least different from typical database usage, is the kind of computation that users do on the entire collection. For example, it is commonplace for people who have the sequence of a particular gene to search it against the entire database to look for other sequences that are biologically similar. Indeed, the business of deciding what constitutes biological similarity is a research topic in its own right.
Perhaps the single biggest computational problem studied with the combination of databases available from the EBI is the relationship between sequence, structure and function. DNA sequences code for protein sequences, which in turn are responsible for determining the final three-dimensional structure of the proteins involved. It is that three-dimensional structure which is in large part responsible for the function of the proteins. However, whilst it is accepted that sequence and structure are deterministically linked, their relationship is still poorly understood and an enormous amount of research is carried out in investigating that relationship.
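To make the idea of searching one sequence against a whole collection concrete, here is a deliberately naive sketch. It scores similarity by the overlap of short subsequences (k-mers), which is an assumption made purely for illustration: it is not the measure used by the EBI or by any production search tool, and, as noted above, what counts as biological similarity is itself a research question. The entry names and sequences are invented.

```python
from collections import Counter
from typing import Dict, List, Tuple


def kmers(seq: str, k: int = 3) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def similarity(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by total k-mers (a multiset Jaccard index)."""
    ka, kb = kmers(a, k), kmers(b, k)
    shared = sum((ka & kb).values())   # multiset intersection
    total = sum((ka | kb).values())    # multiset union
    return shared / total if total else 0.0


def search(query: str, database: Dict[str, str],
           k: int = 3, top: int = 3) -> List[Tuple[str, float]]:
    """Score the query against every entry and return the best matches."""
    scored = [(similarity(query, seq, k), name) for name, seq in database.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:top]]


if __name__ == "__main__":
    toy_database = {                    # invented entries, not real data
        "entry_a": "ACGTACGT",
        "entry_b": "TTTTTTTT",
        "entry_c": "ACGTACGA",
    }
    for name, score in search("ACGTACGT", toy_database):
        print(name, round(score, 3))
```

The point of the sketch is the access pattern rather than the scoring: every query touches every entry, which is what distinguishes this usage from a simple lookup.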
Another important use of the database is in the area of molecular evolution. The understanding of the relationship between different organisms in the evolutionary process is now best studied by looking at the differences between the DNA sequences of those organisms.
Support of information providers
In recognition of the importance of such information sources, the EBI is a publicly funded information provider (as indeed are our analogous organizations in the U.S.A. and Japan). There are also commercial suppliers in the same domain. The existence of organizations like the EBI reflects the acceptance that the understanding of living systems is utterly dependent on this shared pool of information. Indeed, society in general depends on many such information sources, and it is my belief that if we regard them as so crucially important, greater attention should be given to the overall principles of information provision.
Publications and databases
Let us compare for a moment conventional publication to the information in databases and collections such as those we maintain at the EBI. My view of the goals of conventional publication is, I think, consistent with that of other contributors to this volume. Its goals are (at least): (i) communication; (ii) creation of a scientific archive; (iii) creation of a citable record; (iv) establishment of scientific priority; and (v) ensuring appropriate credit for work.
Databases like ours were established with similar but not identical motivation. The goal of communicating scientific findings was clear, but their founders also wished to enable scientists to compute on that information and to re-use scientific data for purposes other than those for which they had originally been gathered. In many cases an explicit goal was to provide information ancillary to conventional publications. However, more recently, databases have begun to stray into what might be seen as conventional publishing territory. In molecular biology: (i) people now cite databases; (ii) they have come to be seen as a part of the archival record of science; (iii) patent lawyers have started to use them to establish scientific priority; and (iv) some United States funding agencies explicitly give scientific credit on the basis of 'database submissions'.
The role of the databases by comparison with the traditional scientific record is worryingly ill-defined. The traditional scientific record is, at its best: (i) high quality; (ii) permanent; (iii) citable; and (iv) accessible.
There are procedures and practices which have evolved over hundreds of years to ensure that this is the case. Databases raise new issues: (i) databases can be updated --- it is hard to determine where the definitive version of a particular piece of information is; (ii) the history of information in the database, what was included and when, is often hard to determine; (iii) often they are made available by network sources that may come and go with the enthusiasm of the group that supports them; and (iv) the procedures of quality control are typically quite ill-defined.
Indeed, there is a difference of motivation between the archive of the scientific record in traditional publication and that of a database. The traditional scientific archive values permanence, citability and immutability. All these are seen as necessary to enable us to trace scientific activity. Databases often have a completely different motivation. The goal is to present 'today's best bet' at what is the scientific truth. They correct errors when they are discovered, they delete wrong or superseded information, and they add new information as it becomes available. This is all designed to make them as useful as possible, but it makes them difficult to cite, lacking in permanence and hard to trace.
An easy conclusion would be that we should simply archive all versions of everything that ever appeared in the database. Sadly, it is not as straightforward as it seems. Databases are subject to so many updates that this is typically unfeasible. Our nucleotide sequence database, for example, changes tens of thousands of times every single day. Also, databases in today's networked or 'webbed' world don't stand alone. They often refer to external electronic authorities. For example, to get nomenclature for legumes we might go to ILDIS (the International Legume Database and Information Service), maintained at Southampton University. Often the most up-to-date information can nowadays be fetched 'on the fly' as the user accesses your database.
Databases are dynamic. Users are interacting with them while they are changing, and, even if you are logging all the changes to your own information, you may be using external resources whose changes you cannot detect.
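The 'archive every version' idea discussed above can be sketched as an append-only history in which each update yields a new, separately citable version. This is a hypothetical illustration only: the class names, the accession.version citation form and the accession "X00001" are assumptions, not a description of how any EBI database works. The sketch also assumes updates arrive in date order, and it shows the limitation just described, since it records nothing about external resources consulted 'on the fly'.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    number: int
    released: date
    content: str


@dataclass
class Archive:
    """Append-only store: old versions are never overwritten or deleted."""
    _history: Dict[str, List[Version]] = field(default_factory=dict)

    def update(self, accession: str, content: str, released: date) -> str:
        """Record a new version and return a citation that pins it."""
        versions = self._history.setdefault(accession, [])
        versions.append(Version(len(versions) + 1, released, content))
        return f"{accession}.{len(versions)}"

    def current(self, accession: str) -> Version:
        """Today's best bet: the latest version of the entry."""
        return self._history[accession][-1]

    def cited(self, citation: str) -> Version:
        """Resolve a citation to exactly the version that was cited."""
        accession, _, number = citation.rpartition(".")
        return self._history[accession][int(number) - 1]

    def as_of(self, accession: str, day: date) -> Optional[Version]:
        """The version a user would have seen on a given day, if any."""
        seen = [v for v in self._history.get(accession, []) if v.released <= day]
        return seen[-1] if seen else None


if __name__ == "__main__":
    archive = Archive()
    first = archive.update("X00001", "ACGT", date(1996, 1, 1))
    archive.update("X00001", "ACGA", date(1997, 1, 1))
    print(first, archive.cited(first).content, archive.current("X00001").content)
```

Storing every version costs space in proportion to the update rate, which is why tens of thousands of changes a day make the simple approach hard to sustain.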
Another problem, pronounced in the biosequence databases, is what can be referred to as derived or secondary knowledge. As the databases become tools in our research, the meaning of new information in the database is often determined by analogy to existing information. When conclusions determined by analogy are added back to the pool of information the mixture can get us into trouble. It may then be used to build new analogies, thus creating a spurious impression of a large knowledge base, when in fact the raw knowledge is very sparse indeed. I often comment that if you feed the databases on their own offal you end up with electronic BSE.
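One way to picture the problem of derived knowledge is to record, for each annotation, whether it rests on direct evidence or was inferred by analogy with another entry. The sketch below is a hypothetical illustration: the mapping format, function names and entry names are assumptions and do not describe any existing database schema.

```python
from typing import Dict, Optional

# Maps each entry to the entry its annotation was copied from by analogy,
# or to None when the annotation rests on direct experimental evidence.
Provenance = Dict[str, Optional[str]]


def inference_depth(entry: str, inferred_from: Provenance) -> int:
    """Number of analogy steps between an entry and experimental evidence."""
    depth = 0
    seen = {entry}
    source = inferred_from[entry]
    while source is not None:
        if source in seen:
            # The chain loops back on itself and never reaches an experiment.
            raise ValueError(f"circular inference involving {source}")
        seen.add(source)
        depth += 1
        source = inferred_from[source]
    return depth


def experimental_fraction(inferred_from: Provenance) -> float:
    """Share of entries whose annotation is backed by direct evidence."""
    if not inferred_from:
        return 0.0
    direct = sum(1 for source in inferred_from.values() if source is None)
    return direct / len(inferred_from)


if __name__ == "__main__":
    toy = {"A": None, "B": "A", "C": "B", "D": "C"}   # invented entries
    print(inference_depth("D", toy), experimental_fraction(toy))
```

In the toy data four entries appear to share an annotation, yet only one of them rests on an experiment; the rest are analogies built on analogies.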
In terms of trying to ensure the robustness of the scientific record, the anarchy of today's networks creates rather than solves problems. Anyone can mount a Web site. Sources come and go, and it is hard to determine what is behind a home page, whether it will be permanent and what its quality is. Even if we can determine quality, it is nearly impossible to locate resources of interest among all the useless information.
Sadly, I am afraid this discussion raises more problems than it solves. In facing up to the electronic era, I argue that the undisciplined use of electronic media will be at least as damaging as the undisciplined use of conventional publication. However, I do believe that the optimal exploitation of electronic media will create new opportunities, which will not diminish the cost of information provision but can enormously enhance the utility gained by exploiting that information. Recycling of data, using it for purposes other than those for which it was gathered, and data-mining to find new patterns of information can all yield novel insights. I feel that we can only capitalize on the opportunity offered by the electronic era if we behave in a disciplined manner. The issues are not technical; they are rather those of conventions, protocol and the establishment of good practice in dealing with electronic information.
I am also convinced that the concerns of publishers about the electronic medium destroying their market are unrealistic. The economic activity in electronic information provision will surely be as great as that in conventional publishing, but it will require that the players are prepared to re-tool to deal with the new medium, technology and mind-set. In this future, commercial organizations such as publishers, alongside publicly funded organizations such as the EBI, have a major and exciting role to play.
Portland Press Ltd.