

 

SECTION IV

 

Build a "Next Generation"

National Biological Information Infrastructure

 

"... more and more people realize that information is a treasure that must be shared to be valuable...our Administration will soon propose using ... technology to create a global network of environmental information."

Albert Gore, Jr., 21 March 1994

 

"With all of everyone's work online, we will have the opportunity ... to let everyone use everyone else's intellectual effort. ... The challenge for librarians and computer scientists is to let us find the information we want in other people's work..."

Michael E. Lesk (1997), http://community.bellcore.com/lesk/ksg97/ksg.html

 

The economic prosperity and, indeed, the fate of human societies are inextricably linked to the natural world. Because of this, information about biodiversity and ecosystems is vital to a wide range of scientific, educational, commercial, and governmental uses. Unfortunately, most of this information exists in forms that are not easily used. From traditional, paper-based libraries to scattered databases and physical specimens preserved in natural history collections throughout the world, our record of biodiversity and ecosystem resources is uncoordinated and isolated. It is, de facto, inaccessible. There exists no comprehensive technological or organizational framework that allows this information to be readily accessed or used effectively by scientists, resource managers, policy makers, or other potential client communities. "We have ... vast mountains of data that never enter a single human mind as a thought. ... Perhaps this sort of data should be called 'exformation' instead of information ... " (Albert Gore, Jr. [1993], Earth in the Balance, pp. 200-201).

However, significant increases in computation and communications capabilities in recent years have opened up previously unimagined possibilities in the field of information technology, and these trends will continue for the foreseeable future. It is clear that abundant, easily accessible, analyzed and synthesized information that can and does "enter the human mind as a thought" will be essential for managing our biodiversity and ecosystem resources. Thus, research and development is needed in order to harness new information technologies that can help turn ecological "exformation" into "information."

We need computer science, library and information science, and communications technology research (hereafter abbreviated as CS/IT) to produce mechanisms that can, for example, efficiently search through terabytes of Mission to Planet Earth satellite data and other biodiversity and ecosystems datasets, make correlations among data from disparate sources, compile those data in new ways, analyze and synthesize them, and present the resulting information in an understandable and usable manner. At present, we are far from being able to perform these actions on any but the most minor scale. However, the technology exists to make very rapid progress in these areas, if the attention of the CS/IT community is focused on the biodiversity and ecosystems information domain.

 

Focus research on biodiversity and ecosystems information to promote use of that information in management decisions, in education and research, and by the public.

 

Knowledge about biodiversity and ecosystems, even though incomplete, is a vast and complex information domain. The complexity arises from two sources. The first of these is the underlying biological complexity of the organisms and ecosystems themselves. There are millions of species, each of which is highly variable across individual organisms and populations. These species each have complex chemistries, physiologies, developmental cycles and behaviors, all resulting from more than three billion years of evolution. There are hundreds if not thousands of ecosystems, each comprising complex interactions among large numbers of species, and between those species and multiple abiotic factors.

The second source of complexity in biodiversity and ecosystems information is sociologically generated. The sociological complexity includes problems of communication and coordination—between agencies, between divergent interests, and across groups of people from different regions, different backgrounds (academia, industry, government), and different views and requirements. The kinds of data humans have collected about organisms and their relationships vary in precision, accuracy, and in numerous other ways. Biodiversity data types include not only text and numerical measurements, but also images, sound, and video. The range of other data types with which scientists and other users will want to mesh their biodiversity databases is also very broad: geographical, meteorological, geological, chemical, physical, etc. Further, the manner and mechanisms that have been employed in biodiversity data collection and storage are almost as varied as the natural world the datasets document. Therefore, analysis of the work practices involved in building these datasets is one among several CS/IT research priorities.

All this variability constitutes a unique set of challenges to information management. These challenges greatly exceed those of managing gene or protein sequence data (and that domain is challenging in its own right). In addition to the complexity of the data, the sheer mass of data accumulated by satellite imagery of the Earth (terabytes per year are captured by Landsat alone) presents additional information management challenges. These challenges must be met so that we can exploit what we do know, and expand that knowledge in appropriate and planned directions through research, to increase our ability to live sustainably in this biological world.

Various research activities are being conducted that are increasing our ability to manage biological information:

The Human Genome Project is spawning not only new medical therapies but also developments in CS/IT.

Geographic Information Systems (GIS) are expanding the ability of some agencies to conduct their activities more responsibly and making it possible for industry to choose sites for new installations more intelligently.

The National Spatial Data Infrastructure has contributed to progress in dealing with geographic, geological, and satellite datasets.

Research conducted as part of the Digital Libraries projects has begun to benefit certain information domains.

The High-Performance Computing and Communications initiative has greatly benefited certain computation-intensive engineering and science areas.

All of science has benefited from the Internet; those benefits will increase with the development of the "next generation" Internet, or Internet-2.

But, to date, there has been insufficient attention paid to the need for CS/IT research on biodiversity and ecosystems information.

Given the importance of and need for biodiversity and ecosystems data to be turned into information so that it can be comprehended and applied to achieve a sustainable future, this Panel recommends that the attention of a number of governmental research and research funding activities be directed toward the special needs of biodiversity and ecosystems data:

The Federal Geographic Data Committee should immediately include biodiversity and ecosystems data in its work to produce standard descriptors for Federal geospatial data.

The Digital Libraries Initiative of the NSF, DARPA, and NASA should call for research specifically focused on the biodiversity and ecosystems information domain in all future Requests for Proposals. Current Digital Libraries projects are working on some of the techniques needed (automatic indexing, sophisticated mapping, brokering routines, etc.), but the developments are not focused on biodiversity and ecosystems information, which has its own unique characteristics.

The Knowledge and Distributed Intelligence and the Life in Earth's Environment initiatives of the NSF should call for CS/IT and appropriate associated biological and sociological research specifically focused on the biodiversity and ecosystems information domain in all future Requests for Proposals.

The NSTC Committee on Technology should focus on the biodiversity and ecosystems information domain within a number of its stated R&D areas, particularly: 1) addressing problems of greater complexity (in the High End Computing and Computation Program Area); 2) advanced network architectures for disseminating environmental data (Large Scale Networking Program Area); 3) extraction and analytical tools for correlating and manipulating distributed information, advanced group authoring tools, and scalable infrastructures to enable collaborative environments (Human Centered Systems Program Area); and 4) graduate and postdoctoral training and R&D grants (Education, Training and Human Resources Program Area).

The biodiversity and ecosystems information domain is not at present as amenable to correlation, analysis, synthesis, and presentation across networks as are other domains because of the problems of complexity pointed out above and because the CS/IT community has, to date, more or less ignored these sorts of data and the associated challenges. A concerted research effort by government, business, and academia is needed, and needed soon, so that the masses of data and information that are stored in the museums, libraries, and government agencies of this country, and that are generated daily by Mission to Planet Earth and other activities, can be put to good use.

The problem of excess data will get steadily worse if means are not devised to analyze and synthesize those data quickly and effectively to turn them into usable and useful information that can be brought to bear in decision-making, policy formulation, directing future research, and so on. Computers were invented to assist humans in tedious computational tasks, which the conversion of satellite data into useful information surely is. One reason that we have unused data is that we continue to collect data while we still lack efficient means to convert them into comprehensible information. What person could be expected to absorb and "understand" terabytes of satellite data by brainpower alone, without the assistance of computers? The CS/IT research endeavor advocated here will reap great rewards by inventing better means to make the conversion from data to useful information. Much of the talent needed for this work is employed in the private sector, and so public-private partnerships that involve software and hardware designers and biologists will be needed to accomplish the task.

The investments that have been made in acquiring data are large ($1 billion per year on Mission to Planet Earth is only one example). The full potential of those investments will not be realized if new tools for putting the data to use are not devised. Unused data are not worth the initial investment made in gathering them. Failure to develop the technologies to manipulate, combine, and correlate the biodiversity and ecosystems data we have available from all sources will have adverse effects on our ability to predict and prevent degradation of our natural capital.

Federal computing, information, and communications programs invest in critical, long-term R&D that advances computing, information, and communications in the United States. These investments to date have enabled government agencies to fulfill their missions more effectively and better understand and manage the physical environment. They have also contributed much to US economic competitiveness. It is our contention that future investments by the government's computing, information, and communications programs that are overseen by the NSTC Committee on Technology should be concentrated in the area of biodiversity and ecosystems information. As has happened in other areas, this Federal investment will enable agencies to manage the biological environment in better ways, and will very likely spin off new technologies that can be exploited by the private sector to benefit the US economy.

The first of these investments should be made in the next round of competition for research awards. Progress in the development of the needed technologies can be measured by increases in the ability of agencies to utilize data they already have or are now collecting, in the creation of private sector jobs and businesses that are directly related to biodiversity and ecosystems information management, and in research that is more clearly focused because proper data management has illuminated both what is already known and what remains to be discovered.

 

Design and construct the "next generation"

National Biological Information Infrastructure (NBII-2).

 

The CS/IT research described above will contribute to progress in managing biodiversity and ecosystems information. The productivity of individual research groups, driven by their own curiosity, ingenuity, and creativity has served this country well in myriad fields of science and the development of technology. Yet, there are important issues in the management and processing of biodiversity and ecosystems information that must be addressed in a much more coordinated and concerted way than has been attempted to date.

The value of raw data is typically predicated on our ability to extract higher-order understanding from those data. Traditionally, humans have done the task of analysis: one or more analysts become familiar with the data and with the help of statistical or other techniques provide summaries and generate results. Scientists, in effect, generate the "correct" queries to ask and even act as sophisticated query processors. Such an approach, however, rapidly breaks down as the volume and dimensionality (depth and complexity) of the data increase. What person could be expected to "understand" millions of cases, each having hundreds of attributes? This is the same question asked about satellite data above—human brainpower requires sophisticated assistance from computers to complete these sorts of tasks. The current National Biological Information Infrastructure (NBII) is in its infancy, and cannot provide the sophisticated services that will enable the simultaneous querying and analysis of multiple, huge datasets. Yet, it will become more and more necessary to manipulate data in this way as good stewardship of biodiversity and ecosystems grows increasingly important.

The overarching goal of the "next generation" National Biological Information Infrastructure, or NBII-2, would be to become, in effect, a fully digitally accessible, distributed, interactive research library system. The NBII-2 would provide an organizing framework from which scientists could extract useful information—new knowledge—from the aggregate mass of information generated by various data-gathering activities. It would do this by harnessing the power of computers to do the underlying queries, correlation, and other processing activities that at present require a human mind. It would make analysis and synthesis of vast amounts of data from multiple datasets much more accessible to a variety of users. It would also serve management and policy, education, recreation, and the needs of industry by presenting data to each user in a manner tailored to that user's needs and skill level.

We envision the NBII-2 as a distributed facility that would be something considerably different from a "data center," something considerably more functional than a traditional library, something considerably more encompassing than a typical research institute. It would be all of these things, and at the same time none of them. Unlike a data center, the objective would not be the collection of all datasets on a given topic into one storage facility, but rather the automatic discovery, indexing, and linking of those datasets. Unlike a traditional library, which stores and preserves information in its original form, this special library would not only keep the original form but also update the form of storage and upgrade information content. Unlike a typical research institute, this facility would provide services to research going on elsewhere; its own staff would conduct both CS/IT and biodiversity and ecosystems research; and the facility would offer "library" storage and access to diverse constituencies.

The core of the NBII-2 would be a "research library system" that would comprise five regional nodes, sited at appropriate institutions (national laboratories, universities, etc.) and connected to each other and to the nearest telecommunications providers by the highest-bandwidth network available. In addition, the NBII-2 would comprise every desktop PC or minicomputer that stores and serves biodiversity and ecosystems data via the Internet. The providers of information would have complete control over their own data, but at the same time have the opportunity to benefit from (and the right to refuse) the data indexing, cleansing, and long-term storage services of the system as a whole.
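To make this opt-in relationship concrete, the following is a minimal sketch, in Python, of how a provider's registration with an NBII-2 node might be represented. The record structure, field names, and example values are illustrative assumptions, not a proposed design.

```python
from dataclasses import dataclass

@dataclass
class DatasetRegistration:
    """Hypothetical record a data provider files with an NBII-2 node.

    The provider retains full control of the data; each flag records
    whether the provider has opted in to a system-wide service.
    """
    dataset_id: str                # provider-assigned identifier
    provider_url: str              # where the data are actually served
    allow_indexing: bool = True    # include in system-wide discovery index
    allow_cleansing: bool = False  # permit automated error-correction passes
    allow_archiving: bool = False  # permit long-term storage and mirroring

registry = [
    DatasetRegistration(
        "chenopodium-survey",
        "http://example.edu/herbarium/chenopodium",
        allow_archiving=True,
    ),
]

# A node would index only those datasets whose providers have opted in.
indexable = [r for r in registry if r.allow_indexing]
print([r.dataset_id for r in indexable])
```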

 

The NBII-2 would be:

• A framework to support knowledge discovery for the nation's biodiversity and ecosystems enterprise, involving many client and potential-client groups;

• A common focus for independent research efforts, and a global, neutral context for sharing information among those efforts;

• An accrete-only, no-delete facility from which all information would be available online, twenty-four hours a day, seven days a week, in a variety of formats appropriate to a given user;

• A facility that would serve the needs of (and eventually be supported by partnership among) government, the private sector, education, and individuals;

• An organized framework for collaboration among Federal, regional, state, and local organizations in the public and private sectors that would provide improved programmatic efficiencies and economies of scale through better coordination of efforts;

• A commodity-based infrastructure that utilizes readily available, off-the-shelf hardware and software and the research outputs of the Digital Libraries initiative where possible;

• An electronic facility where scientists could "publish" biodiversity and ecosystems information for cataloging, automatic indexing, access, analysis, and dissemination;

• A place where intensive work is conducted on how people use these large databases, and how they might better use them, including improvement of interface design (human-computer interaction);

• A mechanism for development of organizational and educational infrastructure that will support sharing, use, and coordination of these massive data sets;

• A mirroring and/or backup facility that would provide content storage resources, registration of datasets, and "curation" of datasets (including migration, cleansing, indexing, etc.);

• An applied biodiversity and ecosystems informatics research facility that would develop new technologies and offer training in informatics;

• A facility that would provide high-end computation and communications to researchers at diverse institutions.

To be effective, the NBII-2 that we propose must be a system designed more for information users than for data providers, although the system would supply services to the latter as well. Research is necessary to better characterize the needs and requirements of different classes of users of digital library systems, and to gain insight into how to adapt systems to specific user needs and behaviors. The linkage of personal and work-group information management systems to a digital library system is an issue of particular importance. A great deal of design research is needed to construct the system, which must be a constantly evolving entity.

This facility would not be a purely technical and technological construct, but rather would also encompass complex sociological, legal, and economic issues in its research purview. These might include intellectual property rights management, public access to the scholarly and cultural record, and the characteristics of evolving systems of scholarly and mass communications in the networked information environment. The human dimensions of the interaction with computers, networks, and information will be a particularly important area of research as systems are designed for the greatest flexibility and usefulness to people.

The needs that the research nodes of the NBII-2 must address are many. A small subset of those needs includes:

New statistical pattern recognition and modeling techniques that can work with high-dimensional, large-volume data;

Workable data-cleaning methods that automatically correct input and other types of errors in databases;

Strategies for sampling and selecting data;

Algorithms for classification, clustering, dependency analysis, and change and deviation detection that scale to large databases;

Visualization techniques that scale to large and multiple databases;

Metadata encoding routines that will make data mining meaningful when multiple, distributed sources are searched (a sketch of such a descriptor appears after this list);

Methods for improving connectivity of databases, integrating data mining tools, and developing ever better synthetic technologies;

Ongoing, formative evaluation, detailed user studies, and quick feedback between domain experts, users, developers and researchers.
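As one illustration of the metadata need noted above, the sketch below shows a hypothetical descriptor that could accompany a dataset so that mining tools can decide whether fields from two distributed sources may safely be combined. The schema and the compatibility check are assumptions for illustration, not a proposed standard.

```python
# Hypothetical metadata descriptors; the schema is illustrative only.
survey_a = {
    "title": "Breeding bird survey, upper Midwest",
    "temporal_coverage": {"start": "1980-01-01", "end": "1996-12-31"},
    "spatial_coverage": {"bbox": [-97.5, 41.0, -82.0, 49.5]},  # W, S, E, N
    "fields": [
        {"name": "species", "type": "text", "vocabulary": "scientific binomial"},
        {"name": "count", "type": "integer", "units": "individuals"},
    ],
}

survey_b = {
    "title": "Wetland bird counts, Gulf Coast",
    "fields": [
        {"name": "species", "type": "text", "vocabulary": "scientific binomial"},
        {"name": "count", "type": "integer", "units": "flocks"},
    ],
}

def mergeable(a: dict, b: dict, field_name: str) -> bool:
    """True if two sources describe a field compatibly enough to combine."""
    fa = next(f for f in a["fields"] if f["name"] == field_name)
    fb = next(f for f in b["fields"] if f["name"] == field_name)
    return fa["type"] == fb["type"] and fa.get("units") == fb.get("units")

print(mergeable(survey_a, survey_b, "species"))  # True
print(mergeable(survey_a, survey_b, "count"))    # False: different units
```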

In order to comprehend and utilize our biodiversity and ecosystem resources, we must learn how to exploit massive data sets, learn how to store and access them for analytic purposes, and develop methods to cope with growth and change in data. The NBII-2 that we recommend here will be an enabling framework that could unlock the knowledge and economic power lying dormant in the masses of biodiversity and ecosystems data that we have on hand.

 

Box 8: Why do we need an NBII-2?

Biodiversity is complex; ecosystems are complex. The questions we need to ask in order to manage and conserve biodiversity and ecosystems therefore require answers composed of information from many sources. As described in the text, our current ability to combine data from many sources is not very good, or very rapid—a human being usually has to perform the tasks of correlation, analysis, and synthesis of data drawn painstakingly from individual datasets, one at a time. The NBII of today only has the capability to point a user toward single data sets, one at a time, that might (or might not) contain data that are relevant to the user's question. If the dataset does appear useful, the human must construct a query in a manner structured by the requirements of the particular application that manages the dataset (which, as likely as not, is somewhat arcane). The human must then collate results of this query with those of other queries (which may be very difficult because of differences in structures among datasets), perform the analyses, and prepare the results for presentation. What we need is an organizing framework that will allow that same human being to construct a query in everyday language, and automatically obtain exactly the information needed from all datasets available on the Internet. These data would be automatically filtered, tested for quality, and presented in correlated, combined, and analyzed form, ready for the human mind to perform only higher-order interpretation. With tools such as these, we will begin to be able to "mine" the information we already have to generate new insights and understanding. At present, the task of "data mining" in the biodiversity and ecosystems information domain is so tedious as to be unrewarding, despite our very great need for the insights it has the potential to yield.

 

Box 9: Why do we need an NBII-2? Scenario 1

An agricultural researcher has just isolated and characterized a gene in a species of Chenopodium that enables the plant to tolerate high-salt soil. To find out about other characteristics of the habitat within which that gene evolved, the researcher uses NBII-2 to link to physical data on the habitat (temperature and rainfall regimes, range of soil salinity, acidity, texture and other characteristics, elevation and degree of slope and exposure to sunlight, etc.), biological information about other plants with which this Chenopodium occurs in nature, data on animals that are associated with it, and its phylogenetic relationship to other species of Chenopodium, about which the same details are gathered. Linkages among these ecological and systematic databases and between them and others that contain gene sequence information enable the researcher to determine that the gene she has isolated confers tolerance of a wider range of environmental variables than do its equivalents in other species that have been tested (although this analysis also points out additional species that it would be worthwhile to test). The gene from this species is selected as a primary candidate for insertion by transgenic techniques into forage and browse plants to generate strains that will tolerate high-salt soils in regions that currently support sheep and cattle but which are becoming more and more arid (and their soils saltier) because of global climate change.

 

Box 10: Why do we need an NBII-2? Scenario 2

On an inspection of a watershed area, a resource manager finds larval fish of a type with which he is unfamiliar. Returning to the office, the manager accesses an online fish-identification program. Quickly finding that there are several alien species represented in the sample he took, he then obtains information on the native ranges of these species, their life history characteristics, reproductive requirements and rates, physiological tolerances, ecological preferences, and natural predators and parasites from databases held by natural history museums around the world. He is able to ascertain that only one of the alien species is likely to survive and spread in this particular watershed. Online, he is also able to access data sets that describe measures taken against this species in other resource management areas, and the results of those measures. By asking the system to correlate and combine data on the environmental characteristics of the fish's native range that have been measured by satellite passes for the past 20 years, as well as the environmental characteristics of the other areas into which it had been introduced, he is able to ascertain which of the management strategies is most likely to work in the situation he faces. Not only does the manager obtain this information in a single afternoon, but he is able to put the results to work immediately, before populations of the invading fish species can grow out of control. The form and results of the manager's queries are also stored to enable an even faster response time when the same or a related species is discovered in another watershed.

 

Box 11: Why do we need an NBII-2? Scenario 3

A community is in conflict over selection of areas for preservation as wild lands in the face of intense pressures for development. The areas available have differing characteristics and differing sets of endangered species that they support. The NBII-2 is used to access information about each area that includes vegetation types, spatial area required to support the species that occur there, optimal habitat for the most endangered species, and the physical parameters of the habitats in each of the areas. In addition, information on the characteristics and needs of each of the species is drawn from natural history museums around the world. Maps of the area are downloaded from the US Geological Survey, and other geographic information data layers are obtained from an archive across the country. Also, the NBII-2 even provides access to software developed in other countries specifically for the purpose of analyzing these multiple data types. The analyses conducted on these datasets using this software provide visually understandable maps of the areas that, if preserved, would conserve the greatest biodiversity, and of those areas that would be less desirable as preserves. Conservation biologists then make information-based predictions about success of species maintenance given differing decisions. On the basis of the sound scientific information and analysis delivered by the NBII-2, the conflict is resolved and the community enjoys the benefits of being stewards of natural capital as well as the benefits of economic growth.

 

If all the species of the world were discovered, cataloged, and described in books at one page per species, those books would take up nearly six kilometers of shelving. This is about what you would find in a medium-size public library. The total volume of biodiversity and ecosystems data that exist in this country has not been calculated, probably because those data are so extensive as to be extremely difficult to measure.

Of course, the complete record of biodiversity and ecosystems is orders of magnitude greater than this and exists in media types far more complex than paper. Biodiversity and ecosystem information exists in scores of institutional and individual databases and in hundreds of laboratory and personal field journals scattered throughout the country. In addition, the use of satellite data, spatial information, geographic information, simulation, and visualization techniques is proliferating (NASA currently holds at least 36 terabytes of data that are directly relevant to biodiversity and ecosystems), along with an increasing use of two- and three-dimensional images, full-motion video, and sound.

The natural history museums of this country contain at least 750 million specimens that comprise a 150- to 200-year historical record of biodiversity. Some of the information associated with these collections has been translated into electronic form, but most remains to be captured digitally. There are many datasets that have been digitized, but are in outdated formats that need to be ported into newer systems. There are also datasets that are accessible but of questionable, or at least undescribed, quality. There are researchers generating valuable data who do not know how to make those data available to a wide variety of users. And data once available online can still be lost to the community when their originator dies or retires (our society has yet to create a system that will keep data alive and usable once the originator is no longer able to do so). For these reasons, we lose the results of a great deal of biodiversity and ecological research that more than likely cannot be repeated.

Potentially useful and critically important information abounds, but it is virtually impossible to use it in practical ways. The sheer quantity and diversity of information require an organizing framework on a national scale. This national framework must also contribute to the Global Information Infrastructure, by making possible the full and open sharing of information among nations.

The term "data mining" has been used in the database community to describe large-scale, synthetic activities that attempt to derive new knowledge from old information. In fact, data mining is only part of a larger process of knowledge discovery that includes the large-scale, interactive storage of information (known by the unintentionally uninspiring term "data warehousing"), cataloging, cleaning, preprocessing, transformation and reduction of data, as well as the generation and use of models, evaluation and interpretation, and finally consolidation and use of the newly extracted knowledge. Data mining is only one step in an iterative and interactive process that will become ever more critical if we are to derive full benefit from our biodiversity and ecosystems resources.

New approaches, techniques, and solutions must be developed in order to translate data from outmoded media into usable formats, and to enable the analysis of large biodiversity and ecosystems databases. Faced with massive datasets, traditional approaches in database management, statistics, pattern recognition, and visualization collapse. For example, a statistical analysis package assumes that all the data to be analyzed can be loaded into memory and then manipulated. What happens when the dataset does not fit into main memory? What happens if the database is on a remote server and will never permit a naive scan of the data? What happens if queries for stratified samples are impossible because data fields in the database being accessed are not indexed so the appropriate data can be located? What if the database is structured with only sparse relations among tables, or if the dataset can only be accessed through a hierarchical set of fields?
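To make the memory problem concrete: summary statistics can be computed incrementally over chunks of a dataset that never fits in main memory at once. The sketch below uses Welford's online algorithm; the chunked file reader is an assumption about how a too-large or remote source might expose its records.

```python
import math

def chunks(path, size=10_000):
    """Yield successive blocks of numeric values from a large file.

    Stands in for however a remote or too-large-for-memory source
    actually serves its records.
    """
    with open(path) as f:
        block = []
        for line in f:
            block.append(float(line))
            if len(block) == size:
                yield block
                block = []
        if block:
            yield block

# Welford's online algorithm: mean and variance in one pass, O(1) memory.
n, mean, m2 = 0, 0.0, 0.0
for block in chunks("measurements.txt"):   # hypothetical input file
    for x in block:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)

std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
print(f"n={n}  mean={mean:.4f}  std={std:.4f}")
```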

Furthermore, problems often are not restricted to issues of scalability of storage or access. For example, what if a user of a large data repository does not know how to specify the desired query? It is not clear that a Structured Query Language statement (or even a program) can be written to retrieve the information needed to answer such a query as "show me the list of gene sequences for which voucher specimens exist in natural history collections and for which we also know the physiology and ecological associates of those species." Many of the interesting questions that users of biodiversity and ecosystems information would like to ask are of this type; the data needed to answer them must come from multiple sources that will be inherently different in structure. Software applications that provide more natural interfaces between humans and databases than are currently available are also needed. For example, data mining algorithms could be devised that "learn" by matching user-constructed models so that the algorithm would identify and retrieve database records by matching a model rather than a structured query. This would eliminate the current requirement that the user adapt to the machine's needs rather than the other way around.
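A toy sketch of this model-matching idea follows: the user supplies an exemplar record rather than a structured query, and candidate records are ranked by how well they fit it. The record format and the naive scoring rule are assumptions for illustration; a real system would learn attribute weights from the user's model.

```python
# The user supplies an exemplar instead of a structured query; records
# are ranked by similarity to it rather than retrieved by exact match.

exemplar = {"habitat": "riparian", "diet": "insectivore", "status": "alien"}

records = [
    {"habitat": "riparian", "diet": "insectivore", "status": "native"},
    {"habitat": "riparian", "diet": "insectivore", "status": "alien"},
    {"habitat": "pelagic",  "diet": "piscivore",   "status": "alien"},
]

def score(record, model):
    """Fraction of the model's attributes that the record matches."""
    hits = sum(1 for k, v in model.items() if record.get(k) == v)
    return hits / len(model)

ranked = sorted(records, key=lambda r: score(r, exemplar), reverse=True)
for r in ranked:
    print(f"{score(r, exemplar):.2f}  {r}")
```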

 

The major research and infrastructure requirements of the digitally accessible, distributed, interactive research library system are several:

• Networking:

The library will of necessity place extensive and challenging demands on network hardware infrastructure services, as well as those services relating to authentication, integrity, and security, including determining characteristics and rights associated with users. We need both a fuller implementation of current technologies—such as digital signatures and public-key infrastructure for managing cryptographic key distribution—and a consideration of tools and services in a broader context related to library use. For example, the library system may have to identify whether a user is a member of an organization that has some set of access rights to an information resource. As a national and international enterprise that serves a very large range of users, the library must be designed to detect and adapt to variable degrees of connectivity of individual resources that are accessible through networks.
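As an illustration of the digital-signature piece of this requirement, the sketch below signs a dataset with a provider's private key so that any consumer holding the published public key can verify its integrity and origin. It uses a present-day third-party Python package ("cryptography") purely for illustration; the choice of algorithm and tooling is an assumption, not part of this design.

```python
# Sketch: sign a dataset so consumers can verify integrity and origin.
# Requires the third-party "cryptography" package (pip install cryptography).
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

provider_key = Ed25519PrivateKey.generate()   # held by the data provider
public_key = provider_key.public_key()        # distributed via the PKI

dataset = b"species,count\nSalmo trutta,12\n"
signature = provider_key.sign(dataset)

# Any library node or end user can now check the dataset before use.
try:
    public_key.verify(signature, dataset)
    print("signature valid: dataset is authentic and unmodified")
except InvalidSignature:
    print("signature invalid: dataset altered or wrong provider")
```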

• Computation:

A fully digital, interactive library system requires substantial computational and storage resources both in servers and in a distributed computational environment. Little is known about the precise scope of the necessary resources, and so experimentation will be needed to determine it. Many existing information retrieval techniques are extremely intensive in both their computational and their input-output demands as they evaluate, structure, and compare large databases that exist within a distributed environment. In many areas that are critical to digital libraries, such as knowledge representation and resource description, or summarization and navigation, even the basic algorithms and approaches are not yet well defined, which makes it difficult to project computational requirements. It does appear likely, however, that many operations of digital libraries will be computationally intensive—for example, distributed database searching, resource discovery, automatic classification and summarization, and graphical approaches to presenting large amounts of information—because digital library applications call for the aggregation of large numbers of autonomously managed resources and their presentation to the user as a coherent whole.
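One simple instance of such aggregation is fanning a query out to many autonomously managed sources in parallel and merging the answers into a single coherent result, as in this sketch. The sources, their holdings, and the query function are stand-ins for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def query_source(source: str, term: str) -> list[str]:
    """Stand-in for querying one autonomously managed resource.

    A real node would speak whatever protocol that resource exposes.
    """
    fake_holdings = {
        "museum-a": ["Salmo trutta", "Esox lucius"],
        "museum-b": ["Salmo trutta"],
        "herbarium": [],
    }
    return [s for s in fake_holdings.get(source, []) if term in s]

sources = ["museum-a", "museum-b", "herbarium"]

# Fan the query out in parallel, then present one coherent answer.
results: dict[str, list[str]] = {}
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(query_source, s, "Salmo"): s for s in sources}
    for fut in as_completed(futures):
        results[futures[fut]] = fut.result()

merged = sorted({name for hits in results.values() for name in hits})
print(merged)   # ['Salmo trutta']: one coherent view of many sources
```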

 

• Storage:

Even though the library system we are proposing here would not set out to accrue datasets or to become a repository for all biodiversity data (after all, NASA and other agencies have their own storage facilities, and various data providers will want to retain control over their own data), massive storage capabilities on disc, tape, optical or other future technology (e.g., holography) will still be required. As research is conducted to devise new ways to manipulate huge datasets, such datasets will have to be sought out, copied from their original source, and stored for use in the research. And, in serving its long-term curation function, the library will accumulate substantial amounts of data for which it will be responsible. The nodes will need to mirror datasets (for redundancy to ensure data persistence) of other nodes or other sites, and this function will also require storage capacity.
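A minimal sketch of the mirroring function: copy a dataset to a replica location, then verify the copy bit for bit with a cryptographic checksum computed in blocks, so that arbitrarily large files never need to fit in memory. The file-based layout is an assumption for illustration.

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum a file in 1 MB blocks so huge datasets fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def mirror(dataset: Path, replica_dir: Path) -> Path:
    """Copy a dataset to a replica node and verify the copy bit for bit."""
    replica_dir.mkdir(parents=True, exist_ok=True)
    copy = replica_dir / dataset.name
    shutil.copy2(dataset, copy)
    if sha256(copy) != sha256(dataset):
        raise IOError(f"replica of {dataset} failed verification")
    return copy

# Hypothetical usage: mirror("surveys/fish_1997.csv") to a sibling node.
# mirror(Path("surveys/fish_1997.csv"), Path("/replicas/node-3"))
```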

 

• Software:

Information management: Major advances are needed in methods for knowledge representation and interchange, database management and federation, navigation, modeling, and data-driven simulation; in effective approaches to describing large complex networked information resources; and in techniques to support networked information discovery and retrieval in extremely large scale distributed systems. In addition to near-term operational solutions, new approaches are also needed to longer-term issues such as the preservation of digital information across generations of storage, processing, and representation technology. Traditional information science skills such as thesaurus construction and complex indexing are currently being transformed by the challenge of making sense of the data on the World Wide Web and other large information sources. We need to preserve and support the knowledge of library and information science researchers, and help scale up the skills of knowledge organization and information retrieval.
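As a small illustration of the automatic-indexing theme, the sketch below builds an inverted index over a toy corpus of dataset descriptions and answers simple conjunctive queries against it. Real systems would add stemming, thesaurus expansion, and ranking; the corpus and tokenization here are assumptions.

```python
from collections import defaultdict

# Toy corpus standing in for dataset descriptions harvested by the library.
documents = {
    "doc1": "salt tolerance gene in Chenopodium of arid rangelands",
    "doc2": "larval fish survey of an invaded watershed",
    "doc3": "Chenopodium habitat salinity and soil records",
}

# Build the inverted index: term -> set of documents containing it.
index: dict[str, set[str]] = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms: str) -> set[str]:
    """Documents containing every query term (simple conjunctive query)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(search("chenopodium"))               # {'doc1', 'doc3'}
print(search("chenopodium", "salinity"))   # {'doc3'}
```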

Data mining, indexing, statistical and visualization tools: The library system will use as well as develop tools for its various functions. Wherever possible, tools will be adopted and adapted from other arenas, such as defense, intelligence, and industry. A reciprocal relationship among partners in these developments will provide the most rapid progress and best results.

 

• Research Issues:

Many of the research issues to be taken up by the researchers who work at the virtual library system have been mentioned in the discussion above. Among the most important issues are content-based analysis, data integration, automatic indexing on multiple levels (of content within databases, of content and quality of databases across disciplines and networks, of compilations of data made in the process of research, etc.), and data cleansing. The latter is a process that at present is extremely tedious, time- and labor-intensive, inefficient, and often ineffectual. Much of the current expenditure on databases is consumed by the salaries of people who do data entry and data verification. Automatic means of carrying out these tasks are a priority if we are to be able to utilize our biodiversity and ecosystems information to protect our natural capital.
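An illustrative sketch of the kind of automated validation pass that could take over some of that manual data verification follows. The record format and the rules (name normalization, coordinate range checks) are assumptions; production cleansing would require far richer domain knowledge.

```python
# Illustrative validation pass of the sort that might replace some manual
# data verification; the rules and record format are assumptions.

records = [
    {"species": "Esox lucius",  "lat": 44.20,  "lon": -93.10, "count": 3},
    {"species": "esox  lucius", "lat": 444.20, "lon": -93.10, "count": 3},
]

def clean(record: dict) -> tuple[dict, list[str]]:
    """Return a corrected record plus a log of problems found."""
    fixed, problems = dict(record), []
    # Normalize whitespace and capitalization in taxon names.
    name = " ".join(fixed["species"].split())
    fixed["species"] = name[:1].upper() + name[1:].lower()
    # Flag coordinates outside the valid range (a common keying error).
    if not -90 <= fixed["lat"] <= 90:
        problems.append(f"latitude out of range: {fixed['lat']}")
    if not -180 <= fixed["lon"] <= 180:
        problems.append(f"longitude out of range: {fixed['lon']}")
    if fixed["count"] < 0:
        problems.append("negative count")
    return fixed, problems

for r in records:
    fixed, problems = clean(r)
    print(fixed["species"], problems or "ok")
```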

 

We have laid out the case for building a fully digital, interactive, research library system for biodiversity and ecosystems information, and the basic requirements of and goals for the library and its research and service. In the 21st Century, work will be increasingly dependent on rapid, coordinated access to shared information. Through the NBII-2, a shared digital library system, scientists and policy makers will be able to collaborate with colleagues across geographic and temporal distances. They will be able to use these libraries to catalog and organize information, perform analyses, test hypotheses, make decisions, and discover new ideas. Educators will be able to use these systems to read, write, teach, and learn. In traditional fashion, intellectual work will be shared with others through the medium of the library—but these contributions and interactions will be elements of a global and universally accessible library that can be used by many different people and many different communities. By increasing the effectiveness and speed with which information is communicated and used, the NBII-2 is likely to lead to major scientific discoveries, promote interdisciplinary synergism, advance existing areas of study, and enable entirely new areas of inquiry.

As Vice President Gore said, we have excess data that are unused. Yet we have paid substantial sums to collect those data, and, if they are analyzed and synthesized properly, they can contribute much to our understanding of biodiversity and ecosystems. Our national natural capital is too critically important for us to fail to devote the time and energy required to learn to use it sustainably. To develop the means to do that, we need to have knowledge and understanding of biodiversity and ecosystems; to develop that knowledge and understanding, we must mine the data that we have and that we are generating for correlations that will identify pattern and process. We must prevent what Mr. Gore referred to as "data rot" and "information pollution" by putting the data to use. To do that effectively, we must employ the tools and technologies that are making data mining possible. If we do not build the fully digitally accessible, interactive, research library of biodiversity and ecosystems information, we will lose the opportunity to realize the fullest returns on our data-gathering investments and also to optimize returns from our natural capital.

We recommend that an appropriate avenue be found for further planning and implementation of the library system. The planning panel should include knowledgeable individuals from government, the private sector, and academia. It should further develop the interactive research library concept, and design a plan whereby sites would be proposed and chosen and the work carried out. A request for proposals will be needed, as will a means of selecting the most meritorious of the responses. Many government agencies will of necessity be involved in this process, and all should contribute expertise where needed, but we recommend that the NSF take the overall lead in the process, supported by the NSTC Committee on Technology (CIT), with participation from agencies that hold biodiversity and ecosystems information but which are not members of the CIT.

Each of the regional nodes that will form the core of the digitally accessible, interactive research library system will require an annual operating budget of at least $8 million. Supporting five or six such nodes (the number we regard as adequate to the task) and the high-speed connections among them will therefore require a minimum of $40 million per year, an amount that represents a mere fraction of the funds spent government-wide each year to collect data (conservatively estimated at $500 million)—data that may or may not be used or useful because the techniques and tools to put them to optimal use have yet to be developed. As with the Internet itself, and other computer and information technologies, the Federal government plays a "kickoff" or "jumpstart" role in the institution of a new infrastructure. Gradually, support and operation of that infrastructure should shift to other partners, just as has happened with the Internet, although in this case there will have to be at least a modicum of permanent Federal support (for the maintenance of its own data, for instance).

The planning and request-for-proposals process should be conducted within one year. Merit review and selection of sites should be complete within the following six months. The staffing of the sites and initial coordination of research and outreach activities should take no more than a year after initial funding is provided. The "lifetime" of any one facility should probably not be guaranteed for any more than five years, but the system must be considered a long-term activity, so that data access is guaranteed in perpetuity. Evaluation of the sites and of the system should be regular and rigorous, although the milestones whereby success can be measured will be the incremental improvements in ease of use of the system by policy-makers, scientists, householders, and even school children. In addition, an increasing number of public-private partnerships that fund the research and other operations will indicate the usefulness of accessible, integrated information to commercial as well as governmental concerns.
 
