This paper was adapted from one prepared for the AAAS-ICSU Press-UNESCO Workshop on Developing Practices and Standards for Electronic Publishing in Science, Paris, October 12-14, 1998.

Science is a cumulative activity in which published works make up the stock of scientific ideas. The timespan and spatial reach of science transcend the work of the individual researcher, who digs into the past for inspiration and understanding, and in his turn lays another course of knowledge for those who will follow. That is the nature of scientific inquiry: to relate and position ideas, insights, and data.

In his classic work Science in History[1], J. D. Bernal describes that cumulative tradition of science:

"The methods of the scientist would be of little avail if he had not at his disposal an immense stock of previous knowledge and experience. None of it probably is quite correct, but it is sufficiently so for the active scientist to have advanced points of departure for the work of the future. Science is an ever-growing body of knowledge built of sequences of the reflections and ideas, but even more of the experience and actions, of a great stream of thinkers and workers."

Containers of Ideas and Scientific Contributions

The stock of scientific knowledge and experience is stored in books and journals — in scholarly writing. In the fields of science, technology, and medicine, the form of that writing has more and more been fixed on the journal; monographs have become much less important for scientific communication. The journal article has become the main channel:

"[The scientific article] is the object around which the whole fabric of writing, publishing, and reading is centered. Scientific articles are the prime representation forms for scientific information; they are closed entities, easily portable and well structured as a result of a century long tradition of scientific publishing." [2]

The scientific journal is the mainstay of that institutionalized communication system, and to it has been assigned the following functions:

  • to be a physical distribution package for articles,
  • to be a physical access point in the storage, i.e., the libraries' shelves,
  • to be a quality indicator (of varying level depending on reputation of the journal),
  • to provide a formalized means of collecting payments through subscription fees, and
  • to give an overall indication of the subject matter of articles.

In that communication system libraries play a central role as a "store-and-forward" node. Although libraries are tied to the book and consider journals to be fragmented books, with the growth in importance of the scientific article as the fundamental unit for communicating scientific contributions, librarians had to treat articles individually. They created pointers to library articles by working on what they called "a deeper analytical level." However, the citation system they used did not require the full bibliographic description of the journal, and another level of references — abbreviated for space and speed — was developed for journal citations, based partly on cataloging rules.

Those citations link ideas by linking the containers, i.e., the articles or works. Citations can be used to trace ideas or facts, and they can be used when searching for associated ideas (by citation or bibliographic couplings).

That entire system is being challenged by electronic journals. The global, easily accessible telecommunications network allows scholars to exchange ideas and review each other's work electronically. As the network grows, so does electronic publishing, reaching more people faster. That growth diminishes the role of the paper-bound article.

Moreover, the growth of electronic publishing calls into question the functional justifications for the printed scientific journal: The physical packaging and the physical access on the shelves are clearly not needed for electronic information; quality does not depend on print, and subject classification does not rely on paper. Economics is not the issue, either. Although payment schemes will have to be revised to be reasonable in an electronic environment, no one doubts that such a revision will come about.

As many have pointed out, scientific contributions ("articles") in electronic form do not have to be, and should not be, bound by the print tradition. As we move from the traditional domain of print, the scientific and library communities must define the new (electronic) units of scientific contributions.

Cataloging Electronic Entities

The traditional cataloging rules are not adequate for describing electronic entities. As Shadle writes, "With the use of the Internet as a supplement to, or replacement for, the traditional publishing process, the flaws of a cataloging code developed for a different information environment are becoming more apparent."[3]

Concepts such as "issue," "published," and "pages" will change their meanings when print becomes electronic. We don't see the shift yet because much scientific electronic publishing is still an electronic version of print, which obfuscates the change, but it will be apparent with time.

As Shadle concludes: "Many libraries have, with varying degrees of success, integrated electronic journals into their catalogs and Web catalogs are providing direct access to these journals. . . . Changes in the cataloging code will be required in order for the cataloging community to better integrate electronic journals into library catalogs and collections." He points to a more detailed discussion of these issues in a paper by Jean Hirons and Crystal Graham, "Issues Related to Seriality" (which has been the basis for a report to the AACR Joint Steering Committee: http://www.nlc-bnc.ca/jsc/ser-rep0.html).

Moreover, the new citation will need information we do not now provide in standard citations. For traditional material the usage rights are agreed upon at the time of purchase, or are already settled by agreements or legislation (e.g., the right to circulate a book to the public). With digital objects, which can contain "multimedia" sections such as video and sound, the copyright situation is very complex and various restrictions can be imposed on individual documents (or even parts thereof). We need new dimensions of description: Intellectual Property Rights and technical dimensions of the documents. They will contain not only requirements for licenses, but also information about the hardware and software for reading and viewing the entities.

Some organizations, AACR among them, are trying to update traditional cataloging rules to incorporate electronic publications. Others are trying to define schemes to describe electronic entities without reference to older methods. For some, the term "cataloging" has been replaced by "metadata" description [4]. Metadata, as the term suggests, is data about data, just as a catalog record is data about a book. One metadata scheme is the "Dublin Core," which is sponsored and promoted by OCLC, and which has achieved considerable acceptance. (For more information about the Dublin Core, see http://dublincore.org/)
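
To make the idea of a metadata record concrete, the following is a minimal sketch in Python, not an official encoding of any scheme. It checks a record against the fifteen element names of the Dublin Core element set. The function name validate_record and every value in example_record are invented for illustration. The "rights" and "format" elements are where the rights and technical dimensions discussed above might be recorded.

```python
# The fifteen element names of the Dublin Core element set.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def validate_record(record):
    """Return the sorted keys of `record` that are not Dublin Core element names."""
    return sorted(key for key in record if key not in DUBLIN_CORE_ELEMENTS)


# Hypothetical record: every value below is invented for illustration.
example_record = {
    "title": "An example electronic article",
    "creator": "A. N. Author",
    "date": "1998-10-12",
    "format": "text/html",          # technical dimension: how to read the entity
    "identifier": "example-id-0001",
    "rights": "Licence required for redistribution",  # rights dimension
}

# An empty list means every key is a recognized element name.
unknown_keys = validate_record(example_record)
```

A print-oriented field such as "pages" would be reported by validate_record as unknown, which is one small way of seeing that such a scheme does not presuppose the printed form.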

Beyond URLs

To make linking possible it is necessary to have something to point to; therefore the identification of entities is crucial. Preferably there should be a unique identifier for every electronic document.

The enormous growth, and dominance, of the Internet and in particular the World Wide Web has made the Web address, the URL, almost synonymous with an identifier of an entity (or document). However, the URL is not acceptable as a reference point since it is not stable. Reorganizations and updates of local computer systems often lead to changes in the URLs. And since the Web is just one step away from desktop publishing, there will be material from publishers that can (and does) disappear as fast as it appears; there is no guarantee that today's Web site will be around tomorrow.

There are several international efforts going on to address the problem of the impermanence of links and of unique identifiers:

  • PURLs:

    "As a stopgap measure to address some of the problems with the persistence of URLs, about two years ago OCLC deployed a system called the PURL (Persistent URL). Basically, PURLs are HTTP URLs where the usual hostname has been replaced with the host 'PURL.ORG' and the filename is an identifier for the 'real' content being referenced. The PURL.ORG host will be maintained for the long term by OCLC under that name."[5] More information about PURLs can be found at http://www.purl.org.

  • URNs:

    Other approaches to establish a scheme to assign identifiers to electronic entities focus on the logical content, and not the physical location: there is the Uniform Resource Name, URN, which is being developed by the Internet Engineering Task Force, IETF. Documents from the working group on URN can be found at http://www.ietf.cnri.reston.va.us/html.charters/urn-charter.html

  • DOIs:

    A scheme which has received attention and is beginning to be adopted is the Digital Object Identifier, DOI, which is both an identifier and a routing system.[6] The home page for the DOI system is at http://www.doi.org.

  • Other naming schemes:

    There are also a number of initiatives from the publishing and media industries: The Serial Item and Contribution Identifier (SICI); The Book Item and Component Identifier (BICI); The Publisher Item Identifier (PII); The Common Information System (CIS) and the International Standard Work Code (ISWC). Green and Bide provide an overview at http://www.bic.org.uk/bic/uniquid.
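
The common idea behind these schemes, a stable name that is resolved to a changeable location, can be sketched in a few lines of Python. This is an illustration only and does not reproduce how PURL.ORG, the URN proposals, or the DOI system are actually implemented. The class Resolver, its methods, and all the ".example" hosts are hypothetical.

```python
class Resolver:
    """Toy resolver: maps stable names to the current location of a document."""

    def __init__(self, base):
        # `base` plays the role of the long-lived resolver host.
        self.base = base.rstrip("/")
        self._locations = {}

    def register(self, name, url):
        """Record where `name` lives now and return the address to cite."""
        self._locations[name] = url
        return f"{self.base}/{name}"

    def move(self, name, new_url):
        """Update the location after a reorganization; citations stay unchanged."""
        if name not in self._locations:
            raise KeyError(name)
        self._locations[name] = new_url

    def resolve(self, persistent_url):
        """Translate a cited address into the document's current location."""
        prefix = self.base + "/"
        if not persistent_url.startswith(prefix):
            raise ValueError("address was not issued by this resolver")
        return self._locations[persistent_url[len(prefix):]]


# Hypothetical hosts and paths, for illustration only.
resolver = Resolver("http://resolver.example")
cited = resolver.register("docs/article-1", "http://old-host.example/a1.html")

# The local system is reorganized; only the resolver's table is updated.
resolver.move("docs/article-1", "http://new-host.example/archive/a1.html")
current = resolver.resolve(cited)
```

The citing document stores only the value of cited. The stability of the reference then depends on the resolver being maintained, which is an organizational commitment and not a technical one.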

Implosion and Explosion of the Electronic Scientific Article

The concept of the scientific article in the digital realm is subject to two fundamental changes: One is implosion, a burst inward, and can be discussed in terms of the granularity of scientific contributions; the other is explosion and can be discussed in terms of scattering.

Kircz discusses granularity starting with the traditional article, which is meant to be a "stand-alone" contribution, with content imported from cited sources; in the digital realm such imports are not necessary, and can be replaced by hyperlinks. An article might then start with a few links instead of a "stage-setting" quote, as I have done in this article. Scientific writing is often repetitious; with an increased modularity in the contributions we might get a different anatomy of articles. In the digital realm an article's focus can be more on new content and less on the composition and design of a treatise.

In an article with Harmsze, Kircz identifies three categories of information in scientific publishing: microscopic information, mesoscopic information, and macroscopic information. The first is specific for the work presented; the mesoscopic information is shared by a series of articles within a research program, such as a description of instrumentation; and macroscopic information plays a role in a wider scientific context, such as theoretical foundations.

"This first division of information proves very useful, as a lot of the repetition of information can now be avoided. If something is already described in article one, in article two we only have to refer to that information plus a possible addition of how some aspects have changed. Of course, this demands that the original information is written in such a form that reuse is possible. This emphasizes the point that a modular model is not aimed at a simple recasting of existing articles, but at writing new articles in a fully electronic context." [7]

The explosion of the article comes from the technical basis for producing electronic documents. In printed material, information from different technologies — such as text processors and photographic systems — is brought together in the printed product. With electronic information it is usually more convenient to let the different parts reside in their respective technologies and link them together when presented. The explosion of the article, then, leads to scattering of contents due to "multimediality." The convergence of digital technology makes it natural to combine, by linking, different types of information such as text, moving images, and sound. The move towards electronic documents has been strong in science, technology, and medicine, but the multimediality might well become even stronger in the social sciences, since social phenomena can, in general, better be described by using moving images and sound.

Multimedia documents that are "born digital" naturally appear as parts that are linked together, since the parts are usually produced by different tools, or are collected from different sources or data-capture devices.

Electronic Links

With the emergence of electronic documents there have been adaptations of the rules and formats of citations. Those new citation rules are modifications of their counterparts for traditional (print) media. There is an ISO standard: ISO 690-2, "Information and documentation - Bibliographic references - Part 2: Electronic documents or parts thereof" (http://www.nlc-bnc.ca/iso/tc46sc9/standard/690-2e.htm), and the established APA and MLA formats have also been adapted to electronic documents (see, for example, http://english.ttu.edu/kairos/1.2/inbox/mla_archive.html and http://www.apa.org/journals/webref.html).

The two changes in the anatomy of the scientific article (granularity and multimediality) make it necessary, however, to consider the identification question more broadly.

Green and Bide claim that "there are many things which the publishing industry can profitably learn from the unique identification schemes which the international music industry is adopting and much to be gained from working to develop at the least a compatibility of approach to the same or similar problems."

They discuss to what level of detail content has to be identified:

"The ISBN identifies the whole book; the SICI identifies the journal issue and, appropriately extended, the individual article within the issue. This may be enough for some uses but is clearly inadequate for others. If we are to be able to identify all rights owners in a particular piece of content, that may require a far finer degree of granularity of identification, to the level of the individual illustration or quotation from another source. Similarly, if information is to be traded with customers at a level of granularity finer than the 'chapter' or the 'article', then publishers may have compelling marketing reasons for being able properly to identify and to keep track of what is being traded."

I advocated a broader approach in "Digital library work: Meeting user needs,"[8] to bring in the aspect of, and the need for, convergence with the archival sector and with records management. As electronic access to collections, both in libraries and archives, becomes easier and more widespread, users will want a simple, unified access system. That means there is a need for convergence between cataloging rules and archival description as well as coordination of classification systems.

The major problem areas for citations in the digital realm are both philosophical and physical. The philosophical problem is in defining the unit carrying scientific contributions: What should be cited, what should be pointed to? The physical problems are those of longevity and authenticity.

The threat to longevity of online information is primarily one of impermanence of infrastructure: the government or business enterprise on the macro level, and the computer itself on the micro level. For information on tape and CDs, the major threat is technical obsolescence: the fear is that reading and viewing systems (software and hardware) will change in incompatible ways. The deterioration of the physical carrier is also a threat, but is comparatively easier to deal with. In addition to the technical difficulties, there is also the problem of intellectual property rights. A preservation policy of migration of the information to new systems will require making copies, perhaps digitally; permission to do so cannot be taken for granted.

The problem of authenticity is not unique to the digital realm: Forgeries are legion in the print world, as is fabricated scientific evidence. With digital information it is more difficult to discover forgeries, since copies and originals are indistinguishable. Indeed, one of the benefits of digital information is that it can be adapted to particular users and customized to individual tastes and needs. So digital information has an inherent "weakness" that has to be controlled for in some way.

Alternative Scenarios

If we cannot trust the historical corpus of electronically recorded scientific knowledge, we can use print as archival back-up and rely on traditional systems with libraries and archives to provide the "originals."

If we have concerns about the authenticity of an article, we could, of course, make reviewers and judges the quality guarantors, vouching for the soundness of the scientific contribution.

However, such an approach might reverse the modern development of science and push science back toward authoritarianism and myth, because it would no longer depend on the written record, and instead depend on an individual's opinion. If judges of science have the power to define what is good or bad based on personal interpretations, they can hinder, or at least delay, new knowledge; paradigm shifts will be almost impossible.

Using print archives to solve the problems of electronic impermanence and fraud is impractical and expensive; considering the volume of information, it is not really feasible. So we have to work on developing a reliable structure and organization for citations in the digital realm.

Toward Best Practices

Clearly what is needed is a network of archival servers where electronic information, both granular and exploded, can reside and be guaranteed to be unadulterated when accessed and extracted. Such a network will have to be set up in a large organizational context, perhaps at a national level, because the commitment must come from an authority with an expected lifespan of centuries. The implications for the organization and financing of such an endeavor are only just beginning to be discussed seriously. The roles of national authorities, universities, libraries, archives, and other parties are gradually being examined and considered.
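
One way such a guarantee might be checked, offered here as an assumption and not as a description of any existing archive, is to record a cryptographic digest when a document is deposited and to recompute it when the document is extracted. The sketch below uses SHA-256 from Python's standard hashlib module. The function names and the sample text are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 digest of `content` as a hexadecimal string."""
    return hashlib.sha256(content).hexdigest()


def is_unaltered(content: bytes, recorded_digest: str) -> bool:
    """Compare the digest of retrieved content with the one recorded at deposit."""
    return fingerprint(content) == recorded_digest


# Hypothetical deposit: the archive stores the digest alongside the document.
deposited = b"Text of a scientific contribution as deposited."
recorded = fingerprint(deposited)

# Later retrievals: one faithful copy, one silently modified copy.
faithful_copy = bytes(deposited)
modified_copy = deposited.replace(b"deposited", b"revised")

faithful_ok = is_unaltered(faithful_copy, recorded)
modified_ok = is_unaltered(modified_copy, recorded)
```

A digest of this kind only shows that the bits are the same as those deposited. It says nothing about who deposited them or whether the recorded digest itself can be trusted, which is why the organizational commitment remains the central issue.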

Creating the network of archival servers will take some time. In the meantime we will have to establish new standards for scientific scholarly publishing. Two ideas that are easy to implement and will go far toward solving some of these problems are extended quotes and redundant citations.

Self-sufficiency in the contributions can be achieved by extended quotes (as illustrated in this article), or by deposit of hard copies at an established caretaker, for example, a university archive or library.

Redundancy in the linking of works can be achieved by adding access points to the individual scholar and his organization, both electronic and physical (the postal address). Some examples are given at the end of this article.
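
A redundant citation can be thought of as a description of the work plus an ordered list of alternative access points. The following Python sketch is one possible representation, under the assumption that a reader tries each access point in turn. The names Citation and first_reachable, and the is_reachable test passed in by the caller, are hypothetical. The example values are taken from the reference to Lynch in this article.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Citation:
    """A work plus an ordered list of (kind, value) access points."""
    description: str
    access_points: List[Tuple[str, str]] = field(default_factory=list)


def first_reachable(
    citation: Citation,
    is_reachable: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Return the first access point the caller's test accepts, or None."""
    for kind, value in citation.access_points:
        if is_reachable(kind, value):
            return (kind, value)
    return None


lynch = Citation(
    description='Lynch, Clifford, "Identifiers and Their Role In Networked '
                'Information Applications"',
    access_points=[
        ("url", "http://www.arl.org/newsltr/194/identifier.html"),
        ("work e-mail", "clifford@cni.org"),
    ],
)

# Simulate the case where the Web address has disappeared:
# the reader falls back to contacting the author.
fallback = first_reachable(lynch, lambda kind, value: kind != "url")
```

The order of the access points expresses a preference, not a guarantee; a postal address or an employer's Web page could be appended in the same way.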

In this way we grant alternative access for the links to ideas, works, and individual scientists; we make connections from the electronic world to the world of atoms, molecules, and living things.



Mats G. Lindquist is director of the main library for science, technology, and medicine (UB2) at Lund University. He has taught and been on the faculty at the schools of library and information science in Turku, Finland and Borås, Sweden. He is a "docent" (Associate Professor) in information management at Åbo Akademi University in Turku. From 1979 to 1992 he was managing director and marketing manager for the software company Paralog, specializing in information retrieval and text-database management. He started in the field of Library and Information Science in 1970 as a research scholar at the Royal Institute of Technology Library, Stockholm. You may contact him by e-mail at mglindquist@hotmail.com.


References

1. Bernal, J.D., Science in History (Cambridge: MIT Press, 1971), rev. ed., 1:43.

2. Kircz, Joost G., "Modularity: the next form of scientific information presentation?" Journal of Documentation 54(March 1998):210-35. [doi: 10.1108/EUM0000000007185]

3. Shadle, Steven C., "A square peg in a round hole: Applying AACR2 to electronic journals," The Serials Librarian 33(1998):147-66. [doi: 10.1300/J123v33n01_09]

4. Koch, Traugott, "Description of the form and content of resources: Metadata," [formerly http://www.lub.lu.se/tk/metadata/metadata-general.html], accessed Sept. 18, 1998.

5. Lynch, Clifford, "Identifiers and Their Role In Networked Information Applications," http://www.arl.org/newsltr/194/identifier.html, accessed Sept. 19, 1998.

6. Green, Brian, and Mark Bide, "Unique Identifiers: a brief introduction," http://www.bic.org.uk/bic/uniquid, accessed Sept. 18, 1998.

7. Harmsze, Frédérique-Anne P. and Joost G. Kircz, "Form and content in the electronic age," paper presented at the IEEE-ADL '98 Advances in Digital Libraries Conference, Santa Barbara, Calif., April 22-25, 1998.

8. Lindquist, Mats G., "Digital library work: Meeting user needs," http://tiepac.portlandpress.co.uk/books/online/tiepac/session5/ch1.htm, accessed Sept. 12, 1998. In The Impact of Electronic Publishing on the Academic Community, edited by Ian Butterworth, (London: Portland Press, 1997), http://tiepac.portlandpress.co.uk/tiepac.htm, accessed Sept. 12, 1998.


Quotes

From Lindquist: {Bibliographic control} "'Find a model of description and organization that makes logical and physical access efficient and cost effective.' (The term 'bibliographic control' is not semantically correct in the digital order, but will do in the transition period.) The current practice of describing electronic publications in a print-oriented description model such as the cataloguing rules will not be adequate for very long. There is a need for 'EECR' - electronic entities cataloguing rules. A constructive initiative is the one by the (United States) National Institute of Standards and Technology to develop a 'federal information processing standard (FIPS) for a data standard for record description records', announced in the Federal Register 28 February 1995. This initiative also points to another necessary development: that of unifying descriptions in the library and in the archive world; provenance will, for example, be of increasing importance for library material (this issue is sometimes addressed in terms of 'meta-information'). In a world of increasing co-operation and exchange it is not effective to develop local cataloguing rules, but to participate in international work on this. In the transition period it is necessary to find a balance between the efforts spent on describing the traditional form and the electronic forms. To catalogue electronic works is more expensive than for traditional ones, and yet is more important since an undescribed electronic publication is more difficult to handle (and can easily be unusable due to lack of description)." [quote captured Sept. 17, 1998.]


Additional Locations

For Lindquist: Employer's Web page: http://www.lu.se Work unit's Web page: [formerly http://www.ub2.lu.se] Private e-mail: mglindquist@hotmail.com

For Koch: Employer's Web page: http://www.lu.se Work unit's Web page: [formerly http://www.lub.lu.se/netlab/index.html]

For Lynch: Work e-mail: clifford@cni.org

For Shadle: Work e-mail: shadle@u.washington.edu
