This page serves as a place to collect thoughts (and resources) on the topic of managing information for the purposes of:
The objective of collecting these thoughts is to organize a rationale for how to use the various available tools on a collaborative scientific project which is expected to generate information products of lasting value. In the near term, a specific project will be driving the development of this solution. However, it is hoped that the salient features of both the problem and the solution can be distilled into a more generalized understanding which will aid projects of this type.
For the purposes of this effort, modern information systems are defined as comprising both humans and machines. Humans are involved at the start and are ultimately the audience for any information. It is the humans' job to understand the information asset so that they can effectively discharge their roles as producers and consumers of information. Between the producers and consumers are the custodians, which are often machines.
In the above figure, the rectangles represent individuals involved with the understanding of the material. The circle represents the storage system, which is capable of storing, preserving, indexing and retrieving the material without attempting to understand it.
Information management, in the context of this producer-custodian-consumer model, involves defining the responsibilities and expectations of each role as well as how the roles interact.
If the above were:
Consistency is expected from members of the same collection. There are three levels of consistency:
Automated extraction of information about characteristic properties is a requirement for large holdings of information resources, and it requires a consistent, deterministic expression. Relying on a human reader to interpret information on a case-by-case basis is simple to set up but requires a high level of tedious and error-prone effort to maintain. Additionally, properties which are only accessible to a human reader are not available for searching and filtering.
The criteria for collection membership must be consistently applied to all members of the collection. If criteria are to change or the collections are to evolve, the collection system itself should be versioned, and each member must be updated such that it is consistent with the new definition.
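A versioned collection definition of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `CollectionDefinition`, `Member` and `inconsistencies`, and the idea of storing a schema version on each member, are hypothetical and not part of any existing system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CollectionDefinition:
    """Hypothetical versioned membership criteria for a collection."""
    version: int
    required_properties: frozenset


@dataclass
class Member:
    """Hypothetical collection member: a schema version plus property values."""
    schema_version: int
    properties: dict = field(default_factory=dict)


def inconsistencies(member, definition):
    """Return a list of reasons the member fails the current definition."""
    problems = []
    if member.schema_version != definition.version:
        problems.append(
            f"schema version {member.schema_version} != {definition.version}"
        )
    # Compare the required property names against those actually present.
    missing = sorted(definition.required_properties - set(member.properties))
    if missing:
        problems.append("missing properties: " + ", ".join(missing))
    return problems
```

An empty list means the member is consistent with the current definition. When the definition changes, running the check over every member shows which ones still need to be updated.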
Consider the "collection" of reports in the US Forest Service Fire Effects Information System (FEIS). In the article describing Abies fraseri, the states in which Abies fraseri exists are coded as follows:
<b>STATES : </b> NC TN VA
In the same collection, the report describing the effects of fire on Anas acuta presents the states in which the bird lives as:
<b>STATES : </b> <table cellspacing="5" cellpadding="5" border="0"> <tr> <td>AL</td> <td>AK</td> <td>AZ</td> <td>AR</td> <td>CA</td> <td>CO</td> <td>CT</td> <td>DE</td> <td>FL</td> <td>GA</td> </tr> ... </table>
In the first case, a whitespace-separated list is used, and in the second case, HTML table markup is used. Other inconsistencies include the designation of the name of the property to which values are being assigned: sometimes a boldface string is used, sometimes a third-level heading is used. The boldfaced string is especially problematic for automatic processing because it is not guaranteed that all bold text signifies a property name. It is only because the bold text sits on its own line and is followed by a colon that the human reader infers that a list of states follows. Likewise, the human will immediately recognize that the states form a "list" and that the table markup is merely a device to make the list more compact. There is no significance to states being in the same row or the same column.
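To make the cost of this inconsistency concrete, the sketch below recovers the same list of state codes from both expressions. It assumes Python's standard `html.parser` module, and the function name `parse_states` is hypothetical. It also depends on the very heuristics the human reader uses (a "STATES :" label followed by two-letter uppercase tokens), so it is fragile and is not a recommended extraction method.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, discarding all markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_states(fragment):
    """Return the state codes that follow a 'STATES :' label.

    Handles both a whitespace-separated list and table markup, because
    tags are discarded and only two-letter uppercase tokens are kept.
    """
    extractor = _TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    text = " ".join(extractor.chunks)
    # Everything after the label is treated as the value of the property.
    _, _, after_label = text.partition("STATES :")
    return re.findall(r"\b[A-Z]{2}\b", after_label)


plain = "<b>STATES : </b> NC TN VA"
table = "<b>STATES : </b> <table> <tr> <td>AL</td> <td>AK</td> </tr> </table>"
print(parse_states(plain))  # ['NC', 'TN', 'VA']
print(parse_states(table))  # ['AL', 'AK']
```

If the collection had expressed the property in a single way, none of this guesswork would be needed.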
In the situation where a human consumer is intended to interpret and process the information, authors are free to use any expression which effectively communicates to the human. When the intent is to supply information to a computer, authors must effectively communicate meaning to the computer. Once the computer understands the content, it can present the information to the human in a variety of ways.
Ideally, there is a one-to-one relationship between a concept and the property value which represents that concept. This ensures that the Producers who supply property values are following the same rules used by the Consumers who construct searches. Falling short of this ideal leads to situations where a Producer tags an article with the string "CO2" and a Consumer searches for "carbon dioxide". The Custodian will not return the desired information resources because the Producer and the Consumer used different means of expressing the same thing.
To solve this problem in the field of chemistry, the International Union of Pure and Applied Chemistry (IUPAC) has made an attempt to standardize the means of referring to chemical substances and compounds. This standard method is called the International Chemical Identifier (InChI). The fixed-length equivalent of this identifier, InChIKey, looks much like a random string of characters. Both identifiers may be automatically generated from a description of the molecular structure itself. In addition, there is a human-readable standard name. One public database of many chemical compounds and substances is PubChem.
For the purposes of having the Producers and Consumers follow the same rules, it is clear that an algorithm which deterministically generates a text representation of a chemical from its structural information is preferable to a human-readable name. Human nomenclature tends to evolve over time, and alternative names are often widely recognized by practitioners. Successfully searching by a chemical name may depend on knowing the person who entered the information into the collection of holdings, or knowing when it was entered.
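The difference between matching on names and matching on a canonical identifier can be sketched as follows. Note that a hand-written lookup table stands in for the InChI algorithm here. A real system would generate the identifier from the molecular structure, which this sketch does not attempt. The names `CANONICAL`, `tag` and `search` are hypothetical.

```python
# Hypothetical controlled vocabulary: every accepted spelling maps to one
# canonical identifier (here, the InChI strings for two simple molecules).
CANONICAL = {
    "co2": "InChI=1S/CO2/c2-1-3",
    "carbon dioxide": "InChI=1S/CO2/c2-1-3",
    "h2o": "InChI=1S/H2O/h1H2",
    "water": "InChI=1S/H2O/h1H2",
}


def canonical(term):
    """Map a free-text term to its canonical identifier, or raise KeyError."""
    return CANONICAL[term.strip().lower()]


def tag(title, term):
    """Producer side: store the canonical identifier, not the typed string."""
    return (title, canonical(term))


def search(holdings, term):
    """Consumer side: resolve the query the same way before comparing."""
    wanted = canonical(term)
    return [title for title, identifier in holdings if identifier == wanted]


holdings = [tag("Flux study", "CO2"), tag("Hydrology report", "water")]
print("CO2" == "carbon dioxide")           # False: naive string match fails
print(search(holdings, "carbon dioxide"))  # ['Flux study']
```

Because both roles pass through the same `canonical` step, the Custodian only ever compares identifiers that were produced by the same rule.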
Identification of chemical substances to humans and the computer may require "related properties." These properties should be synchronized to all refer to the same item or concept. In this example:
| Property Name | Allowed Values | Example |
| --- | --- | --- |
| IUPAC_Name | Official name of the chemical | carbon dioxide |
| InChI | String generated by the InChI algorithm | InChI=1S/CO2/c2-1-3 |
| InChIKey | String generated by the fixed-length InChI algorithm | CURLTUGMZLYLDI-UHFFFAOYSA-N |
| InChIVersion | Version of the InChI algorithm used to generate the identifiers | 1.02 |
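A Custodian can at least check that such a group of related properties is complete and syntactically plausible. The sketch below does only that. It does not verify that the identifiers are synchronized with each other, since that would require regenerating them with the InChI software. The function name `check_substance` and the dictionary layout are assumptions made for illustration.

```python
import re

# InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 chars).
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

REQUIRED = ("IUPAC_Name", "InChI", "InChIKey", "InChIVersion")


def check_substance(record):
    """Return a list of syntactic problems found in a substance record."""
    problems = []
    for name in REQUIRED:
        if not record.get(name):
            problems.append(f"{name} is missing")
    inchi = record.get("InChI")
    if inchi and not inchi.startswith("InChI="):
        problems.append("InChI must start with 'InChI='")
    key = record.get("InChIKey")
    if key and not INCHIKEY_RE.match(key):
        problems.append("InChIKey is not in the 14-10-1 letter layout")
    return problems


record = {
    "IUPAC_Name": "carbon dioxide",
    "InChI": "InChI=1S/CO2/c2-1-3",
    "InChIKey": "CURLTUGMZLYLDI-UHFFFAOYSA-N",
    "InChIVersion": "1.02",
}
print(check_substance(record))  # []
```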
For the case where the number of allowable property values is manageably small, it is possible to enumerate the identifiers. Enumeration effectively controls the representation of a unique concept to the computer, but it does not communicate the concept itself to the human. For instance, setting the "Instrument" property to the value "1" effectively distinguishes instrument number one from all the other instruments. However, it communicates nothing about that instrument: what it measures, its make, model and serial number, or when it was most recently calibrated.
For each property, it is important to define the list of permissible meanings and the list of permissible representations for those meanings. The representation must be machine readable and distinct from all other representations of the same type. The meaning may be automatically comprehensible (as in a table) or written for human consumption (as in prose).
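One way to keep representation and meaning together is sketched below with Python's standard `enum` module. The instrument names and descriptions are invented purely for illustration. The `@unique` decorator rejects two members that share a representation, and looking up an unlisted value raises `ValueError`.

```python
from enum import Enum, unique


@unique  # rejects two members that share the same representation
class Instrument(Enum):
    """Hypothetical enumeration: each value is the machine representation."""
    GAS_ANALYZER = "1"
    ANEMOMETER = "2"


# Permissible meanings, written for human consumption (invented examples).
MEANING = {
    Instrument.GAS_ANALYZER: "Infrared gas analyzer measuring CO2 concentration",
    Instrument.ANEMOMETER: "Sonic anemometer measuring wind velocity",
}


def parse_instrument(raw):
    """Convert a stored value to an Instrument; unknown values raise ValueError."""
    return Instrument(raw)


print(MEANING[parse_instrument("1")])
# Infrared gas analyzer measuring CO2 concentration
```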
There are times when it is insufficient simply to define an enumeration of values, and it is necessary to provide a more complete description of what each value refers to. This strategy collects the descriptive information into one place and allows other managed objects simply to refer to it. When the reference is used as the value of a property, specific meaning can be conveyed: for example, an "instruments used" property could refer to another object within the system, and an "analytes" property could contain a link to a substance description on PubChem. In both cases, the target of the reference explains the meaning of the property value.
When it is necessary to describe specific instances, it is probable that a common "type" has already been defined. Similar if not identical information will be provided about each instance of the type. In the case of an "Instrument", a good common set of properties might be: "Manufacturer", "Model", "Serial Number", "Last calibration date". Each instrument may also have additional or unique information which can be filled out, or which is not machine-interpretable.
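A common "Instrument" type and a reference to one of its instances might look like the following sketch. The class name `InstrumentRecord`, the `instrument/1` reference scheme, the registry and all field values are hypothetical. The four named properties come from the list above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class InstrumentRecord:
    """Common properties shared by every instance of the Instrument type."""
    manufacturer: str
    model: str
    serial_number: str
    last_calibration_date: date
    notes: str = ""  # free text for a human reader; not machine-interpretable


# Hypothetical registry: descriptive information is kept in one place.
REGISTRY = {
    "instrument/1": InstrumentRecord(
        manufacturer="Acme",
        model="GA-100",
        serial_number="SN-0001",
        last_calibration_date=date(2020, 1, 15),
        notes="Shared with a neighbouring lab.",
    ),
}


def resolve(reference):
    """Follow a reference to the object that explains its meaning."""
    return REGISTRY[reference]


# Another managed object simply refers to the instrument by its identifier.
dataset = {"title": "Flux study", "instruments_used": ["instrument/1"]}
for ref in dataset["instruments_used"]:
    print(ref, "->", resolve(ref).model)  # instrument/1 -> GA-100
```

The dataset carries only the reference, so correcting a calibration date in the registry updates the meaning for every object that refers to that instrument.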
There are two options for defining instances of a common type:
An automated custodian becomes more capable when it understands more about the content it is managing. The degree of comprehension can be visualized as a spectrum from "black box" to "completely described".
Three main theorems present themselves:
This material is drawn mostly from the articles "Memory (computers)", "Computer storage" and "Memory hierarchy". This spectrum traces the proximity of information to the actor which operates on it. In general, latency increases and bandwidth decreases for items listed later.
Accessibility, in this case, is defined in terms of efficiently retrieving and operating on the data. In most cases, processes which are human-bound (those that spend the majority of their time waiting on human input, such as word processing) will perform acceptably well with any form of online storage. Processes which are I/O-bound benefit primarily from increasing the bandwidth which can be sustained between the storage device and the processing unit.
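The role of latency and bandwidth can be illustrated with a simple first-order estimate: the time to retrieve an item is a fixed latency plus its size divided by the sustained bandwidth. The function below is a sketch of that arithmetic only. The figures in the usage lines are arbitrary values chosen for illustration, not measurements of any real device.

```python
def transfer_seconds(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate retrieval time as fixed latency plus size over bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Arbitrary illustrative figures: a 1 GB item on a hypothetical device with
# 10 ms latency, at 100 MB/s and then at 200 MB/s sustained bandwidth.
slow = transfer_seconds(1_000_000_000, 0.010, 100_000_000)
fast = transfer_seconds(1_000_000_000, 0.010, 200_000_000)
print(slow > fast)  # True: for large items the bandwidth term dominates
```

For large items the size-over-bandwidth term dominates, which is why I/O-bound processes gain most from higher sustained bandwidth. For very small items the fixed latency dominates instead.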
Using information is one of the primary reasons for gathering it in the first place.
In this section, the plan for presenting the information on this page in an orderly way is outlined. More than one article is likely to be required.