This document discusses the use and limitations of scientific names in biological informatics. It notes that all information about a species is tied to its scientific name, which serves as a link between past and present knowledge. However, changes in syntax and semantics of scientific names over time can impact precision and recall in information retrieval. The use of distinct taxon identifiers can help address issues arising from name changes.
Use and Limits of Scientific Names in Biological Informatics
4. Names as descriptive metadata
All accumulated information of a species is tied to
a scientific name, a name that serves as a link
between what has been learned in the past and
what we today add to the body of knowledge.
- Grimaldi & Engel, 2005, Evolution of the Insects
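9. [Figure: the triangle of reference (semiotic triangle) linking Name, Concept and Specimen. A name evokes a concept, and the concept refers to a specimen. Communication of meaning between persons A and B is shown in four numbered steps, with the captions "It's new!", "I'm famous" and "That's one of those".]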
11. Relevance in information retrieval
[Figure: diagram relating Relevant Elements (what you want) to Selected Elements (what you got), partitioned into True Positives, False Positives, False Negatives and True Negatives.]
#2: My intention is to extend the "artificial world" that Rich introduced into a framework that we use to cast some (not all) of the issues and observations we will see in the next presentations. My intention is to use this framework as a means to better inform the future directions of the CoL. I hope to demonstrate that the information components of the Catalogue of Life provide a critical basis for ensuring that biological information is accessible in units that make biological sense. When it comes to operating within this artificial world, I hope to make the argument that, without taxonomy, it isn't biology.
#3: Scientific names serve to label biodiversity information: information related to species providing (if not the sole, then the key) biological context to associated content, data, information, etc. This includes not just physical observations but, more importantly, anything we record as data, information, or knowledge related to a species.
#4: Scientific names label data objects you might traditionally associate with biodiversity: specimens, surveys, samples, etc. They also, however, provide the sole biological and evolutionary context for gene sequences, scientific publications, images, books, non-scientific articles, news stories, etc.
#5: This system of utilizing names for taxa has been in use for over 250 years. As a result, ALL information related to a species is labeled with a name, or, as we will hear today, some sort of identifying label. Names, therefore, serve as identifiers for taxa the same way we use symbols, numbers and labels as identifiers for objects we refer to in other aspects of our lives.
This ubiquity would imply a key role for names as identifiers for accessing information related to species, since data and information of all stripes are increasingly available online.
If only we can find it.
Names, and their related taxonomic definitions, however, present instabilities that limit their use as identifiers in information retrieval. These problems and their ramifications can impact the integrity of the use and analysis of biological data.
#6: Semiotics is the study of meaning-making. It provides a useful model for describing the relationship between symbols such as names, and the objects to which they refer.
#7: Semiotics distinguishes syntactics, which governs the rules and relationships among names, from semantics, which represents the relations between those labels and the objects to which they refer.
#9: The relationship between syntax and semantics, and how it intersects with our discussion of biological taxonomy, can be illustrated with the triangle of reference, or the semiotic triangle. In the model, there is no direct relationship between the name and the real-world object, the bird, it represents. Meaning, or the relationship between the name and the object, is conveyed only through a concept that exists in the mind of the user of the name.
#10: In taxonomy, a biologist (A) determines that a specimen is sufficiently distinct to constitute a new species, documents the concept or idea of this novelty in a publication, and assigns a name to it. Another person (B) subsequently reading the name, perhaps as a label on a specimen, evokes the concept originally described by the biologist to refer to the specimen. Accurate communication occurs when the concepts held by the writer and the reader are congruent.
#11: In order to function as useful identifiers in information retrieval, be it by visiting the library in person and going through the shelves or by searching online, the relationship between an identifier and the object it identifies needs to be stable and unique. It needs to be one to one. This is why your social security number makes a good identifier and your name does not. Not only did my sister's name change when she got married, she is also not the only Linda Richardson in the country. Likewise, a thumbs-up can mean "good" in America, and it can also mean "may I have a ride?", but in some parts of the world it might get you punched in the nose.
In biology the relationship between nomenclature and taxonomy is not consistent. Both syntax and semantics are subject to change. This inconsistency places limits on how names may be used in biological informatics, both in initially anchoring relevant biodiversity information and in subsequently retrieving and integrating it.
#12: Relevance in the context of information retrieval has two measures: Precision and Recall. This model demonstrates how they differ. As any Google search will demonstrate, the results retrieved via a keyword search do not always deliver what you asked for. Furthermore, you have no way of knowing if some relevant content was missed.
Precision is the proportion of objects returned in a search that are relevant. False positives are those items returned that are not relevant.
Recall is the proportion of relevant objects that are returned relative to all relevant objects actually available. False negatives are relevant items that were not returned.
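The two measures can be written out as a short calculation. The following is a minimal sketch in Python; the function name and the item identifiers are made up for illustration and do not come from the talk.

```python
def precision_recall(relevant, selected):
    """Return (precision, recall) for two collections of item identifiers.

    relevant: the items you wanted (all relevant items that exist)
    selected: the items the search actually returned
    """
    relevant, selected = set(relevant), set(selected)
    true_positives = relevant & selected
    # Precision: share of returned items that are relevant.
    precision = len(true_positives) / len(selected) if selected else 0.0
    # Recall: share of all relevant items that were returned.
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four relevant items exist, the search returns three,
# one of which ("x1") is a false positive.
relevant = {"r1", "r2", "r3", "r4"}
selected = {"r1", "r2", "x1"}
p, r = precision_recall(relevant, selected)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three returned items are relevant, and two of the four relevant items were found, so the same search can score differently on the two measures.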
#13: Bringing these four items together allows me to now articulate how we must use taxonomic sources like the Catalogue of Life to support the publication (or sharing), access and scientific analysis and use of biological data. I will illustrate where the current system places some limits on this use, and some later presentations will demonstrate how they are pushing these limits. I'll try to illustrate how these issues can threaten the delivery, integrity and scope of biological data and the precise nature of that impact.
#14: Many of you are familiar with the Woods Hole squid, Loligo pealeii, and its giant axon, which has been used for decades as a neurophysiological model. This name was originally published in 1821 by Lesueur. Ten or twelve years ago, Michael Vecchione and others published a revision of the loliginids that resulted in this species being transferred to a different genus. It was the same species, the semantics didn't change, but just like the syntactic conventions that resulted in my sister changing her name when she got married, the nomenclatural rules that have a species name composed of a genus part and a species part result in a new name, Doryteuthis pealeii. In this case the genus Doryteuthis was further sub-divided, with the result being this more complex compound name.
#15: The impact of these nomenclatural changes is not hard to predict. With more than one name referring to the same taxon, a person seeking information about it must utilize both names to retrieve all relevant information. In addition, people who work with this species may not know of, or even agree with, this genus change, such that both names continue being used. Here are two articles published by MBL researchers demonstrating this. It's also easy to imagine these two names being misinterpreted as referring to two different species. John Furfey will provide more details and examples of this in his upcoming talk.
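The effect on recall can be sketched with a toy search. The synonym table and the records below are hypothetical stand-ins for what a taxonomic source and a data system might hold; only the two squid names come from the talk.

```python
# Hypothetical synonym table: the key is the name treated as current, the
# value lists every name that has been applied to the same taxon.
SYNONYMS = {
    "Doryteuthis pealeii": ["Doryteuthis pealeii", "Loligo pealeii"],
}

# Hypothetical data objects, each labeled with whatever name its author used.
RECORDS = [
    {"id": 1, "name": "Loligo pealeii"},
    {"id": 2, "name": "Doryteuthis pealeii"},
    {"id": 3, "name": "Loligo pealeii"},
    {"id": 4, "name": "Octopus vulgaris"},  # unrelated species
]


def search(names):
    """Return the ids of records labeled with any of the given names."""
    wanted = set(names)
    return [rec["id"] for rec in RECORDS if rec["name"] in wanted]


# Searching on one name misses records filed under the other name.
single = search(["Doryteuthis pealeii"])
# Expanding the query with all known names retrieves every relevant record.
expanded = search(SYNONYMS["Doryteuthis pealeii"])

print(single, expanded)
```

In this sketch the single-name query finds one of the three relevant records, while the expanded query finds all three. The expansion only works if something like the synonym table exists and is complete.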
#16: The formal rules of nomenclature, combined with latitude in how people record scientific names, can inflate the number of names one must account for in order to access relevant information related to a species. It's not hard to see how relying on a correctly formed name can easily result in a false negative: a failure to match relevant data or information recorded under a variant of the name.
#17: Let's look at how a change in semantics with no corresponding change in syntax impacts relevance and what this means for scientific use of biological information. Pneumocystis carinii is an opportunistic fungal pathogen that causes an often deadly pneumonia in immune-compromised people. It was originally isolated in rats and dogs.
#18: In 2002, new molecular evidence led to an assertion that the form that infects humans was distinct from the animal form. This led to a splitting of the traditional concept of the taxon and the creation of a new name, Pneumocystis jiroveci.
#19: This did not go unnoticed or unchallenged within the medical research community. It is, however, common and part of the dynamism that is modern taxonomy. Every year, approximately 1% of all scientific names become invalidated, either because a reclassification results in a new name (a syntactic change) or because two or more taxa are merged (a semantic change) and one of the names is no longer used. In this case a species was split, so let's quickly explore the consequences of this.
#20: In the case of Pneumocystis carinii, the original, pre-2002 form consisted of organisms that infect rats, dogs and humans. Following 2002, Pneumocystis carinii only refers to the taxon that infects non-humans. In taxonomy these two different circumscriptions for the same nominal taxon are known as different taxon concepts. The semiotic term is polysemy, or multiple meanings. Please note that, while this occurred serially, it is not uncommon for different taxon concepts to occur in parallel, with supporters for each different circumscription. Putting aside the new taxon P. jiroveci for a moment, let's look at the informatics consequences of polysemy.
#21: Polysemy results in a reduction in precision in information retrieval. The name Pneumocystis carinii, since it is conserved and used identically after the split, is ambiguous. Studies that focus only on the current sense of the taxon cannot rely on the use of the name alone to retrieve relevant data objects. Human-related instances representing false-positive results will dominate. In some cases, where such ambiguity is known, users of these data know to scrutinize at the object level to disambiguate these results. In many cases, however, the underlying taxonomic changes are unknown to the user, and data retrieval and subsequent use do not account for this ambiguity. The result is analyses, inferences and possibly conclusions that are less precise than assumed.
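A toy data set makes the loss of precision concrete. Everything below is illustrative: the records, the host field and the concept labels are hypothetical, and in real data the information needed to tell the concepts apart may not be recorded at all.

```python
# Hypothetical data objects. Records made before the 2002 split use the name
# in its broad sense, so some of them concern the form that infects humans.
RECORDS = [
    {"id": 1, "name": "Pneumocystis carinii", "host": "human", "year": 1995},
    {"id": 2, "name": "Pneumocystis carinii", "host": "human", "year": 1998},
    {"id": 3, "name": "Pneumocystis carinii", "host": "rat", "year": 1999},
    {"id": 4, "name": "Pneumocystis carinii", "host": "rat", "year": 2005},
    {"id": 5, "name": "Pneumocystis jiroveci", "host": "human", "year": 2006},
]


def by_name(name):
    """Retrieve records using the name string alone."""
    return [rec for rec in RECORDS if rec["name"] == name]


def assign_concept(rec):
    """Object-by-object evaluation, here reduced to a check on the host.

    The concept labels are made-up placeholders, not real identifiers.
    """
    return "concept:human-form" if rec["host"] == "human" else "concept:non-human-form"


# A study of the current (non-human) sense searches on the conserved name.
hits = by_name("Pneumocystis carinii")
relevant_hits = [rec for rec in hits if assign_concept(rec) == "concept:non-human-form"]
precision = len(relevant_hits) / len(hits)

print(len(hits), len(relevant_hits), precision)
```

In this sketch the name retrieves four records, only two of which belong to the current sense of the taxon, so precision is one half. Recovering the other half requires evaluating each object individually, which is the step the name alone cannot do.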
#22: It's important to recognize that this impact on precision represents a limit to the use of scientific names in their application to taxa. Conserving the same name results in three distinct taxa and two distinct labels. Recall that identifier stability requires a one-to-one relationship between syntax and semantics. One vision under discussion this week is a global taxonomic clearinghouse that catalogs and uniquely identifies these different semantic views and utilizes an updated set of nomenclatural rules to uniquely identify each concept using terms a biologist might actually adopt.
#23: This slide discusses how, even with the minting of identifiers, ambiguity remains until and unless those identifiers are retrospectively applied to objects recorded in the sense of the earlier merged concept. This requires an object-by-object evaluation.
#24: Wrap with an extreme view of the MANY-TO-MANY relationship between SYNTAX and SEMANTICS. Halichondria in the sense of WORMS/SOEST is the result of grouping a large set of previously described taxa for this cosmopolitan species. Many of these taxa include additional combinations, creating an enormous set of both homotypic and heterotypic synonyms. The net result from a scientific use standpoint is that 1) you need to include all these names when searching heterogeneous data systems in order to ensure high recall, and 2) given the high degree of semantic change, you will have to deal with potentially significant ambiguities in precision. Data objects linked with these names may refer to completely different species today. CAVEAT EMPTOR.