Wednesday, April 18, 2007

SemWeb terms: Controlled vocabulary, Taxonomy, Thesaurus and Ontology


To move freely in the world of the Semantic Web, I'll have to learn the basic terms.

Ontopia has a very good article called Metadata? Thesauri? Taxonomies? Topic Maps!, which does just that and relates the terms to topic maps, which I'll leave aside for now. (Steve Pepper, whom I've seen giving one of his numerous presentations about Topic Maps and semantics, has another article called Ten Theses on Topic Maps and RDF, on the similarities and differences.) Precise definitions of the central words follow (from 2.1. What is metadata?):

We will use the term object here for the entities being organized, as it does not seem appropriate to assume that they will all be documents in the traditional sense of the word.

and in the footnote:

take 'subject' to mean 'any concept in which the user may potentially be interested'.

and then in 3. Subject-based classification I'm getting started:

Subject-based classification is any form of content classification that groups objects by the subjects they are about.

So in the context of a library I could translate this into:

that groups books (objects) by the subjects they are about.

Well, if I thought that was hairy, it gets worse a couple of paragraphs below:

there is a difference between describing the objects being classified and describing the subjects used to classify them. What we will discuss in this section are the different approaches to describing subjects. Metadata describes objects, and one of the ways in which it does that is by connecting objects to the subjects they are about. We will return to this idea below.

The next definition is one of those I came for (in 3.1. Controlled vocabularies):

Controlled vocabulary is a rather broad term, but here we mean by it a closed list of named subjects, which can be used for classification. In library science this is sometimes known as an indexing language. The constituents of a controlled vocabulary are usually known as terms, where a term is a particular name for a particular concept. (This is pretty much the same as the common-sense notion of a keyword.)

This is simple, but in some ways not so simple, because there are different words with the same meaning, called synonyms (as described in Wikipedia):

Synonyms are different words with similar or identical meanings and are interchangeable. Antonyms are words with opposite or nearly opposite meanings (synonym and antonym are antonyms).

In an attempt to reduce the confusion, the word concept is used for the abstract, well, concept, and the word term for a concrete name of a concept; the same concept may have multiple names. It can also be the other way around: the same term can name multiple subjects, which is called a homonym (as described in Wikipedia):

A homonym is one of a group of words that share the same spelling or pronunciation (or both) but have different meanings. Examples are stalk (which can mean either part of a plant or to follow (someone) around), and the trio of words to, too and two.

As a side note, the definition of homonym in Wiktionary is even more precise:

A word that sounds or is spelled the same as another word but has a different meaning. (Homonyms are divided into the two overlapping subcategories homographs and homophones. Examples: die and dye (homophones but not homographs); the fish fluke and fluke, part of the tail of a whale (homophones and homographs); the metal lead and the verb form lead (homographs but not homophones).)

The controlled vocabulary consists of concrete terms, and not directly of abstract concepts, where "concept" means the same as the term "subject" used so far.

The purpose of controlling the vocabulary is to have a common set of definitions that is used both when producing and when consuming.
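
To make the term/concept distinction concrete for myself, here is a minimal Python sketch. The concept ids, the car/automobile pair and the lookup helper are made up for illustration and are not taken from the Ontopia article; only the stalk example comes from the homonym quote above.

```python
# Hypothetical controlled vocabulary: a closed list of concepts,
# each named by one or more terms.
CONCEPTS = {
    "c1": {"car", "automobile"},  # two terms, one concept (synonyms)
    "c2": {"stalk"},              # part of a plant
    "c3": {"stalk"},              # to follow someone around
}


def lookup(term):
    """Return the sorted ids of all concepts that the term names."""
    return sorted(cid for cid, terms in CONCEPTS.items() if term in terms)


print(lookup("car"))         # ['c1']
print(lookup("automobile"))  # ['c1']
print(lookup("stalk"))       # ['c2', 'c3'] (a homonym)
print(lookup("bicycle"))     # [] (not in the closed list)
```

A term that maps to more than one concept is a homonym, and a concept with more than one term has synonyms. A term that maps to nothing is simply not part of the vocabulary.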


The next step in the food chain, eating some more entropy, is a taxonomy, with the following definition in the article:

.. use taxonomy to mean a subject-based classification that arranges the terms in the controlled vocabulary into a hierarchy without doing anything further..

The benefit of this approach is that it allows related terms to be grouped together and categorized in ways that make it easier to find the correct term to use whether for searching or to describe an object.

That is, arrange the soup of words in a hierarchical order. The definition of Taxonomy in Wikipedia:

Taxonomy is the practice and science of classification. The word comes from the Greek τάξις, taxis, 'order' + νόμος, nomos, 'law' or 'science'. Taxonomies, which are composed of taxonomic units known as taxa (singular taxon), are frequently hierarchical in structure, commonly displaying parent-child relationships.

Dinge-linge-ling (bells ringing): in my last post, My shortcut into Taxonomy in W3C XML, I was sort of surprised by the use of the term class, but now I see the logic of it, since I am doing classification, with sub- and superclasses - bingo.
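
A rough sketch of how I picture such a hierarchy, assuming each term has at most one parent. The terms and the helper names are my own invention, not from the article or from the W3C material.

```python
# Hypothetical taxonomy: each term points to its parent (None for the root).
PARENT = {
    "vehicle": None,
    "car": "vehicle",
    "bicycle": "vehicle",
    "sports car": "car",
}


def ancestors(term):
    """Walk up the hierarchy and list every superclass of the term."""
    path = []
    parent = PARENT[term]
    while parent is not None:
        path.append(parent)
        parent = PARENT[parent]
    return path


def children(term):
    """List the direct subclasses of the term, sorted by name."""
    return sorted(t for t, p in PARENT.items() if p == term)


print(ancestors("sports car"))  # ['car', 'vehicle']
print(children("vehicle"))      # ['bicycle', 'car']
```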


A thesaurus is, for a change, defined by two ISO standards describing its structure:

  • [ISO2788] describes monolingual thesauri
  • [ISO5964] is for multilingual thesauri

The article notes, though, that in practice many users extend the structure somewhat, and in some cases the term is applied to structures differing substantially from what is described here. Thesauri take taxonomies and add the ability to make other statements about the subjects:

  • BT: Short for broader term, refers to the term above this one in the hierarchy; that term must have a wider or less specific meaning. One could say that taxonomies as described above are thesauri that only use the BT/NT properties to build a hierarchy, and don't make use of any of the properties described below, so it could be said that every thesaurus contains a taxonomy.
  • SN: Short for scope note.
  • USE: Refers to another term that is to be preferred instead of this term; implies that the terms are synonymous.
  • TT: Short for top term, and refers to the topmost ancestor of this term.
  • RT: Short for "related term", refers to a term that is related to this term, without being a synonym of it or a broader/narrower term.
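
Here is a small Python sketch of how these properties could hang together, using the customary abbreviations BT (broader term), NT (narrower term), SN (scope note), USE (preferred term), TT (top term) and RT (related term). The entries and helper functions are hypothetical, and I assume a single BT per term to keep it short.

```python
# Hypothetical thesaurus entries; the terms are made up for illustration.
THESAURUS = {
    "vehicle": {"SN": "Any means of transport."},
    "car": {"BT": "vehicle", "RT": ["road"]},
    "automobile": {"USE": "car"},  # non-preferred synonym
    "sports car": {"BT": "car"},
    "road": {},
}


def preferred(term):
    """Follow USE references until the preferred term is reached."""
    while "USE" in THESAURUS[term]:
        term = THESAURUS[term]["USE"]
    return term


def top_term(term):
    """Derive TT by following BT links up to the topmost ancestor."""
    term = preferred(term)
    while "BT" in THESAURUS[term]:
        term = THESAURUS[term]["BT"]
    return term


def narrower(term):
    """NT is the inverse of BT: terms whose broader term is this one."""
    return sorted(t for t, props in THESAURUS.items() if props.get("BT") == term)


print(preferred("automobile"))  # car
print(top_term("sports car"))   # vehicle
print(narrower("vehicle"))      # ['car']
```

If only the BT links are kept and everything else is dropped, what remains is the taxonomy from before.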

Faceted classification

I didn't really know about faceted classification, but there's a nice example on Wikipedia:

The most prominent use of faceted classification is in faceted navigation systems that enable a user to navigate information hierarchically, going from a category to its sub-categories, but choosing the order in which the categories are presented. This contrasts with traditional taxonomies in which the hierarchy of categories is fixed and unchanging. For example, a traditional restaurant guide might group restaurants first by location, then by type, price, rating, awards, ambiance, and amenities. In a faceted system, a user might decide first to divide the restaurants by price, and then by location and then by type, while another user could first sort the restaurants by type and then by awards. Thus, faceted navigation, like taxonomic navigation, guides users by showing them available categories (or facets), but does not require them to browse through a hierarchy that may not precisely suit their needs or way of thinking.

and as described in the article:

... works by identifying a number of facets into which the terms are divided. The facets can be thought of as different axes along which documents can be classified, and each facet contains a number of terms. How the terms within each facet are described varies, though in general a thesaurus-like structure is used, and usually a term is only allowed to belong to a single facet ...
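
The restaurant example can be sketched in a few lines of Python. The restaurants, facet values and function names below are invented for illustration; the point is only that the facets can be applied in any order.

```python
# Made-up restaurants, each described by one term per facet.
RESTAURANTS = [
    {"name": "A", "price": "cheap", "location": "north", "type": "pizza"},
    {"name": "B", "price": "cheap", "location": "south", "type": "sushi"},
    {"name": "C", "price": "expensive", "location": "north", "type": "sushi"},
]


def narrow(items, facet, term):
    """Keep only the items whose value for the given facet equals term."""
    return [item for item in items if item[facet] == term]


def names(items):
    """Return just the names, to keep the output readable."""
    return [item["name"] for item in items]


# One user starts with price, then location ...
by_price_first = narrow(narrow(RESTAURANTS, "price", "cheap"), "location", "north")
# ... another starts with location, then price: same result, different route.
by_location_first = narrow(narrow(RESTAURANTS, "location", "north"), "price", "cheap")

print(names(by_price_first))     # ['A']
print(names(by_location_first))  # ['A']
```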


Finally, on to the big fat lady of semantics, ontologies:

... the core meaning within computer science is a model for describing the world that consists of a set of types, properties, and relationship types. Exactly what is provided around this varies, but this is the essentials of an ontology. There is also generally an expectation that there be a close resemblance between the real world and the features of the model in an ontology.

and in Wikipedia:

In both computer science and information science, an ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain.
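
As a very small sketch of "types, properties, and relationship types" plus a bit of reasoning, assuming a toy model that I made up myself (real ontology languages such as OWL are far richer than this):

```python
# Hypothetical mini-ontology: a type hierarchy, typed instances,
# and one relationship type with the types it connects.
SUBTYPE_OF = {"Novel": "Book", "Book": "Document"}
INSTANCE_OF = {"moby_dick": "Novel", "melville": "Person"}
RELATIONSHIP_TYPES = {"wrote": ("Person", "Document")}  # (domain, range)
FACTS = [("melville", "wrote", "moby_dick")]


def is_a(thing, wanted_type):
    """Reason over the type hierarchy: is thing an instance of wanted_type?"""
    current = INSTANCE_OF.get(thing)
    while current is not None:
        if current == wanted_type:
            return True
        current = SUBTYPE_OF.get(current)
    return False


def fact_is_valid(subject, relationship, obj):
    """Check a fact against the domain and range of its relationship type."""
    domain, range_ = RELATIONSHIP_TYPES[relationship]
    return is_a(subject, domain) and is_a(obj, range_)


print(is_a("moby_dick", "Document"))                   # True
print(fact_is_valid(*FACTS[0]))                        # True
print(fact_is_valid("moby_dick", "wrote", "melville")) # False
```

The first call is the kind of reasoning the Wikipedia definition hints at: nobody stated that moby_dick is a Document, but it follows from Novel being a subtype of Book and Book a subtype of Document.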