Saturday, January 30, 2010

Class, label, sensu

In the past couple of weeks we've been actively debating what concepts like synonyms and homonyms mean in the context of the HAO. In particular, how do they relate to our concepts of classes (the "real" things at the core of our ontology) and the labels we use to reference those classes? We had been using tags to indicate synonyms, but our tag data model wasn't as elegant as we needed it to be: too many hacks were being added to the code base to perform various calculations on our data. After much discussion and debate about the meaning of our various tags, we realized that we could use a new, simple model that nicely encapsulates what we wanted to capture with respect to synonyms and homonyms, classes and labels. We've called it a "Sensu".

The sensu model is simply a pointer to a label, a class, and a reference (a table with three columns). It states that so-and-so (a reference) used a label (basiparamere) for a class (the sclerite that is connected proximally with the cupula, distally with the harpe, ventrolaterally with the parossiculus). The Sensu model provides the basic functionality of linking labels to classes. In addition, from this simple table we can derive nearly everything we wanted in the past with respect to synonymy and homonymy (and the "acts" of these). For example, if two sensus share the same class but have different labels, then those labels are synonyms, and if two sensus share the same label but have different classes, then that label is homonymous. If one person (reference) used two labels for the same class, then that person performed the "act" of synonymy; if they used the same label for two classes, then they indicated homonymy.
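To make the derivations above concrete, here is a minimal sketch in Python. It is not our actual implementation: the names `Sensu`, `group`, `synonyms`, `homonyms`, `acts_of_synonymy` and `acts_of_homonymy` are made up for illustration, and the class identifiers, references and `label-x` in the sample rows are placeholders rather than real HAO data.

```python
from collections import defaultdict
from typing import NamedTuple


class Sensu(NamedTuple):
    """One row of the three-column table: a reference used a label for a class."""
    label: str
    cls: str        # class identifier ("class" is a reserved word in Python)
    reference: str


def group(sensus, key, value):
    """Collect the values seen with each key; keep only keys with more than one value."""
    seen = defaultdict(set)
    for s in sensus:
        seen[key(s)].add(value(s))
    return {k: v for k, v in seen.items() if len(v) > 1}


def synonyms(sensus):
    # Same class, different labels: those labels are synonyms.
    return group(sensus, lambda s: s.cls, lambda s: s.label)


def homonyms(sensus):
    # Same label, different classes: that label is homonymous.
    return group(sensus, lambda s: s.label, lambda s: s.cls)


def acts_of_synonymy(sensus):
    # One reference used several labels for the same class.
    return group(sensus, lambda s: (s.reference, s.cls), lambda s: s.label)


def acts_of_homonymy(sensus):
    # One reference used the same label for several classes.
    return group(sensus, lambda s: (s.reference, s.label), lambda s: s.cls)


# Placeholder rows, for illustration only.
SENSUS = [
    Sensu("basiparamere", "class-1", "ref-A"),
    Sensu("label-x", "class-1", "ref-A"),
    Sensu("label-x", "class-1", "ref-B"),
    Sensu("process", "class-2", "ref-B"),
    Sensu("process", "class-3", "ref-C"),
]

if __name__ == "__main__":
    print(synonyms(SENSUS))
    print(homonyms(SENSUS))
    print(acts_of_synonymy(SENSUS))
    print(acts_of_homonymy(SENSUS))
```

In this sample, "basiparamere" and "label-x" come out as synonyms, "process" comes out as a homonym, and only ref-A performed an act of synonymy. Nobody performed an act of homonymy, because the two uses of "process" come from different references.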

There are some really nice consequences of this approach. For instance, we don't have to specifically identify a "preferred" label or a senior synonym; if we want, we can calculate these based on some arbitrary metric (e.g. first usage, most used, most voted for, etc.). As long as people can reference the class (via some XML or similar markup), they can use whatever label (including labels in other languages) they want for that class. We can also algorithmically detect all cases of synonymy and homonymy without specifically looking for them. That is, if someone discovers that so-and-so used a label for a class, they need not be intending to synonymize that label, but we can still calculate that a synonym is implied.
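As a rough illustration of calculating a "preferred" label from a metric, the sketch below picks the label used by the most distinct references for a class (the "most used" metric). The function name `most_used_label`, the alphabetical tie-break, and the sample rows are all assumptions made for this example, not part of our code base.

```python
from collections import defaultdict

# Placeholder (label, class, reference) rows, for illustration only.
SENSUS = [
    ("label-x", "class-1", "ref-A"),
    ("label-y", "class-1", "ref-A"),
    ("label-y", "class-1", "ref-B"),
    ("label-y", "class-1", "ref-C"),
]


def most_used_label(sensus, cls):
    """Return the label used for `cls` by the most distinct references, or None."""
    refs_by_label = defaultdict(set)
    for label, c, reference in sensus:
        if c == cls:
            refs_by_label[label].add(reference)
    if not refs_by_label:
        return None
    # Most references first; ties are broken alphabetically so the result is deterministic.
    return min(refs_by_label, key=lambda label: (-len(refs_by_label[label]), label))


if __name__ == "__main__":
    print(most_used_label(SENSUS, "class-1"))
```

Swapping in another metric (first usage, votes) would only mean changing the sort key, although a metric like first usage would need extra information, such as a date on each reference, that the three-column table above does not carry by itself.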

In moving to this model we've also realized what appears to be a shortcoming of the OBO format, in which a singular class-label construct is the core unit (synonyms can be captured, but each class must have a unique label). The more we work on this project, the more we doubt that we can (or should) force labels to have just one meaning (for instance, the classic example "process" is both a time-based and a morphological concept), and we further doubt that this is necessary for us to do some really cool things with the ontology.