Sunday, November 29, 2009

profiling and extracting from the JHR

In preparation for an ESA talk this year we are developing tools to more rapidly search through OCRed text for terms (classes) we might not yet have in the HAO.  As a test case we retrieved all the OCRed text for the Journal of Hymenoptera Research courtesy of the BHL. Because the BHL does not yet have the functionality to return OCRed text by article (you get the whole issue of a volume at once) we manually divided the articles (we have 344) then added them to our database (in mx). In the future we hope that the BHL or someone else will provide an API to return the OCRed text for an article alone, and we can skip the step of manually handling this text.

To get a quick visual of the presence of HAO terms in a given article we added a Google chart (see above).

The OCRed text is now easily fed to our "proofer" (see part of the form below), which throws up a form of all the unique words (including some neighboring pair combinations) not in the HAO. Common words can be excluded beforehand by checking against a managed list. This week the real work begins as we run through these data in search of missing terms. We will profile the before and after of this effort to better understand how useful this type of approach is.
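To make the proofer idea concrete, here is a minimal sketch in Python (not the mx code itself). The function name, the sample sentence, and the tiny term and stop-word lists are all made up for illustration; the real proofer works against the HAO and a managed list of common words.

```python
import re

def candidate_terms(text, known_terms, common_words):
    """Return unique words and adjacent word pairs that are not already known terms."""
    # Letters only (Unicode-aware), lowercased; digits and punctuation are dropped.
    words = re.findall(r"[^\W\d_]+", text.lower())
    known = {t.lower() for t in known_terms}
    common = {w.lower() for w in common_words}

    # Single words that are neither common nor already in the ontology.
    singles = {w for w in words if w not in common and w not in known}

    # Neighboring pairs, skipping any pair that contains a common word.
    pairs = {
        f"{a} {b}"
        for a, b in zip(words, words[1:])
        if a not in common and b not in common and f"{a} {b}" not in known
    }
    return sorted(singles | pairs)

# Hypothetical inputs, for illustration only.
sample = "The gena is ventral to the compound eye and malar sulcus"
found = candidate_terms(
    sample,
    known_terms={"gena", "compound eye"},
    common_words={"the", "is", "to", "and"},
)
print(found)
```

Note that this simple version still reports "compound" and "eye" on their own even though "compound eye" is known; a real tool would probably want to suppress words that only occur inside known multi-word terms.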

As always, the full source for all this work is available on the edge branch of mx on Sourceforge.

Tuesday, November 24, 2009

region, area, headache

A constant and ongoing debate during our Tuesday HAO group discussion is the nature of regions and areas. In the classic approach areas have well defined boundaries, whereas regions have some boundary or compartmentalization that is not well defined. For example a region might be defined in part by a change of color wherein the precise point at which the color changes is not well defined. An example of an area might be the interocellar area, which is bounded by three lines that join the centers of the ocelli. One is very precise, the other is not.

In the context of an ontology you can think of the two as matter and anti-matter. An ontology necessarily defines classes in a very specific way. Region, as we have discussed it, is a thing that has at least some non-defined boundary (areas have well defined material boundaries). Therefore regions are undefined, and may not belong in the ontology.

Our temporary solution is to assume that if something is a region in the "no boundaries/can't define a boundary" sense it doesn't belong in the ontology. We've moved everything previously under region to area, and have begun to rework some classical terms in such a way that they can be explicitly defined. For example look at the new definition for gena, a term which has never (to our knowledge) been adequately defined.

Note that areas can be defined by immaterial anatomical entities (e.g. lines connecting ocelli or planes tangential to some reference line).

Note that this is largely a pragmatic resolution that lets us get on with other things. If you have any feedback or experience in this regard we'd love to hear it.

Thursday, November 19, 2009

more visualization

Turns out it is very easy to use ruby-graphviz. With a handful of lines of code we can get nice dot graphs that we can render in programs like Tulip.
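For readers who have not seen the dot format, here is a rough idea of what such a graph file contains. This is a plain Python sketch that writes DOT text by hand; it does not use or imitate the ruby-graphviz API, and the edges are placeholders rather than real HAO relationships.

```python
def to_dot(edges, graph_name="hao"):
    """Build DOT text from (child, relation, parent) triples."""
    lines = [f"digraph {graph_name} {{"]
    for child, relation, parent in edges:
        # Quote node names so labels with spaces are valid DOT identifiers.
        lines.append(f'  "{child}" -> "{parent}" [label="{relation}"];')
    lines.append("}")
    return "\n".join(lines)

# Placeholder edges, for illustration only.
edges = [
    ("ocellus", "part_of", "head"),
    ("compound eye", "is_a", "eye"),
]
dot_text = to_dot(edges)
print(dot_text)
```

Saving the printed text to a .dot file gives something a dot-aware viewer can open.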

Friday, November 6, 2009

That's (not) highly illogical Jim

Our next update to the HAO has been posted to the OBO Foundry. While our existing versions passed the OBO-Edit verifications we were still generating logical redundancies. For example, if we know that A is_a B and B part_of C then we can deduce A part_of C. In around 70 cases we were additionally asserting this A part_of C statement; these are all now removed. Whether or not these are actually "harmful" is still somewhat unclear. For instance, the other arthropod ontologies all have logical redundancies (Spider: 6, Tick: 2, Mosquito: 1, and Drosophila: 464), and some non-arthropod ontologies have even higher numbers (ZFIN has 2543).
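The redundancy rule is simple enough to sketch. The Python below is an illustration only, not the check we run in mx: it applies just the single rule described above (A is_a B, B part_of C, therefore A part_of C), one step deep, and the function name and toy data are made up.

```python
def redundant_part_of(is_a, part_of):
    """Return asserted (A, C) part_of pairs that already follow from
    A is_a B and B part_of C. Both arguments are sets of (child, parent)."""
    # Index part_of by its left-hand term for quick lookup.
    wholes_by_part = {}
    for part, whole in part_of:
        wholes_by_part.setdefault(part, set()).add(whole)

    redundant = set()
    for a, b in is_a:
        for c in wholes_by_part.get(b, ()):
            if (a, c) in part_of:  # the deduced statement was also asserted
                redundant.add((a, c))
    return redundant

# Toy data: A part_of C is deducible, D part_of C is not.
is_a = {("A", "B")}
part_of = {("B", "C"), ("A", "C"), ("D", "C")}
result = redundant_part_of(is_a, part_of)
print(result)
```

A fuller check would also follow chains of is_a statements rather than a single step.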

The code to generate the Newick trees seen in the previous post is now live as well. We've made a simple form (see below) that allows users to customize and output trees. Several different node and "clade" highlighting options are available and we plan to expand these as well.

Below is the result of the tree generated above. It shows a categorical summary (increasing number with color darkness) of the number of immediate part_of children.

Friday, October 23, 2009

More visualization experimenting

We're working on some figures for the HAO announce paper that should be submitted any day now (tm). This is an early attempt at visualizing the complexity within the HAO. It is the is_a graph spit out from mx in Newick format with randomly chosen colors assigned from a ColorBrewer palette. Rendered in Figtree.
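If you are curious what "the is_a graph in Newick format" looks like, here is a small Python sketch (not the mx exporter). It assumes every term has a single is_a parent, so the graph is a tree; the function name and the three example terms are illustrative only.

```python
def to_newick(children, root):
    """Serialize a parent -> [children] mapping as a Newick string with
    labeled internal nodes. Assumes a tree (one parent per term)."""
    def render(node):
        kids = sorted(children.get(node, []))
        label = node.replace(" ", "_")  # unquoted Newick labels cannot contain spaces
        if not kids:
            return label
        return "(" + ",".join(render(kid) for kid in kids) + ")" + label
    return render(root) + ";"

# Illustrative terms only, not an excerpt of the HAO.
children = {
    "anatomical_entity": ["area", "sclerite"],
    "area": ["gena"],
}
newick = to_newick(children, "anatomical_entity")
print(newick)
```

Coloring and highlighting are left to the viewer; the string itself only carries the tree shape and the labels.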

Wednesday, September 23, 2009

visualizing the HAO

A sneak preview of a HAO and protovis mashup. This is the "is_a" tree.

Friday, September 18, 2009


A short update on the status of the HAO - Our recent progress has largely been technical, with some major improvements to how we manage getting new versions of the HAO out of mx. These changes allow us to more confidently generate new versions, with less fear of there being logical problems with the output. In addition, a new version of the HAO has been committed and should appear shortly. Thanks to those of you who caught problems with it and notified us of them.

Another important update includes handling "obsolete" tags, which are required when we make a term that has an HAO id a synonym of another term. We're already confronted with questions like these - How do you determine which term is the right one? Why is this term obsolete? - from our colleagues.

Synonyms and obsolete status, from our perspective at least, are not quite as big a deal as our users might feel, largely because they are meant to indicate a particular status to parsers reading the ontology (i.e., machines) rather than acting as decisions or recommendations to our users. The HAO has to be a logically consistent entity that is largely focused on defining and capturing anatomical classes. The particular labels we give to these classes are not necessarily as important as how we define the class. I.e., if a bunch of hymenopterists can agree on what their colleague means by "that little pointy bit over there," then well over half the battle is won. By obsoleting or synonymizing a term in the HAO we are saying that two or more classes in the ontology are really the same thing, i.e. "that little pointy bit over there". It just happens that we need at least one label for a given class. This is often called the "preferred term" in ontology-speak, but we have to be careful not to worry about this semantic. We're not saying "You, Hymenoptera taxonomist, don't use that word!" We're saying if you use that word, we're going to redirect you to more information about the "little pointy bit" others are talking about.
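In code terms the "redirect" idea amounts to many labels pointing at one class. The Python sketch below is only an illustration of that idea; the id, labels, and data layout are invented and do not reflect how mx or the OBO file store synonyms.

```python
# One class, identified by a made-up id, with its definition and preferred label.
classes = {
    "EX:0000001": {
        "preferred": "term A",
        "definition": "that little pointy bit over there",
    },
}

# Every label, preferred or synonym, points at the same class id.
label_index = {
    "term a": "EX:0000001",
    "term b": "EX:0000001",  # a synonym
}

def lookup(label):
    """Return (class id, class record) for any known label, else None."""
    class_id = label_index.get(label.strip().lower())
    if class_id is None:
        return None
    return class_id, classes[class_id]

print(lookup("Term B"))
```

Whichever word a taxonomist types, they land on the same definition, which is the point.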

Several other cosmetic and workflow-related changes have been made to mx. We've made it easier to navigate back and forth across synonym tags. We've also formalized the Tag object in anticipation of creating links to external resources, including data in the BHL and HOL digital literature databases.

Saturday, August 29, 2009

Australia, taxonomy, and integration

I recently traveled to Canberra, Australia thanks to an invite from John La Salle at CSIRO. John is one of the driving forces behind the Atlas of Living Australia, an initiative recently funded on the scale of EOL. While there I attended a couple of short workshops on phenomics and biosecurity and presented talks on the HAO. One recurring topic was how ontologies could be used to speed taxonomy. There are obvious applications, for instance our text markup proofing tool, but clearly a lot more is possible.

La Salle, for example, is pushing the idea of automated character recognition (PDF). Imagine an anatomy ontology, such as the HAO, that has associated with it a large number of annotations on images - e.g., that light micrograph of an ant head has polygons and/or point markers superimposed over it that indicate which structures are the compound eyes, ocelli, face, antennae, etc. These core data could act as the basis for algorithmic recognition on images of other ants. While this functionality is largely science fiction at the moment, it's possible that the technology will be folded into the taxonomic workflow within my lifetime. We will spend a lot of time in the next three years thinking about this and other ways the HAO can be used to address the taxonomic impediment.

Integrating the HAO with taxonomy may be somewhat abstract, but by virtue of being an OBO format file available to others the HAO is already being integrated into the ontological world (semantic web?). Richard Cole wrote to us to point out that the HAO is now visible via the Ontology Lookup Service. This service provides a nice graph to visualize and navigate the ontology. If you have other useful or cool applications that use OBO ontologies let us know and we'll point to them.

Thursday, August 20, 2009

Come work with us!

Undergraduate Position Available Immediately for Fall Semester

The Hymenoptera Anatomy Ontology (HAO) project is looking for a student interested in learning biodiversity informatics, library science, entomology, imaging and scientific illustrating techniques, modern museum studies and Web design.

Hours are flexible, but will be between 8am-5pm.
Salary: $8.50 per hour (10-15 hours a week)

Responsibilities may include:
* extraction of data from historical/scientific texts (primary responsibility)
* testing and design of Web-based tools for outreach (i.e., educating non-scientists about insects, insect anatomy, and biodiversity informatics)
* development of visual/semantic Web interfaces for novel rapidly growing dataset
* learning light microscopy imaging techniques
* learning and illustrating Hymenoptera anatomy (i.e., what the body parts of ants, bees, wasps, and sawflies are called)

More information about the HAO project is available:

Contact: Katja Seltmann: katja_seltmann (at) ncsu (dot) edu; Room 3212 Gardner Hall; 5-2833

Wednesday, August 12, 2009

.obo encoding issues

We successfully submitted the HAO to the OBO Foundry last week(!), and we hope to ascend to candidacy after some testing / evaluation / reading / questioning / etc. by the OBO community. If you access our submitted version through SourceForge here you will probably notice some diacritical messiness, mainly this --> �

That's what happens to each letter embellished with diacritics. Each instance of Mikó and Pénzes becomes Mik� and P�nzes. This is an issue because our data are UTF-8 encoded in the database.
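We have not pinned down where in the pipeline the characters get lost, so treat the following as one plausible mechanism rather than a diagnosis. A single � per accented letter is what you get when text written out in a single-byte encoding (Latin-1 is assumed here) is later read as UTF-8. A short Python sketch, with a made-up function name:

```python
def misread_as_utf8(text, actual_encoding="latin-1"):
    """Encode text in a single-byte encoding, then decode it as if it were UTF-8.
    Bytes that are not valid UTF-8 become U+FFFD, the replacement character."""
    return text.encode(actual_encoding).decode("utf-8", errors="replace")

garbled = [misread_as_utf8(name) for name in ("Mikó", "Pénzes")]
print(garbled)

# When writer and reader agree on UTF-8, the accents survive the round trip.
round_trip = "Mikó".encode("utf-8").decode("utf-8")
print(round_trip)
```

If this is what is happening, the fix is to make sure the exported file and whatever reads it agree on the encoding.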

Why not convert all those references and names to xref IDs? We will, as soon as we know our IDs will be stable. UTF-8 support will remain an important component of our ontology, though, as we go forward. Ultimately we will be attempting to account for ALL commonly used terms in Hymenoptera species descriptions, and many of those are in (accented) French, German, and Spanish: carène intertorulaire, prépectus, écailles, Ringstück, etc.

The "encoding: UTF-8" tag is supposed to be available in the OBO 1.3 spec. I know we aren't the only ones longing for UTF-8 support in OBO-Edit and other tools, so developers out there - take this as a little nudge.

Friday, August 7, 2009

iterating through the low hanging fruit

Perhaps a little behind schedule with the blog post, but we've been busy!

After several weeks of concentrated editing the first "version" of the HAO is in the hands of the folks at the OBO Foundry, to be included as a candidate (it's here). It was interesting to learn during this process (not process sensu HAO:0000822, but process sensu evo-devo) that no ontology is actually included in the Foundry; they are all candidates.

The HAO is indeed a candidate in many senses of the word. This first effort is largely to get the editing team comfortable with the steps it takes to release versions of the HAO, and to get the basic skeleton logic into the hands of those who can start to provide us feedback. That said, we feel pretty good about this initial effort, even though we have perhaps been gathering the low hanging fruit. We have a full-fledged ontology output from a web-based application (albeit with a hack or two here), and around 90% of the terms contain definitions in human-written genus-differentia format. We've also generated HAO ids for over 1000 terms, which is an important first step towards allowing others to reference fixed points in the ontology in meaningful ways. Perhaps most importantly, we have a product that people can begin to provide critical feedback on, like "where's the nervous system" (our first comment from a non-project member, it's not in there...yet). We're depending on this feedback, both from experts on ontologies in the broader sense, and from morphologists with much more experience than us.

Along with work on the HAO itself has come some feature development for handling the ontology. We're using tags to comment on and annotate the HAO. Tags in mx contain a keyword, an optional pointer to a reference, and an optional comment or "value". To make tags more useful on a day to day basis we hacked up a tag browser (see above) which lets us quickly return sets and then navigate to the results.

We also generated a quick tree viewer to browse through the ontology. Watch for a public version of the viewer to appear on the glossary in the following months. The tree gives us context, allows us to quickly edit the definitions, and we can drag terms to add relationships.

Monday, July 20, 2009

Towards HAO 0.1

One central premise we're following is to release early, release often. Given this, we're planning to release a very early draft of the ontology within the next week or two. We hope that releasing an early draft will drive feedback early on in the project's lifespan. The first releases will undoubtedly contain errors and glaring omissions, but they will also let people start to use the real meat of our project, the ontology itself. The first draft will only include two relationship types (is_a, part_of). As the project continues we'll consider and adopt others (has_part?). There is much talk of integrating spatial relationships (in_contact_with, adjacent_to), but these will have to be carefully reviewed.
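For those unfamiliar with OBO flat files, the two relationship types show up as an is_a tag and a relationship: part_of tag inside a [Term] stanza. The Python sketch below builds one such stanza; it is not the mx exporter, and the function name, ids, name, and definition are placeholders.

```python
def obo_term(term_id, name, definition, is_a=(), part_of=()):
    """Build a minimal OBO [Term] stanza using only is_a and part_of."""
    lines = [
        "[Term]",
        f"id: {term_id}",
        f"name: {name}",
        f'def: "{definition}" []',  # empty dbxref list after the quoted text
    ]
    lines += [f"is_a: {parent}" for parent in is_a]
    lines += [f"relationship: part_of {whole}" for whole in part_of]
    return "\n".join(lines)

# Placeholder ids and text, for illustration only.
stanza = obo_term(
    "EX:0000002",
    "example term",
    "An example definition.",
    is_a=["EX:0000001"],
    part_of=["EX:0000003"],
)
print(stanza)
```

Additional relationship types, if adopted later, would simply appear as further relationship: lines.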

To accommodate the editing, review, and tweaking of the ontology we're also starting to ramp up the patches to mx, the underlying editor that we're using to build the ontology. These changes are posted immediately to our SVN repository on Sourceforge (yes, we're considering a move to git). In the future, we plan to split the ontology code out of mx and turn it into a gem (plugin) that any Rails project can easily use.

Curation of the ontology is currently focused on updating all the present definitions to a genus-differentia style, and cross referencing terms to other ontologies. One stumbling point we're hitting is what to do with accented characters, since the OBO specs don't presently permit them.

Wednesday, July 8, 2009

The HAO team assembles...

The Hymenoptera Anatomy Ontology team continues to grow, with the recent arrivals of Katja Seltmann (2nd from left) from Morphbank and Matt Yoder (far right) from the Platygastroidea PBI. Katja and Matt join István Mikó (far left) from the NCSU Insect Museum, who's been helping me (Andy Deans; 3rd from left) sort out confusing character complexes for the last six months. We now have the critical mass needed to ramp up development of the HAO and associated tools.

In that regard we've set up several outlets for discussion and news reporting, should you be interested:
  1. the HAO Project is on Twitter
  2. you've already found our blog
  3. we post(ed) discussion items (and our proposal) on the HAO wiki
  4. the ontology can be perused on the Hymenoptera glossary website
The HAO currently has 2,408 classes (i.e., anatomical terms) and 2,537 relationships. Our first order of business is to clean up a few terms (e.g., converting a few straggling definitions to genus-differentia) and then deposit the ontology at the OBO Foundry. We'll be growing again in August, with the arrival of a new grad student, and then again in October. Until then watch for frequent updates!