Sunday, November 29, 2009

profiling and extracting from the JHR

In preparation for an ESA talk this year we are developing tools to search OCRed text more rapidly for terms (classes) we might not yet have in the HAO. As a test case we retrieved all the OCRed text for the Journal of Hymenoptera Research, courtesy of the BHL. Because the BHL cannot yet return OCRed text by article (you get the whole issue of a volume at once), we manually divided the text into articles (we have 344) and then added them to our database (in mx). In the future we hope that the BHL or someone else will provide an API that returns the OCRed text for a single article, so that we can skip the manual step.



To get a quick visual of which existing HAO terms occur in a given article we added a Google chart (see above).

The OCRed text is now easily fed to our "proofer" (see part of the form below), which generates a form listing all the unique words (including some neighboring pair combinations) not in the HAO. Common words can be excluded beforehand by checking against a managed list. This week the real work begins as we run through these data in search of missing terms. We will profile the HAO before and after this effort to better understand how useful this type of approach is.
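The filtering step can be sketched roughly as follows. This is not the mx proofer itself, only a minimal illustration of the idea. The term set, the common-word list, and the method names (`tokenize`, `candidate_terms`) are all hypothetical stand-ins.

```ruby
require "set"

# Hypothetical stand-ins: a few known HAO labels and a managed list of common words.
KNOWN_TERMS  = Set["gena", "ocellus", "interocellar area"].freeze
COMMON_WORDS = Set["the", "is", "of", "and", "by", "a"].freeze

# Lowercase the text and split it into alphabetic tokens.
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# Return the sorted unique words and neighboring word pairs that are
# neither common words nor already-known terms.
def candidate_terms(text, known: KNOWN_TERMS, common: COMMON_WORDS)
  tokens = tokenize(text)
  words  = tokens.reject { |w| common.include?(w) }
  pairs  = tokens.each_cons(2)
                 .reject { |a, b| common.include?(a) || common.include?(b) }
                 .map { |a, b| "#{a} #{b}" }
  (words + pairs).uniq.reject { |t| known.include?(t) }.sort
end

puts candidate_terms("The interocellar area is bounded by the lateral ocellus and the gena.")
```

Note that in this simple version the single words "interocellar" and "area" are still reported even though the pair "interocellar area" is known. A real tool would probably want to suppress words that only occur inside a matched term.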



As always, the full source for all this work is available on the edge branch of mx on Sourceforge.

Tuesday, November 24, 2009

region, area, headache

A constant and ongoing debate during our Tuesday HAO group discussion is the nature of regions and areas. In the classic approach areas have well defined boundaries, whereas regions have some boundary or compartmentalization that is not well defined. For example a region might be defined in part by a change of color wherein the precise point at which the color changes is not well defined. An example of an area might be the interocellar area, which is bounded by three lines that join the centers of the ocelli. The area is very precise; the region is not.

In the context of an ontology you can think of the two as matter and anti-matter. An ontology necessarily defines classes in a very specific way. Region, as we have discussed it, is a thing that has at least some non-defined boundary (areas have well defined material boundaries). Therefore regions are undefined, and may not belong in the ontology.

Our temporary solution is to assume that if something is a region in the "no boundaries/can't define a boundary" sense it doesn't belong in the ontology. We've moved everything previously under region to area, and have begun to rework some classical terms in such a way that they can be explicitly defined. For example, look at the new definition for gena, a term which has never (to our knowledge) been adequately defined.

Note that areas can be defined by immaterial anatomical entities (e.g. lines connecting ocelli or planes tangential to some reference line).

Note that this is largely a pragmatic resolution that lets us get on with other things. If you have any feedback or experience in this regard we'd love to hear it.

Thursday, November 19, 2009

more visualization


Turns out it is very easy to use ruby-graphviz. With a handful of lines of code we can get nice dot graphs that we can render in programs like Tulip.
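For readers who want to see what such a dot graph looks like, here is a minimal sketch that writes the DOT text by hand instead of going through ruby-graphviz, so it makes no assumptions about that gem's API. The edge list and the `to_dot` method are hypothetical and are not taken from mx.

```ruby
# Hypothetical edges: [child, relationship, parent].
EDGES = [
  ["gena", "part_of", "head"],
  ["ocellus", "part_of", "head"],
  ["lateral ocellus", "is_a", "ocellus"]
].freeze

# Build a DOT digraph; each edge is labeled with its relationship.
def to_dot(edges, name: "hao")
  lines = ["digraph #{name} {"]
  edges.each do |child, rel, parent|
    lines << %(  "#{child}" -> "#{parent}" [label="#{rel}"];)
  end
  lines << "}"
  lines.join("\n")
end

puts to_dot(EDGES)
```

The printed text can be saved to a .dot file and opened in any program that reads the DOT format.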

Friday, November 6, 2009

That's (not) highly illogical Jim

Our next update to the HAO has been posted to the OBO Foundry. While our existing versions passed the OBO-Edit verifications we were still generating logical redundancies. For example, if we know that A is_a B and B part_of C then we can deduce A part_of C. In around 70 cases we were additionally asserting this A part_of C statement; these are all now removed. Whether or not these are actually "harmful" is still somewhat unclear. For instance, the other arthropod ontologies all have logical redundancies (Spider: 6, Tick: 2, Mosquito: 1, and Drosophila: 464), and some non-arthropod ontologies have even higher numbers (ZFIN has 2543).
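The redundancy rule above can be checked mechanically. The following is a small sketch and not the check we actually ran. It flags an asserted part_of statement when the same statement already follows from an is_a ancestor. The data and the method names (`is_a_ancestors`, `redundant_part_of`) are hypothetical.

```ruby
require "set"

# Hypothetical assertions as [subject, object] pairs.
IS_A    = [["A", "B"]].freeze
PART_OF = [["B", "C"], ["A", "C"], ["A", "D"]].freeze

# All classes reachable from +term+ by following is_a links upward.
def is_a_ancestors(term, is_a)
  found = Set.new
  queue = [term]
  until queue.empty?
    current = queue.shift
    is_a.each do |child, parent|
      next unless child == current
      queue << parent if found.add?(parent)
    end
  end
  found
end

# Asserted part_of statements that already follow from
# "X is_a Y" plus "Y part_of Z".
def redundant_part_of(is_a, part_of)
  asserted = part_of.to_set
  part_of.select do |subject, whole|
    is_a_ancestors(subject, is_a).any? { |anc| asserted.include?([anc, whole]) }
  end
end

p redundant_part_of(IS_A, PART_OF) # => [["A", "C"]]
```

Only this one inference pattern is covered. A full reasoner would also chain part_of transitively and handle other relationship types.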

The code to generate the Newick trees seen in the previous post is now live as well. We've made a simple form (see below) that allows users to customize and output trees. Several different node and "clade" highlighting options are available and we plan to expand these as well.

Below is the tree produced by the form above. It shows a categorical summary of the number of immediate part_of children for each node, with darker colors indicating higher numbers.
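As a rough illustration of the two ingredients involved, a Newick string and a per-node count of immediate part_of children, here is a minimal sketch. It is not the mx code. The hierarchy, labels, and method names (`newick`, `child_counts`) are hypothetical.

```ruby
# Hypothetical part_of hierarchy: parent => immediate part_of children.
PARTS = {
  "head"    => ["gena", "ocellus", "antenna"],
  "antenna" => ["scape", "pedicel"]
}.freeze

# Serialize the subtree below +term+ in Newick format (no trailing ";").
def newick(term, parts)
  label    = term.tr(" ", "_") # unquoted Newick labels cannot contain spaces
  children = parts.fetch(term, [])
  return label if children.empty?
  "(#{children.map { |c| newick(c, parts) }.join(',')})#{label}"
end

# Number of immediate part_of children for every term in the tree.
def child_counts(root, parts)
  counts = { root => parts.fetch(root, []).size }
  parts.fetch(root, []).each { |c| counts.merge!(child_counts(c, parts)) }
  counts
end

puts "#{newick('head', PARTS)};" # prints (gena,ocellus,(scape,pedicel)antenna)head;
p child_counts("head", PARTS)
```

The counts could then be binned into a few categories and mapped to progressively darker colors when the tree is drawn.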