Sunday, November 29, 2009

profiling and extracting from the JHR

In preparation for an ESA talk this year, we are developing tools to search OCRed text more rapidly for terms (classes) that we might not yet have in the HAO. As a test case, we retrieved all the OCRed text for the Journal of Hymenoptera Research, courtesy of the BHL. Because the BHL cannot yet return OCRed text by article (you get the whole issue of a volume at once), we manually divided the text into articles (we have 344) and then added them to our database (in mx). In the future we hope that the BHL or someone else will provide an API that returns the OCRed text for a single article, so that we can skip the manual handling of this text.



To get a quick visual of the presence of HAO terms in a given article, we added a Google chart (see above).
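
For readers who want a feel for the numbers behind such a chart, here is a minimal sketch in Python of counting how often each HAO term occurs in one article. This is not the mx code: the function name term_presence, the sample sentence, and the short term list are all made up for illustration, and matching is assumed to be case-insensitive on whole words.

```python
import re

def term_presence(article_text, hao_terms):
    """Count whole-word, case-insensitive occurrences of each term.

    Hypothetical helper for illustration; not part of mx.
    """
    text = article_text.lower()
    counts = {}
    for term in hao_terms:
        # \b keeps "wing" from matching inside "wingless".
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        counts[term] = len(re.findall(pattern, text))
    return counts

# Made-up sample text and term list.
sample = "The mesosoma bears the propodeum; the mesosoma is punctate."
counts = term_presence(sample, ["mesosoma", "propodeum", "metasoma"])

# Terms that occur at least once in the article.
present = [term for term, n in counts.items() if n > 0]
```

The counts (or simply the present versus absent split) are the kind of values one could hand to a charting tool.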

The OCRed text is now easily fed to our "proofer" (see part of the form below), which generates a form listing all the unique words (including some neighboring pair combinations) that are not in the HAO. Common words can be excluded beforehand by checking against a managed list. This week the real work begins as we run through these data in search of missing terms. We will profile the ontology before and after this effort to better understand how useful this type of approach is.
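
The filtering step described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the proofer as implemented in mx: the name candidate_terms, the tokenizing rule (runs of letters, optionally hyphenated), and the choice to drop any pair that contains a common word are all assumptions made for the example.

```python
import re

def candidate_terms(ocr_text, known_terms, common_words):
    """Return sorted unique words and adjacent word pairs not yet known.

    Hypothetical helper for illustration; not part of mx.
    """
    known = {t.lower() for t in known_terms}
    common = {w.lower() for w in common_words}

    # Assumed tokenizer: lowercase runs of letters, optionally hyphenated.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", ocr_text.lower())

    candidates = set()

    # Single words that are neither known terms nor common words.
    for word in words:
        if word not in known and word not in common:
            candidates.add(word)

    # Neighboring pairs; a pair containing a common word is skipped.
    for first, second in zip(words, words[1:]):
        if first in common or second in common:
            continue
        pair = first + " " + second
        if pair not in known:
            candidates.add(pair)

    return sorted(candidates)

# Made-up sample text, known terms, and common-word list.
text = "The fore wing has a pterostigma; the hind wing is hyaline."
known = ["fore wing", "pterostigma"]
common = ["the", "has", "a", "is"]
result = candidate_terms(text, known, common)
```

In this sketch the single words "fore" and "wing" are still reported even though "fore wing" is known, and pairs are formed across punctuation; a real proofer would likely want to handle both cases more carefully.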



As always, the full source for all this work is available on the edge branch of mx on Sourceforge.
