Wednesday, August 12, 2009

.obo encoding issues

We successfully submitted the HAO to the OBO Foundry last week(!), and we hope to ascend to candidacy after some testing / evaluation / reading / questioning / etc. by the OBO community. If you access our submitted version through SourceForge, you will probably notice some diacritical messiness, mainly this --> �

That's what happens to every letter embellished with a diacritic: each instance of Mikó and Pénzes becomes Mik� and P�nzes. This is an issue because our data are UTF-8 encoded in the database, so the encoding is getting lost somewhere between the database and the .obo file you see.
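For the curious, here is a minimal Python sketch of one common way this symptom arises. It assumes the text gets written out in a single-byte encoding such as Latin-1 and is then read back as UTF-8; that is only a guess at the mechanism, not a diagnosis of our export, and the function names are made up for illustration.

```python
# Hypothetical illustration; these names are not from the HAO codebase.
# Assumption: bytes are produced as Latin-1 but later decoded as UTF-8.

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER, displayed as �


def mangle(text: str) -> str:
    """Encode as Latin-1, then (wrongly) decode the bytes as UTF-8.

    Bytes that are not valid UTF-8 are swapped for U+FFFD.
    """
    return text.encode("latin-1").decode("utf-8", errors="replace")


def round_trip(text: str) -> str:
    """Encode and decode with the same encoding, which preserves the text."""
    return text.encode("utf-8").decode("utf-8")


if __name__ == "__main__":
    for name in ("Mikó", "Pénzes"):
        print(name, "->", mangle(name), "| round trip:", round_trip(name))
```

The point of the sketch is simply that the accented letter survives when both ends agree on UTF-8 and turns into U+FFFD when they don't.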

Why not convert all those references and names to xref IDs? We will, as soon as we know our IDs will be stable. UTF-8 support will remain an important component of our ontology, though, as we go forward. Ultimately we will be attempting to account for ALL commonly used terms in Hymenoptera species descriptions, and many of those are in (accented) French, German, and Spanish: carène intertorulaire, prépectus, écailles, Ringstück, etc.

The "encoding: UTF-8" tag is supposed to be available in the OBO 1.3 spec. I know we aren't the only ones longing for UTF-8 support in OBO-Edit and other tools, so developers out there - take this as a little nudge.
