W3C HCLSig - Subgroup on Text: Unstructured => Structured (T2S)

Draft of Proposal

Started 26 January 2006 by Bob Futrelle

This is to be edited into shape by Bob Futrelle (Northeastern U.) and Matt Cockerill (BioMed Central). As of Fri Jan 27 00:17:17 EST 2006, Matt has not seen this. (He's only now arriving back in London.)

This report is a proposal for action items and deliverables in the T2S area. Here are the main points, as developed by the group and presented at the meeting.

Main points in brief

  1. OBTAIN: Begin with unstructured and semi-structured text that is available.

  2. GENERATE: Starting with the text obtained, generate further structure, e.g., extract entities such as names of proteins, genes, and compounds.

  3. CREATE STRUCTURE: Transform to a strong structure, most prominently, RDF/OWL.

  4. AGREEMENT: Agreement is needed on the form and semantics of the target structures to avoid Babel.

  5. EXPOSE TO USERS: The results must be given to users in a readable form that allows queries, retrieval, edits.

  6. TOOLS: Powering the steps above.

Deliverables and completion times could be:

  1. Elaborate and publish this T2S roadmap. Week 2.

  2. Review existing work on all the above topics. Collect data, tools, and use cases. Month 2.

  3. Critique the design and functionality of the collected materials and systems. Month 4.

  4. Develop best practices document based on critiques. Month 6.

  5. Design and implement demonstrations of prototype system(s) and cases, based on best practices. Month 15.

  6. Submit to broader user community for initial use and reactions. Month 24.

Elaboration of main points

  1. OBTAIN: Begin with unstructured and semi-structured text that is available. The point here is to identify and describe the various sources of text to be given structure. This could include fully flat text (PubMed abstracts), as well as weakly marked up text (HTML, simple XML). Corpora include full-length papers, e.g., BioMed Central (Open Access).

  2. GENERATE: Starting with the text obtained, generate further structure, e.g., extract entities such as names of proteins, genes, and compounds. This is a mini-industry today, so it will not be hard to document. The most extreme structure could be full parsing of text. Semantic markup is generated by some systems. Multi-dimensional markup is a flexible representation system allowing multiple views and multiple levels of analysis.

  3. CREATE STRUCTURE: Transform to a strong structure, most prominently, RDF/OWL. This has not been a prominent end target for natural language text. It will require new thinking. The goals of such structure creation will be the controlling element here. We must be clear on what should/can be accomplished and why we would want to do so.

  4. AGREEMENT: Agreement is needed on the form and semantics of the target structures to avoid Babel. Any agreement must be based on a broad view of what exists in the community as well as what the community might be willing to move to.

  5. EXPOSE TO USERS: The results must be given to users in a readable form that allows queries, retrieval, edits. Both thin and thick clients must be considered.

  6. TOOLS: Powering the steps above.
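
To make steps 2 and 3 concrete, here is a minimal sketch in Python of one possible pipeline: find entity names in flat text, then emit RDF statements in N-Triples form. It is an illustration only, not a proposed design. The lexicon, the example.org namespace, the "mentions" property, and the function names are all assumptions made for this sketch; a real system would use a proper entity recognizer and a vocabulary agreed on under step 4.

```python
import re

# Hypothetical lexicon (an assumption): surface form -> entity type.
LEXICON = {
    "p53": "Protein",
    "BRCA1": "Gene",
    "aspirin": "Compound",
}

# Placeholder namespace (an assumption), not an agreed vocabulary.
BASE = "http://example.org/t2s/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def extract_entities(text):
    """Step 2 (GENERATE): return (term, type, offset) for each lexicon hit."""
    found = []
    for term, etype in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text):
            found.append((term, etype, match.start()))
    # Order hits by their position in the text.
    return sorted(found, key=lambda hit: hit[2])


def to_ntriples(doc_id, entities):
    """Step 3 (CREATE STRUCTURE): serialize the hits as N-Triples lines."""
    lines = []
    doc = f"<{BASE}doc/{doc_id}>"
    for term, etype, _offset in entities:
        entity = f"<{BASE}entity/{term}>"
        lines.append(f"{doc} <{BASE}mentions> {entity} .")
        lines.append(f"{entity} <{RDF_TYPE}> <{BASE}{etype}> .")
    return lines


if __name__ == "__main__":
    sentence = "BRCA1 interacts with p53."
    for line in to_ntriples("abs1", extract_entities(sentence)):
        print(line)
```

Even this toy version shows why step 4 matters: the choice of namespace, property names, and entity types is arbitrary here, and two groups making different choices would produce structures that do not interoperate.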

Elaboration of deliverables and completion times

  1. Elaborate and publish this T2S roadmap. Week 2. Even at this early stage, we need to add references to existing concepts and systems to ground the document.

  2. Review existing work on all the above topics. Collect data, tools, and use cases. Month 2. We have the tools to identify existing work. The information could be brought together in a Wiki and/or website.

  3. Critique the design and functionality of the collected materials and systems. Month 4. This requires active experimentation that brings together data, systems, and usage scenarios. The critiques will form a useful report (a deliverable).

  4. Develop best practices document based on critiques (a deliverable). Month 6. This will have creative components, because we may decide that none of the existing approaches can meet goals and challenges that we feel must be met.

  5. Design and implement demonstrations of prototype system(s) and cases, based on best practices (deliverables). Month 15. The emphasis shifts to the subset of implementors. It could involve as little as showcasing the best systems we find or as much as developing prototypes. Given the short time frame, strategies such as modifying or creating plugins for existing systems would be all we could reasonably hope to do.

  6. Submit to broader user community for initial use and reactions. Month 24. Their reactions, and our review of those reactions, will allow us to outline possible future directions after this two-year endpoint.

Comments

Various communities must be kept informed of this work in all of its aspects and at all stages. Public exposure, including postings, Wikis, web sites, talks, and papers can all be used.

