
Semantic Annotation for the Semantic Web

The semantic web is a mesh of information linked up in such a way as to be easily processable by machines, on a global scale. A characteristic of semantic web content is that it is annotated in a machine-understandable fashion. In our opinion, three properties make an annotation machine-understandable: being explicit makes it publicly accessible, being formal makes it publicly agreeable, and being unambiguous makes it publicly identifiable. The process of upgrading existing web pages into machine-understandable semantic web pages is called web semantic annotation.

Currently, web semantic annotation is my major research interest. There are also some other research projects in which I have been or am currently involved.
Semantic Annotation


A typical semantic annotation process includes three components. First, an ontology describes the domain of interest. Second, a data instance recognition process discovers all instances of interest in target web documents based on the defined ontology. Third, an annotation generation process creates a semantic meaning disclosure file for each annotated document. Through the semantic meaning disclosure file, any ontology-aware machine agent can understand the target document.
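
The three components described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual system: the toy ontology, its regular-expression recognizers, the JSON output layout, and the example URL are all hypothetical.

```python
import json
import re

# Component 1 (hypothetical toy ontology): each concept maps to a regular
# expression that recognizes its instances. Real ontologies are far richer.
ONTOLOGY = {
    "Price": r"\$\d+(?:,\d{3})*(?:\.\d{2})?",
    "Year": r"\b(?:19|20)\d{2}\b",
}

def recognize_instances(text, ontology):
    """Component 2: find every instance of each ontology concept in the text."""
    instances = []
    for concept, pattern in ontology.items():
        for match in re.finditer(pattern, text):
            instances.append({
                "concept": concept,
                "value": match.group(0),
                "start": match.start(),
                "end": match.end(),
            })
    # Report instances in document order.
    return sorted(instances, key=lambda inst: inst["start"])

def generate_annotation(doc_url, instances):
    """Component 3: serialize the instances as a stand-alone annotation file."""
    return json.dumps({"document": doc_url, "annotations": instances}, indent=2)

page_text = "2003 Honda Accord, asking $8,500"
found = recognize_instances(page_text, ONTOLOGY)
disclosure = generate_annotation("http://example.org/car-ad.html", found)
```

Here `disclosure` plays the role of the semantic meaning disclosure file: an agent that knows the ontology can read it without parsing the page again.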

Two-Layer Semantic Annotator (part of my dissertation)

Three issues are important to a semantic annotation system: accuracy, speed of execution, and resiliency of the annotation tool. High accuracy is a performance requirement for any system, and so is speed. Because there are so many web pages to annotate, slow execution, even with high accuracy, is problematic. Resiliency is a requirement for applications that are oriented to generic web pages. In our case, we want the annotation process to work continuously without the overhead of regenerating extraction patterns whenever web page layouts change, and we want the same annotation process to work immediately with new pages within the domain when they come online. Our Two-Layer Semantic Annotator is designed to be a generic annotation tool that fulfills these three requirements.
The figure on the right shows our two-layer annotation model. The lower-layer conceptual annotator uses an ontology-based IE tool. Since ontology-based IE tools are resilient to changes in web pages, so is the conceptual annotator. Hence the conceptual annotator is able to work immediately on web pages within the domain when they come online. The upper-layer structural annotator uses a layout-based IE tool. Since layout-based IE tools execute fast and, when properly constructed, have high accuracy, the structural annotator also executes fast and has high accuracy. In general, the system passes an arbitrary input web page to the conceptual annotator to fulfill the requirement of resiliency. When a large set of input documents follows a similar layout pattern, the system automatically builds a structural annotator from the results of the conceptual annotator on a small set of sample web pages. The system then uses the dynamically created structural annotator to annotate the remaining large number of documents quickly and with high accuracy.
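
The interplay of the two layers can be sketched as follows. This is a minimal illustration, not our implementation: regular expressions stand in for the ontology-based IE tool, fixed-width literal delimiters stand in for the layout-based IE tool, and every name in the code (`CONCEPT_PATTERNS`, `conceptual_annotate`, `build_structural_annotator`, the sample pages) is hypothetical.

```python
import re

# Hypothetical concept recognizers standing in for an ontology-based IE tool.
CONCEPT_PATTERNS = {"Price": r"\$\d[\d,]*", "Year": r"\b(?:19|20)\d{2}\b"}

def conceptual_annotate(page):
    """Lower layer: layout-independent recognition; returns concept -> (start, end)."""
    found = {}
    for concept, pattern in CONCEPT_PATTERNS.items():
        match = re.search(pattern, page)
        if match:
            found[concept] = match.span()
    return found

def build_structural_annotator(sample_pages, context=6):
    """Learn, per concept, the literal text that surrounds its value in the samples."""
    rules = {}
    for concept in CONCEPT_PATTERNS:
        delimiters = set()
        for page in sample_pages:
            span = conceptual_annotate(page).get(concept)
            if span is None:
                delimiters = set()  # concept missing from a sample: learn no rule
                break
            start, end = span
            delimiters.add((page[max(0, start - context):start],
                            page[end:end + context]))
        # Keep a rule only when every sample agrees on non-empty delimiters.
        if len(delimiters) == 1:
            left, right = next(iter(delimiters))
            if left and right:
                rules[concept] = (left, right)

    def structural_annotate(page):
        """Upper layer: extract values with plain string search, no recognizers."""
        values = {}
        for concept, (left, right) in rules.items():
            begin = page.find(left)
            if begin < 0:
                continue
            begin += len(left)
            stop = page.find(right, begin)
            if stop >= 0:
                values[concept] = page[begin:stop]
        return values

    return structural_annotate

samples = [
    "<tr><td class='y'>2003</td><td class='p'>$8,500</td></tr>",
    "<tr><td class='y'>1999</td><td class='p'>$3,200</td></tr>",
]
fast_annotate = build_structural_annotator(samples)
result = fast_annotate("<tr><td class='y'>2005</td><td class='p'>$12,000</td></tr>")
```

The conceptual annotator runs only on the samples; the remaining pages with the same layout go through `fast_annotate`, which relies on the learned delimiters alone. If the layout changes, the delimiters stop matching and the system can fall back to the conceptual annotator to relearn them.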


Literature Background

Researchers have studied both manual and automatic ways to annotate web pages. Manual annotation research focuses more on annotation representation, sharing, and storage mechanisms, as well as on user-friendly interfaces that help people write down their notes. Research on automated annotation tools, on the other hand, focuses more on ways of creating annotations according to specified domain ontologies.

The representative manual annotation tool is Annotea. Annotea uses an RDF-based annotation schema for describing annotations and XPointer for locating the annotations in the annotated document [1]. The RDF-based annotation schema is a simple version of an ontology. Besides Annotea, there are other manual annotation tools that could potentially be used for semantically annotating the web, including but not limited to CritLink, CoNote, Futplex, Annotator, ComMentor, and ThirdVoice. A good online survey of these tools is [2].

Most automatic semantic annotation tools assume a given pre-constructed domain ontology in order to avoid the difficult problem of automatic ontology generation. But since they do not use ontology-based IE approaches as their data-recognition engine, they all have to deal with the problem of aligning concepts in ontologies with the data categories defined in the IE tools they adopt. Here are some representative ones.

Ont-O-Mat [3] is the implementation of S-CREAM, a framework that proposes both manual and semi-automatic annotation of web pages. Ont-O-Mat adopts automated data extraction techniques from Amilcare, an adaptive IE (Information Extraction) system designed to support active annotation of documents. The Amilcare learner is based on (LP)2, a covering algorithm for supervised learning of IE rules based on Lazy-NLP (Natural Language Processing). S-CREAM proposes a set of heuristics for post-processing IE results and mapping them to an ontology. Ont-O-Mat provides ways to access ontologies specified in a markup format, such as RDF and DAML+OIL, but only one ontology can be accessed at a time. Ont-O-Mat can store pages annotated in DAML+OIL using OntoBroker as an annotation server. Ont-O-Mat stores the markup information in the web pages. It also provides crawlers that can search the web for annotated web pages to add to its internal knowledge base.

Organization: AIFB (Institute of Applied Informatics and Formal Description Methods) at the University of Karlsruhe
Ontology Language: DAML+OIL, RDF
Input Ontology: only one per task
Annotation Representation: within web pages


MnM [4] is very similar to Ont-O-Mat. It provides both automated and semi-automated support. MnM integrates a web browser with an ontology editor. In addition to providing ways to access ontologies specified in a markup format, as Ont-O-Mat does, MnM also provides open APIs, such as OKBC (Open Knowledge Base Connectivity), for linking to ontology servers and for integrating IE tools. Furthermore, unlike Ont-O-Mat, MnM can handle multiple ontologies at the same time. Beyond this, MnM shares almost all of Ont-O-Mat's other features. As stated by the authors, the difference between the two systems lies in their philosophies. While Ont-O-Mat adopts the philosophy that the markup should be included as part of the resources, MnM stores its annotations both as markup on a web page and as items in a knowledge base.

Organization: KMI (Knowledge Media Institute) at the Open University, United Kingdom
Ontology Language: DAML+OIL, RDF
Input Ontology: one or more than one per task
Annotation Representation: within web pages and in a knowledge base


The KIM platform [5] is part of the SWAN (Semantic Web ANnotator) project. The KIM platform consists of a formal KIM ontology and a KIM knowledge base, a KIM Server (with an API for remote access or embedding), and front-ends that provide full access to the functionality of the KIM Server. The KIM ontology is a light-weight upper-level ontology that defines the entity classes and relations of interest. The authors chose RDF(S) as their ontology representation language. The KIM knowledge base contains the entity description information used for annotation. During the annotation process, KIM employs an NLP IE technique based on GATE (General Architecture for Text Engineering) to extract, index, and annotate data instances. The KIM Server coordinates the multiple units of the overall platform. The annotated information is stored inside the web pages. The KIM front-ends provide a browser plug-in so that people can view the annotated information graphically, through different highlight colors, in regular web browsers such as Microsoft's Internet Explorer.

Organization: DERI (Digital Enterprise Research Institute) at the National University of Ireland at Galway, also involving the GATE research team, and the OntoText laboratory of Sirma AI Ltd.
Ontology Language: RDF(S)
Input Ontology: KIM Ontology
Annotation Representation: within web pages


SemTag [6] is the largest-scale semantic tagging effort to date. It is part of the WebFountain research project. Almaden's researchers applied SemTag to annotate a collection of approximately 264 million web pages and generate approximately 434 million automatically disambiguated semantic tags, which are published to the web as a label bureau providing metadata regarding the 434 million annotations. SemTag uses the TAP ontology to define annotation classes. The TAP ontology is very similar in size and structure to the KIM ontology and knowledge base. To resolve ambiguity, SemTag uses a vector-space model to assign the correct ontological class or to determine that a concept does not correspond to any class in TAP. The disambiguation is carried out by comparing the context of a concept (10 words to the left and 10 to the right) to the contexts of instances in TAP with compatible aliases. Fortunately, TAP does not have many entities that share the same alias, which makes the task of disambiguation easier. The SemTag system is implemented on a high-performance parallel architecture, where each node annotates about 200 documents per second. The authors of [6] reported that the correctness of annotation is about 80% when they used 24 internal nodes. The authors did not mention what the annotation format is or how the annotated information is stored.
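
The idea of comparing a 10-word context window against stored contexts can be sketched with a plain bag-of-words cosine similarity. This is only a simplified illustration of a vector-space comparison, not SemTag's actual algorithm: the function names, the rejection threshold, and the toy candidate contexts are all assumptions.

```python
import math
from collections import Counter

def context_window(tokens, index, width=10):
    """Return up to `width` tokens on each side of tokens[index]."""
    return tokens[max(0, index - width):index] + tokens[index + 1:index + 1 + width]

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def disambiguate(tokens, index, candidates, threshold=0.1):
    """Pick the class whose stored context is most similar to the mention's
    context, or None when no candidate exceeds the (assumed) threshold."""
    mention = Counter(context_window(tokens, index))
    best_class, best_score = None, threshold
    for cls, context_tokens in candidates.items():
        score = cosine(mention, Counter(context_tokens))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Hypothetical stored contexts for two classes that share the alias "jaguar".
candidates = {
    "Animal": "cat predator forest hunting prey river".split(),
    "CarBrand": "car engine luxury dealer drive".split(),
}
tokens = "the jaguar was seen hunting in the rain forest near the river".split()
chosen = disambiguate(tokens, 1, candidates)
```

Returning `None` corresponds to deciding that the mention does not match any class in the ontology.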

Organization: IBM Almaden Research Center
Ontology Language: TAP ontology language
Input Ontology: TAP Ontology
Annotation Representation: within web pages


References:

[1] Jose Kahan, Marja-Riitta Koivunen, Eric Prud'Hommeaux, and Ralph R. Swick. Annotea: An Open RDF Infrastructure for Shared Web Annotations. In Proceedings of the Tenth International World Wide Web Conference, Hong Kong, China, May, 2001. pp. 623-632.

[2] Rachel M. Heck, Sarah M. Luebke, and Chad H. Obermark. A survey of web annotation systems. Technical report, Grinnell College, Grinnell, Iowa, 1999.

[3] Siegfried Handschuh, Steffen Staab, and Fabio Ciravegna. S-CREAM - Semi-automatic CREAtion of Metadata. In Proceedings of the European Conference on Knowledge Acquisition and Management (EKAW-2002), Madrid, Spain, October, 2002.

[4] Maria Vargas-Vera, Enrico Motta, John Domingue, Mattia Lanzoni, Arthur Stutt, and Fabio Ciravegna. MnM: Ontology Driven Tool for Semantic Markup. In Proceedings of the Workshop Semantic Authoring, Annotation & Knowledge Markup (SAAKM 2002) with ECAI 2002, Lyon, France, July, 2002.

[5] Atanas Kiryakov, Borislav Popov, Ivan Terziev, Dimitar Manov, and Damyan Ognyanoff. Semantic Annotation, Indexing, and Retrieval. Journal of Web Semantics, 2(1):49-79, December 2004.

[6] Stephen Dill, Nadav Eiron, David Gibson, Daniel Gruhl, R. Guha, Anant Jhingran, Tapas Kanungo, Kevin S. McCurley, Sridhar Rajagopalan, Andrew Tomkins, John A. Tomlin, and Jason Y. Zien. A Case for Automated Large Scale Semantic Annotations. Journal of Web Semantics, 1(1):115-132, December 2003.

Ontology Assembling



Ontology assembling is another part of my dissertation research. The motivation for this study is to make our semantic annotation system more automatic. Many automatic or semi-automatic semantic annotation tools assume that the input domain ontology has been pre-constructed. Although this assumption simplifies the annotation problem, it leaves open the critical and unavoidable problem of how to obtain an ontology for annotation in the first place.

When we study the ontology creation problem closely, we find that it has both a negative and a positive side in the semantic annotation setting. On the negative side, almost no two web pages are the same, and thus we may need many different domain ontologies. On the positive side, however, these many different domain ontologies contain many overlapping and repeated components. For example, given two ontologies, one about professors and another about company organizations, a new ontology about university employees is basically a subset of the union of the two: part of the information, like Terminal Degree, University of Terminal Degree, Research Specialities, and Teaching Specialities, can be directly adopted from the domain of professors, and other information, like Supervisor, Position, and Department, can be retrieved from the ontology of company organizations. "What has been will be again, what has been done will be done again; there is nothing new under the sun." This is from Ecclesiastes 1:9, The Holy Bible.
The purpose of this ontology assembling research is to maximize the reuse of existing ontologies and minimize the work of constructing new ones. The figure on the right illustrates this goal. If some part of (or, in the best case, all of) the required formal semantics has already been built in a collection of knowledge, we want to adopt the existing semantics rather than reconstruct it from scratch. The task therefore becomes determining how the system can find useful knowledge components and assemble them into the domain ontology that describes the information in a web page presented by the user.

The collection of knowledge contains previously used ontologies, ontology components, and data frames. In addition, the collection also contains previously used mapping information. Our policy, however, is not to enforce any integration of the knowledge within the collection unless a process demands it. The collection therefore stays "open-minded" instead of "stubborn". When a web page is presented, the system selects components of existing formal semantics using their data recognition mechanisms. Then we both use existing mapping information and apply new mapping procedures to assemble these components into a new domain ontology. During the process, when some required knowledge is not already contained in the collection, the system links users to a manual ontology generation tool so that they can create it and add the new formal semantics to the collection of knowledge.
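
The selection step can be sketched as follows, reusing the university-employee example. This is a deliberately simplified illustration: the collection contents, the regular-expression recognizers, and the function name are hypothetical, and the mapping step that reconciles components from different sources is omitted.

```python
import re

# Hypothetical knowledge collection: ontology name -> {concept: recognizer regex}.
COLLECTION = {
    "Professor": {
        "Terminal Degree": r"\b(?:Ph\.D\.|M\.S\.|M\.A\.)",
        "Research Specialities": r"Research (?:interests|specialities):",
    },
    "CompanyOrganization": {
        "Position": r"\b(?:Director|Manager|Dean|Chair)\b",
        "Department": r"Department of [A-Z]\w+",
        "Stock Symbol": r"\b(?:NYSE|NASDAQ):\s*[A-Z]+",
    },
}

def assemble_ontology(page_text, collection):
    """Select every concept whose recognizer fires on the page, keeping a
    record of the ontology it was borrowed from."""
    assembled = {}
    for source, concepts in collection.items():
        for concept, pattern in concepts.items():
            if re.search(pattern, page_text):
                assembled[concept] = {"source": source, "recognizer": pattern}
    return assembled

page = ("Jane Doe, Ph.D., Chair, Department of Linguistics. "
        "Research interests: computational semantics.")
university_employee = assemble_ontology(page, COLLECTION)
```

Concepts that appear on the page but are recognized by nothing in the collection are exactly the ones for which a user would be sent to the manual ontology generation tool.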


Collaborators



This is an ongoing research project for which I am responsible. I am grateful to the researchers with whom I collaborate and from whom I receive advice on this project.

David W. Embley (Computer Science, BYU)
Stephen W. Liddle (Business School, BYU)
Deryle W. Lonsdale (Linguistics, BYU)
Yuri A. Tijerino (Applied Media Informatics, Kwansei Gakuin University, Japan)
Troy Walker (Google)
Alan Wessman (CS graduate, BYU)
Li Xu (Computer Science, U. of Arizona, South)

Other Research Topics



Readers who would like to explore this topic further with me can send email to ding@cs.byu.edu. There are also some other research projects in which I have been or am currently involved.
Last updated: Apr 10th, 2006