
Ontology-based Data Extraction

The ontology-based data extraction tool, named Ontos, is a long-running research project of the Data Extraction Research Group at Brigham Young University, under way since the mid-1990s. The project combines many smaller projects, including OSM conceptual modeling, data-extraction ontology development, record boundary detection, table recognition, deep web information extraction, and several others. Its research products have also led to many related research projects, such as schema and ontology mapping and semantic annotation. To date, the project has been supported by two NSF grants, TIDIE and TANGO.

The paper "Towards Ontology Generation from Tables", published in the WWW Journal in August 2005, gives a detailed description of the TANGO framework. There are also some other research projects in which I have been or am currently involved.

Project Description


Most traditional data extraction studies rely on the structure or presentation features of the data within a document to generate rules or patterns that perform the extraction. However, extraction can also be accomplished by relying directly on the data. Given a specific application domain, an ontology can be used to locate constants present in a page and to construct objects from them. [1]

BYU Ontos is the most representative ontology-based data extraction tool [2]. In this tool, ontologies are constructed in advance to describe the data of interest, including relationships, lexical appearance, and context keywords. By parsing such an ontology, the tool can automatically produce a database by recognizing and extracting the data present in the documents or pages given as input. Before the ontology is applied, the tool requires an automatic procedure that extracts the chunks of text containing the data items of interest (i.e., individual records).
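The idea described above can be illustrated with a small Python sketch. This is not the Ontos implementation: the ObjectSet structure, the toy car-ad ontology, the blank-line record splitter, and the 20-character keyword window are all assumptions made only for illustration. Each concept carries a regular expression for the lexical appearance of its constants and, optionally, context keywords that must appear near a candidate value.

```python
import re
from dataclasses import dataclass


@dataclass
class ObjectSet:
    """One concept in a toy extraction ontology (hypothetical structure)."""
    name: str
    value_pattern: str     # lexical appearance of the constants
    keywords: tuple = ()   # context keywords expected near a value


# Hypothetical car-ad ontology, invented for this example.
CAR_AD_ONTOLOGY = [
    ObjectSet("Year", r"\b(?:19|20)\d{2}\b"),
    ObjectSet("Price", r"\$\s?\d{1,3}(?:,\d{3})*", ("asking", "price")),
    ObjectSet("Mileage", r"\b\d{1,3}(?:,\d{3})*\b", ("miles", "mileage")),
    ObjectSet("Phone", r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"),
]


def split_records(page):
    """Naive stand-in for record boundary detection: split on blank lines."""
    return [chunk.strip() for chunk in re.split(r"\n\s*\n", page) if chunk.strip()]


def keyword_gap(text, start, end, keywords):
    """Smallest character gap between text[start:end] and any keyword, or None."""
    lowered = text.lower()
    best = None
    for keyword in keywords:
        for m in re.finditer(re.escape(keyword.lower()), lowered):
            if m.end() <= start:
                gap = start - m.end()      # keyword precedes the value
            elif m.start() >= end:
                gap = m.start() - end      # keyword follows the value
            else:
                gap = 0                    # keyword overlaps the value
            if best is None or gap < best:
                best = gap
    return best


def extract_record(record, ontology, window=20):
    """Fill one database row by recognizing constants in a single record."""
    row = {}
    for obj in ontology:
        candidates = []
        for m in re.finditer(obj.value_pattern, record):
            if not obj.keywords:
                candidates.append((0, m.start(), m.group()))
                continue
            gap = keyword_gap(record, m.start(), m.end(), obj.keywords)
            if gap is not None and gap <= window:
                candidates.append((gap, m.start(), m.group()))
        if candidates:
            # Prefer the value closest to a keyword, then the earliest one.
            row[obj.name] = min(candidates)[2]
    return row


def build_table(page, ontology):
    """Apply the ontology to every record found in the page."""
    return [extract_record(record, ontology) for record in split_records(page)]


if __name__ == "__main__":
    page = (
        "1998 Honda Civic, 82,000 miles, asking $4,500. Call (801) 555-1234.\n"
        "\n"
        "<b>Price:</b> $12,900 | 2003 Toyota Camry | 41,200 miles"
    )
    for row in build_table(page, CAR_AD_ONTOLOGY):
        print(row)
```

The two sample records are formatted differently (running prose versus marked-up fields), yet the same ontology is applied to both, because recognition depends on the data and its nearby keywords rather than on page layout.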

Extraction performance depends heavily on the quality of the data-extraction ontology. If the ontology is representative enough, the extraction is fully automated. Furthermore, wrappers generated with this approach are inherently resilient (i.e., they continue to work properly even if the formatting features of the source pages change) and adaptable (i.e., they work for pages from many distinct sources belonging to the same application domain). According to the survey [1], these features are unique to ontology-based data extraction approaches.

For readers who are interested in how Ontos works, there are online demos they can play with.

References:

[1] A.H.F. Laender, B.A. Ribeiro-Neto, A.S. da Silva, and J.S. Teixeira. A Brief Survey of Web Data Extraction Tools, SIGMOD Record, 31(2):84-93, June 2002.

[2] D.W. Embley, D.M. Campbell, Y.S. Jiang, S.W. Liddle, D.W. Lonsdale, Y.-K. Ng, and R.D. Smith. Conceptual-model-based data extraction from multiple-record web pages, Data & Knowledge Engineering, 31(3):227-251, November 1999.

Collaborators


This is a long-running project that has been under study for the last decade. I participated in both TIDIE and TANGO, and I was one of the major members involved in the design of TANGO. Every member of the BYU Data Extraction Research Group has, to a greater or lesser degree, been part of this ontology-based data extraction project. Interested readers can find all current and previous members listed on the DEG web site.

Other Research Topics



Readers who would like to explore this topic further with me can send email to ding@cs.byu.edu. There are also some other research projects in which I have been or am currently involved.
Last updated: Sep 15th, 2005