
Semi-Automatic Domain Ontology Generation
by Reusing A Large-Scale Ontology

With the emergence of the semantic web, researchers have become more interested in methods for ontology generation, especially methods for creating new domain ontologies through ontology reuse. Building ontologies from scratch is usually very difficult. Although many researchers have tried to automate ontology generation with different methodologies, such as machine learning and natural language processing, no practically satisfactory solution is available yet. For the near future, we may expect that most ontology construction will still depend on manual work by humans. At the same time, as more and more ontologies become available, it is reasonable to ask whether we can reuse existing ontology components to reduce the load on ontology engineers. This research is such a case study. In this work, we built several small domain ontologies for individual web pages by reusing a large-scale, generic ontology, the Mikrokosmos ontology. Our results suggest that it is now possible to automate this ontology reuse process, and that doing so may reduce the load of generating an ontology from scratch by as much as 50%.

This is a finished project. There are also some other research projects in which I have been or am currently involved.
Project Publications



[3] Generating Domain Ontologies by Examples through Automated Ontology Reuse by Yihong Ding, Deryle W. Lonsdale, David W. Embley, Martin Hepp, and Li Xu. (submitted for review)

[2] Semi-automatic Generation of Resilient Data-Extraction Ontologies by Yihong Ding, Master's Thesis, Brigham Young University, Provo, Utah, June 2003.

[1] Peppering Knowledge Sources with SALT: Boosting Conceptual Content for Ontology Generation by Deryle W. Lonsdale, Yihong Ding, David W. Embley, and Alan Melby, In Proceedings of the AAAI Workshop for Semantic Web Meets Language Resources, pp. 30-36, Edmonton, Alberta, Canada, July 28, 2002.

Project Description


The architecture figure illustrates the method we use to build a domain ontology from existing knowledge sources, the core of which in our case is the Mikrokosmos ontology, a well-designed large generic ontology [1]. Given some training documents, the system applies a concept-selection process, a relationship-set-selection process, and a constraint-discovery process in sequence to create a domain ontology, which in our experiments is represented as an OSM data-extraction ontology. After an ontology is generated, users may evaluate it by applying it to extract data from test documents and observing the precision and recall values. During the evaluation and tuning stage, users may intervene at any time to update the generated ontology until the final results fulfill their expectations.
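The evaluation step can be illustrated with a minimal sketch of how precision and recall might be computed for one test document. The function name and the (concept, value) pairs below are hypothetical and are not taken from our experiments; the sketch only assumes that extracted and expected values can be compared as sets.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) of extracted items against the expected items."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    # Guard against empty sets so the ratios are always defined.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical (concept, value) pairs for one test document.
extracted = {("Price", "$4,500"), ("Year", "1998"), ("Make", "Ford"), ("Year", "2003")}
expected = {("Price", "$4,500"), ("Year", "1998"), ("Make", "Ford"), ("Model", "Taurus")}

p, r = precision_recall(extracted, expected)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case one extracted pair is spurious and one expected pair is missed, so both values fall below 1; a user tuning the generated ontology would watch how such values change after each modification.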

[Figure: system architecture]


The primary knowledge source in the integrated knowledge base is the Mikrokosmos ontology, a well-designed "broad-coverage" ontology [1]. It contains over 5000 concepts, and each concept connects to an average of 14 other concepts through relationships represented in its slots. The ontology is represented in XML syntax.

In addition to the primary knowledge source, three additional knowledge sources make up the knowledge base. The data-frame library provides a collection of data frames, which describe the extensions of many individual concepts. The WordNet [2] lexicons serve as a synonym dictionary that links multiple knowledge sources together. There are also some additional lexicons that complement the Mikrokosmos ontology, for example, lists of abbreviations and acronyms.

The knowledge sources are integrated in a human-guided, semi-automatic way. This integration needs to be done only once, and it is separate from the run-time ontology-generation process. Its purpose is to align the concepts defined in the Mikrokosmos ontology with their extensions, if any, so that a run-time data-recognition process can directly use them to identify the potential instantiations of each concept within training documents.


Our basic assumption about training documents is that they contain rich information and are narrow in the expected domain. If training documents are not data-rich, we may not be able to find data and thus may not be able to generate an ontology that covers the data of interest. If they are not narrow in topic breadth, the generated ontology may try to link highly unrelated concepts.

Once we have a set of training documents, we first pre-process them by (1) removing unrelated portions of the document content, and (2) decomposing a document into multiple records, if applicable. Again, the purpose of these operations is to emphasize the scope of the expected domain and to remove, as far as possible, any other information that may lead to mistakes. Each individual record thus represents a subpart of the domain of interest.


The ontology generation procedure contains three steps, which create concepts, relationships, and participation constraints. First, the system applies the extensional-level knowledge in the integrated knowledge base to find candidate concepts according to their instantiations within the training records. Next, the system uses three domain-independent heuristics to resolve conflicts that arise during the concept-recognition process.
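As a rough illustration of the concept-selection step, the sketch below treats each data frame as a single regular expression and keeps every concept that is instantiated in at least one training record. The concept names, patterns, and records are hypothetical, real data frames are richer than one regular expression, and the three conflict-resolution heuristics are not modeled here.

```python
import re

# Hypothetical data frames: each concept maps to one regular-expression recognizer.
DATA_FRAMES = {
    "PRICE": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
    "YEAR": re.compile(r"\b(?:19|20)\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.]+@[\w.]+"),
}


def select_candidate_concepts(records, data_frames=DATA_FRAMES):
    """Return {concept: [match count per record]} for concepts matched at least once."""
    counts = {}
    for concept, pattern in data_frames.items():
        per_record = [len(pattern.findall(record)) for record in records]
        if any(per_record):
            counts[concept] = per_record
    return counts


# Hypothetical training records, already split out of a source document.
records = [
    "1998 Ford Taurus, $4,500, call 801-555-1234",
    "2001 Honda Civic, $7,200 or best offer",
]

candidates = select_candidate_concepts(records)
print(candidates)
```

Keeping the per-record counts, rather than a simple yes/no flag, is a design choice in this sketch: the same counts can later feed the discovery of participation constraints.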

After conflict resolution, the system applies a FIND-BINARY algorithm to retrieve binary relationships among the selected concepts from the knowledge structure in the knowledge base. The core of the FIND-BINARY algorithm is based on Dijkstra's well-known algorithm for finding the shortest path between two nodes in a graph. The FIND-BINARY process may generate more than one connected subgraph. Under the data-rich assumption, we take the largest connected subgraph to represent the domain of interest, and we eliminate the concepts outside it as unrelated. After this process, the system also applies an optional n-ary relationship detection procedure that combines closely overlapping binary relationship sets into n-ary relationship sets, which can present the domain more directly for humans to understand.
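The following is a minimal sketch of the idea behind this step, not the FIND-BINARY algorithm itself. It assumes the knowledge structure is available as a weighted adjacency dictionary, that two selected concepts are linked when their shortest path costs at most a threshold (the `max_cost` parameter is our assumption for illustration), and it uses a toy graph with hypothetical concept names.

```python
import heapq
from itertools import combinations


def dijkstra_path(graph, source, target):
    """Return (cost, path) for the shortest path in {node: {neighbor: weight}}, or None."""
    dist, prev, visited = {source: 0}, {}, set()
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            path = [node]
            while node in prev:  # walk predecessors back to the source
                node = prev[node]
                path.append(node)
            return cost, path[::-1]
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return None


def find_binary(graph, selected, max_cost):
    """Link each pair of selected concepts whose shortest path costs at most max_cost."""
    links = {}
    for a, b in combinations(sorted(selected), 2):
        found = dijkstra_path(graph, a, b)
        if found is not None and found[0] <= max_cost:
            links[(a, b)] = found[1]
    return links


def largest_component(selected, links):
    """Return the largest set of selected concepts connected through the links."""
    adjacency = {concept: set() for concept in selected}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    best, seen = set(), set()
    for start in sorted(selected):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy knowledge structure with unit edge weights (hypothetical concepts).
graph = {
    "CAR": {"PRICE": 1, "YEAR": 1, "VEHICLE": 1},
    "PRICE": {"CAR": 1},
    "YEAR": {"CAR": 1},
    "VEHICLE": {"CAR": 1, "OWNER": 1},
    "OWNER": {"VEHICLE": 1, "PHONE": 1},
    "PHONE": {"OWNER": 1},
    "RECIPE": {"INGREDIENT": 1},
    "INGREDIENT": {"RECIPE": 1},
}
selected = {"CAR", "PRICE", "YEAR", "PHONE", "INGREDIENT"}

links = find_binary(graph, selected, max_cost=3)
core = largest_component(selected, links)
print(sorted(core))
```

In this toy example INGREDIENT is recognized in the records but has no path to the other selected concepts, so it falls outside the largest connected subgraph and is dropped, while PHONE is kept through the intermediate concepts VEHICLE and OWNER.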

After it has generated concepts and their relationship sets, the system determines the participation constraints for each relationship set using the number of instantiations of each concept in multiple training records. The more representative the training records are, the more accurately the system can create these participation constraints.
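One simple way to picture constraint discovery is sketched below. It assumes we already have, for each concept, the number of instantiations found in each training record, and it writes the result in a hypothetical "min:max" notation; the rule used here (optional if any record lacks the concept, many if any record has more than one instance) is an illustrative assumption, not necessarily the exact procedure of our system.

```python
def participation_constraint(counts_per_record):
    """Infer a 'min:max' constraint from a concept's instance count in each record."""
    # Optional (0) if some record has no instance, otherwise mandatory (1).
    lower = 0 if min(counts_per_record) == 0 else 1
    # At most one ("1") if no record has several instances, otherwise many ("*").
    upper = "1" if max(counts_per_record) <= 1 else "*"
    return f"{lower}:{upper}"


# Hypothetical instance counts over three training records.
counts = {
    "PRICE": [1, 1, 1],
    "YEAR": [1, 0, 1],
    "PHONE": [1, 0, 2],
    "FEATURE": [3, 1, 2],
}

constraints = {concept: participation_constraint(n) for concept, n in counts.items()}
print(constraints)
```

With only a few records the inferred constraints can easily be too strict or too loose, which is why representative training records matter for this step.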

The system outputs the final version of the automatically created ontology to an editable user interface. Users can inspect the ontology and manually modify it. Our goal for the automatic generation part is to avoid, as much as possible, the need to add new information to a created ontology. In general, it is much easier to delete or modify a generated entity set or relationship set than to add a totally new one to an ontology.

An interesting application of this tool is that we added a new knowledge source, SALT, which contains both new concepts and new relationship sets, to the knowledge base. In addition, we added the Eurodicautom terminology bank as our termbase. This terminology bank consists of over a million concept entries covering a wide range of topics. Each entry is multilingual, containing equivalents in any of several languages.

With the augmented knowledge base, we ran the system on various U.S. Department of Energy (DOE) abstracts to test its performance. Although the final results contain some spurious relationship sets because DOE abstracts cover a wide range of topics, there is an encouraging observation: because it uses multiple ontologies, the system is able to suggest some brand-new, reasonably correct relationship sets that are not contained in any of the original knowledge sources.


More details of this project and discussions of the experiments are in my thesis, which is freely available online for interested readers [3].

References:

[1] K. Mahesh. Ontology Development for Machine Translation: Ideology and Methodology. Technical Report MCCS-96-292, Computer Research Laboratory, New Mexico State University, 1996.

[2] G.A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39-41, November 1995. URL: http://www.cogsci.princeton.edu/~wn/

[3] Yihong Ding. Semiautomatic Generation of Resilient Data-Extraction Ontologies. Master's Thesis. Brigham Young University. June 2003. URL: http://www.deg.byu.edu/papers/ThesisMasterYihong.ps

Lessons and Discussion



This project presents a knowledge-reuse approach to automatically generating domain ontologies. We believe that this is an important direction for future research on semantic web ontology generation. In the semantic web scenario, users usually need narrow-scale knowledge representations but have very flexible and dynamic requests. However, creating formal ontologies is a highly technical job that requires solid knowledge of ontology theories. It would thus be very helpful to develop a tool that ordinary users can use, that produces ontologies according to users' specific requirements, and that reuses pre-created knowledge built by professional ontologists. This type of tool may allow ordinary users, even program developers who do not have any ontology background, to develop semantic web applications easily. It therefore helps to realize the dream of the semantic web.

Some results of this semi-automatic ontology generation tool are reasonably good. In general, however, the tool still cannot fulfill many requests. We learned four lessons in this study.


Collaborators



This is a finished project for my master's thesis. I am grateful to the researchers with whom I collaborated and from whom I received advice.

David W. Embley (Computer Science, BYU)
Deryle W. Lonsdale (Linguistics, BYU)
Li Xu (Computer Science, University of Arizona South)


Other Research Topics



Interested readers who would like to explore this topic further with me can send me email at ding@cs.byu.edu. There are also some other research projects in which I have been or am currently involved.
Last updated: Sep 30th, 2006