PI: David W. Embley, Computer Science Department
Co-PI: Douglas M. Campbell (deceased)
Co-PI: Stephen W. Liddle, School of Accountancy and Information Systems
Co-PI: Deryle G. Lonsdale, Linguistics Department
Co-PI: Yiu Kai (Dennis) Ng, Computer Science Department
Brigham Young University
Prof. David W. Embley
Computer Science Department
Brigham Young University
Provo, UT 84602
Phone: 801-422-6470
Fax: 801-378-7775
Email: embley@cs.byu.edu
Ontology-based data and information extraction and integration; source/target matching and mapping.
With ever-growing volumes of data in widely varying formats, there is a need to sift and funnel information to users to meet their own specific requirements. This project addresses the challenge of finding, extracting, and delivering appropriate data by developing a versatile framework that is target-based (i.e., based on a user's description of the desired information) and document-independent (i.e., robust, not failing whenever documents change or when new documents of interest are encountered). A combination of document-related clues regarding textual content as well as geometric and organizational layout enables processing across various document formats. Developers and users specify areas of interest via descriptive ontologies (i.e., declarations of information types and concept relationships). These ontologies facilitate reformulating, matching, and merging retrieved information. The result of these efforts will be a comprehensive infrastructure to expertly extract, automatically organize, and succinctly summarize critical information in a queryable personalized view. An online repository will contain research results, downloadable software (including source code), and a Web interface enabling user access to the various tools and engines developed. Potentially, this technology can be embedded in personal agents; leveraged in customized search, filtering, and extraction tools; and used to provide tailored views of data via integration, organization, and summarization.
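As a hedged illustration of what a descriptive ontology declares, the sketch below models a miniature car-ads domain (one of our test applications) as object sets with regular-expression recognizers plus relationship sets. All names and patterns here are invented for this example and are not the project's actual ontology language.

```python
import re

# Hypothetical miniature application ontology for car ads. The object-set
# names, recognizers, and relationship sets are illustrative stand-ins.
CAR_ADS_ONTOLOGY = {
    "object_sets": {
        "Year": re.compile(r"\b(?:19[7-9]\d|20\d\d)\b"),
        "Price": re.compile(r"\$\d{1,3}(?:,\d{3})*"),
        "Mileage": re.compile(r"\b\d{1,3}[kK]\b"),
    },
    "relationship_sets": [
        ("Car", "has", "Year"),
        ("Car", "costs", "Price"),
        ("Car", "driven", "Mileage"),
    ],
}

def recognize(text):
    """Classify atomic data values by running each object set's recognizer."""
    return {name: rx.findall(text)
            for name, rx in CAR_ADS_ONTOLOGY["object_sets"].items()}

ad = "1998 Honda Accord, 85k miles, $6,500, one owner."
```

The relationship sets are what later allow matched values to be assembled into a record for a single car rather than left as isolated fragments.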
Within the area of wrapper generation, our work on ontology-based information extraction has become widely known. Our unique contribution is to produce resilient application-specific wrappers that do not "break" when a Web site changes and that apply, without alteration, to new Web sites for the application that come online. Our recall and precision results are as good as, and usually better than, the results obtained by others working in the area.
We believe that our contribution to schema matching will also have an impact. Currently, we are just beginning to publish some of these results. In comparison with others, we provide two contributions: (1) Our automatically generated direct schema-element matching procedures yield higher accuracy than any others we have seen. (2) We are able to produce some automatically generated indirect schema-element matches.
Our contributions include the following:
We presented an approach for recognizing which multiple-record Web documents apply to an ontology. Once an application ontology is created, we can train a machine-learning algorithm over a triple of heuristics (density, expected-values, grouping) to produce a decision tree that accurately recognizes multiple-record documents for the ontology. In the tests we conducted, F-measures were above 95%, with recall and precision above 90%, for both of our applications.
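A minimal sketch of the idea follows, with toy recognizers and invented thresholds standing in for the learned decision tree; it is not the project's code.

```python
import re

# Toy recognizers for a two-concept ontology; real applications use richer
# recognizer sets. All thresholds below are invented for illustration.
RECOGNIZERS = {
    "Year": re.compile(r"\b19[7-9]\d\b"),
    "Price": re.compile(r"\$\d[\d,]*"),
}

def heuristics(text, expected_per_record=2):
    hits = [m for rx in RECOGNIZERS.values() for m in rx.finditer(text)]
    # density: fraction of characters covered by recognized values
    density = sum(len(m.group()) for m in hits) / max(len(text), 1)
    # expected-values: fraction of the ontology's object sets seen in the text
    expected = sum(1 for rx in RECOGNIZERS.values()
                   if rx.search(text)) / len(RECOGNIZERS)
    # grouping: recognized values relative to the size of one expected record
    grouping = len(hits) / expected_per_record
    return density, expected, grouping

def is_multiple_record(text):
    # Hand-built stand-in for the trained decision tree.
    d, e, g = heuristics(text)
    if d < 0.02:   # too little recognized content
        return False
    if e < 0.5:    # too few of the ontology's concepts appear
        return False
    return g >= 2  # enough values for at least two records

ads = "1995 Civic $4,000. 1998 Accord $6,500. 1997 Camry $5,900."
memo = "Meeting moved to Tuesday; please bring the quarterly report."
```

Running the classifier accepts the classified-ads page and rejects the memo, mirroring the intended multiple-record/non-applicable split.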
We presented a framework for discovering direct matches between sets of source and target attributes. In our framework, multiple facets each contribute evidence, which we combine to produce a final set of matches. The facets considered include terminological relationships such as synonyms and hypernyms, data-value characteristics such as variances and string lengths, expected values as declared by target regular-expression recognizers, and structural characteristics based on scheme graphs.
The results are encouraging and show that the multifaceted approach to exploiting metadata for attribute matching has promise. When we used all four facets for our car-ads tests we obtained recall and precision results above 90%. Recall and precision dropped off when we reduced the number of facets to either single independent facets or to our terminological facet followed by our structural test.
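The facet-combination idea can be sketched as follows. Only two facets are shown (terminological and a crude data-value characteristic), and the facet implementations, synonym set, and weights are invented placeholders, not our framework's actual procedures.

```python
# Hedged sketch of multifaceted attribute matching over invented data.

def terminological(a, b, synonyms):
    # 1.0 if the attribute names match directly or via a declared synonym pair
    return 1.0 if a == b or (a, b) in synonyms or (b, a) in synonyms else 0.0

def data_value(vals_a, vals_b):
    # compare average string lengths of sample values, one simple
    # data-value characteristic
    avg = lambda vs: sum(len(v) for v in vs) / len(vs)
    la, lb = avg(vals_a), avg(vals_b)
    return min(la, lb) / max(la, lb)

def combined(a, b, vals_a, vals_b, synonyms, w_term=0.6, w_data=0.4):
    # each facet contributes evidence; the evidence is combined into one score
    return (w_term * terminological(a, b, synonyms)
            + w_data * data_value(vals_a, vals_b))

SYNONYMS = {("Cost", "Price")}
match_score = combined("Cost", "Price",
                       ["$4,000", "$6,500"], ["$5,900"], SYNONYMS)
```

A pair supported by several facets (here, a synonym match plus similar value lengths) scores well above a pair supported by only one, which is the intuition behind combining facets rather than relying on any single one.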
We described our domain-independent approach for automatically retrieving the data behind a given Web form. We have prototyped a synergistic tool that brings the user into the process when an automatic decision is hard to make. We use a two-phase approach to gathering data: first we sample the responses from the Web site of interest, and then, if necessary, we methodically try all possible queries (until either we believe we have arrived at a fixpoint of retrieved data, or we have reached some other stopping threshold, or we have exhausted all possible queries).
We have created a prototype (mostly in Java, but also using JavaScript, PHP, and Perl) to test our ideas, and the initial results are encouraging. We have been successful with a number of Web sites, and we are continuing to study ways to improve our tool.
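The second (exhaustive) phase can be sketched as a loop over all field-value combinations with the stopping conditions described above. The form interface and data below are invented stand-ins for a real Web form, and the patience-based fixpoint test is one simple way to realize the "believe we have arrived at a fixpoint" condition.

```python
from itertools import product

# Invented stand-in for a Web form: `query` maps one filled-in form
# (a tuple of field-value assignments) to the records the site returns.
FAKE_SITE = {
    ("make=honda", "year=1998"): {"rec1", "rec2"},
    ("make=toyota", "year=1998"): {"rec3"},
    ("make=honda", "year=1997"): {"rec2"},
    ("make=toyota", "year=1997"): set(),
}

def query(assignment):
    return FAKE_SITE.get(assignment, set())

def retrieve_all(fields, max_queries=100, fixpoint_patience=10):
    """Methodically try all field-value combinations until we believe we
    have reached a fixpoint of retrieved data, hit another stopping
    threshold, or exhaust the query space."""
    results = set()
    stale = 0
    for n, assignment in enumerate(product(*fields)):
        if n >= max_queries:
            break                      # other stopping threshold
        new = query(assignment) - results
        if new:
            results |= new
            stale = 0
        else:
            stale += 1
            if stale >= fixpoint_patience:
                break                  # assume a fixpoint of retrieved data
    return results

fields = [("make=honda", "make=toyota"), ("year=1998", "year=1997")]
```

In the synergistic tool, a human would be consulted at the points where this sketch silently applies a threshold.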
We described an algorithmic process for automatically identifying and extracting records found in microfilmed genealogical tables. Our table-processing algorithm accepts as input an XML file describing the individual cells of a genealogical table taken from microfilm, and it produces SQL statements that insert the coordinates of the table's value cells into a database. Two key features drive the algorithm: (1) geometric layout and (2) label matching with respect to a given genealogical ontology. The algorithm operates in three steps: (1) it extracts features from the cells described in the XML input file; (2) it applies correlation rules that relate and update the collected evidence; and (3) it produces records for human users to verify, along with the SQL insert statements that enter the coordinates of table cells into a relational database. On our test corpus of microfilm tables, the algorithm achieved a precision of 93%, a recall of 92%, and an accuracy of 92% on the database fields it populated.
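A hedged sketch of the input-to-output shape of this pipeline follows, with an invented XML schema, label set, and table name; the real algorithm's correlation rules and geometric reasoning are far richer than the single header-column rule used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell descriptions for one small genealogical table.
XML = """<table>
  <cell row="0" col="0" text="Name"/>
  <cell row="0" col="1" text="Birth Date"/>
  <cell row="1" col="0" text="John Smith"/>
  <cell row="1" col="1" text="12 Mar 1861"/>
</table>"""

LABELS = {"Name", "Birth Date"}  # stand-in genealogical ontology labels

def cells_to_sql(xml_text):
    root = ET.fromstring(xml_text)
    cells = [(int(c.get("row")), int(c.get("col")), c.get("text"))
             for c in root.iter("cell")]
    # label matching: header cells whose text appears in the ontology lexicon
    header = {col: text for row, col, text in cells
              if row == 0 and text in LABELS}
    stmts = []
    for row, col, text in cells:
        # geometric layout: value cells lie below their labels
        if row > 0 and col in header:
            stmts.append(
                "INSERT INTO value_cell (row, col, label, text) "
                f"VALUES ({row}, {col}, '{header[col]}', '{text}');")
    return stmts
```

Each emitted statement pairs a value cell's coordinates with its matched label, which is the record structure a human verifier would then confirm.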
Our objective for TIDIE is to extract information from data-rich, semistructured documents and structure the information with respect to a given target description. Our concept of a semistructured document encompasses the notion of semistructured data, which [ABS00] defines as being "schemaless" but "self-describing" and representable by a variant of OEM (the Object Exchange Model [CGMH+94, ACC+97, MAG+97]). Starting with this notion of data semistructuredness, we enlarge it to include any document where self-descriptive clues have two properties: (1) they are sufficient to match attributes and values and (2) they are sufficient to allow these attribute-value pairs to be assembled into meaningful chunks of information representable by OEM. Semistructured documents run from the high end, where attribute-value pairs and their organization are given, to the low end, where the clues are subtle and depend on a high degree of human understanding to assemble and organize attribute-value pairs. In TIDIE we exploit these human-understandable, self-descriptive clues to classify atomic data values and to organize molecular record structures. Further, we seek to exploit these clues in a document-independent way, so that our techniques apply robustly over the full range of semistructured documents.
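As a hedged illustration of the OEM-style target, the sketch below assembles extracted attribute-value pairs into one molecular record; the labels are invented for this example, and nested (label, value) pairs stand in for OEM's labeled-graph representation.

```python
# A molecular record as nested (label, value) pairs: a value is either an
# atomic string or a list of child pairs.
record = ("car-ad", [
    ("Year", "1998"),
    ("Make", "Honda"),
    ("Price", "$6,500"),
])

def oem_to_text(node, indent=0):
    """Render an OEM-style node as indented text."""
    label, value = node
    pad = "  " * indent
    if isinstance(value, list):  # molecular structure: recurse into children
        return pad + label + "\n" + "\n".join(
            oem_to_text(child, indent + 1) for child in value)
    return f"{pad}{label}: {value}"  # atomic attribute-value pair
```

The point of the representation is that the same structure serves both low-end documents (where the pairs were hard to find) and high-end documents (where they were given), once extraction has done its work.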
We classify the particular self-descriptive clues we wish to exploit in TIDIE as being linguistic, geometric, ontological, and metatextual.
TIDIE project assumptions make our task tractable, but we are careful that our assumptions do not unreasonably diminish the range of applicability. We assume that the target descriptions are ontologically narrow and that the documents we process are data rich and semistructured. These three notions defy a precise definition, but are bounded as follows.
Target descriptions are ontologically narrow if the conceptual-model instance describing objects, relationships, and constraints is "small." "Small" means that the conceptual model has a half dozen to a few dozen object sets (attributes), about the same number (or a few more) relationship sets (connections among the attributes), and several dozen constraints. The conceptual models we used in our initial experiments were ontologically narrow.
Documents are data rich if they contain "many" attribute-value pairs. "Many" means that we can populate at least a few (say a half dozen or more) attribute-value pairs and their relationships in an ontologically narrow target description.
Documents are semistructured if they contain sufficient linguistic, geometric, ontological, and metatextual clues to allow human readers to extract atomic attribute-value pairs and organize them into molecular record structures. Documents that are chaotic, ambiguous, or contain literary imagery are outside the scope of TIDIE.
Future directions:
As a "vision of possibility," we can see that the technology we are developing can be embedded in personal agents; in customized search, filtering, and extraction tools; and in tools that provide individually tailored views—integrated, organized, and summarized to meet individual or organizational needs.
Our work differs fundamentally from the approach others have taken, basically because we provide a document-independent target description. The most common approach to information extraction from the Web has been through page-specific wrappers, written by hand [CGMH+94, AM97, GHR97] or written using a variety of techniques, including hand-written with the aid of a toolkit [SA99], hand-coded specialized grammars [ACC+97], wrapper generators based on HTML and other formatting information [AK97, HGMC+97], page grammars [AMM97], landmark grammars [MMK98], concept definition frames [SL97], or some form of supervised learning [Ade98, AK97, DEW97, KWD97, Sod97, Fre98, CDF+98]. A disadvantage of these wrapper-generation techniques is the work required to create the initial wrapper (a disadvantage we also share in the sense that we have to create a target description), and the rework required to update the wrapper when the source document changes (a disadvantage we do not share).
The approach of [SL97], which uses "concept definition frames," and that of [CDF+98], which uses "an ontology describing classes and relations," are closest to ours. Our notion of a "data frame" [Emb80] is similar to a "concept definition frame" in [SL97], but embodies a richer description of the data to be recognized and extracted. Our notion of an "ontology" is similar to an "ontology" in [CDF+98], but goes much further in describing the application of interest. The work reported in [Bri98] is also similar to ours in the sense that it is robust with respect to source-document changes. The technique in [Bri98], which extracts author/title pairs, requires very little supervision for the machine-learning approach it takes and need not be altered for new pages or when pages change. This approach, however, appears to be limited to very small, tightly coupled application domains such as the author/title pairs for which it was used.
Another approach that has been used for information extraction is natural language processing (NLP) [LCF+94, CL96, Sod97]. NLP approaches involve tokenization, part-of-speech and sense tagging, construction of syntactic and semantic structures and relationships, and production of a coherent framework for extracted information fragments. Our work does not attempt to understand text in the deep NLP sense; consequently it does not depend on sentential elements (as deep NLP approaches do), which are often missing from Web pages of classified ads and from partially formatted data found in forms and census records.
Our approach uses a specific target description, but we are not the only ones who have suggested target descriptions. With a somewhat similar objective in mind, [DMRA97, MD99, DM99] present Structured Maps as a modeling construct imposed over Web information sources. Similar to our target description, a semantic model provides a scheme over a domain of interest, which is then populated with information elements from the Web. In another effort with a similar objective, [AMM97] introduces a data model that describes the scheme for a user view over information on the Web, along with a set of languages for synthesizing the scheme for a particular application and for managing and restructuring data with respect to the scheme. Our work differs from these other efforts because they do not attempt to populate their model instances automatically, populating them instead by hand or with the aid of tools.
| [ABS00] | S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann Publishers, San Francisco, California, 2000. |
| [ACC+97] | S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, and J. Simeon. Querying documents in object databases. International Journal on Digital Libraries, 1(1):5–19, April 1997. |
| [Ade98] | B. Adelberg. NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 283–294, Seattle, Washington, June 1998. |
| [AK97] | N. Ashish and C. Knoblock. Semi-automatic wrapper generation for Internet information sources. In Proceedings of CoopIS'97, 1997. |
| [AM97] | P. Atzeni and G. Mecca. Cut and paste. In Proceedings of the 16th ACM PODS, pages 144–153, May 1997. |
| [AMM97] | P. Atzeni, G. Mecca, and P. Merialdo. To weave the Web. In Proceedings of the Twenty-third International Conference on Very Large Data Bases, pages 206–215, Athens, Greece, August 1997. |
| [Bri98] | S. Brin. Extracting patterns and relations from the World Wide Web. In Proceedings of the WebDB Workshop (at EDBT'98), 1998. |
| [CDF+98] | M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), pages 509–516, Madison, Wisconsin, July 1998. |
| [CGMH+94] | S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS project: Integration of heterogeneous information sources. In IPSJ Conference, pages 7–18, Tokyo, Japan, October 1994. |
| [CL96] | J. Cowie and W. Lehnert. Information extraction. Communications of the ACM, 39(1):80–91, January 1996. |
| [DEW97] | R.B. Doorenbos, O. Etzioni, and D.S. Weld. A scalable comparison-shopping agent for the World-Wide Web. In Proceedings of the First International Conference on Autonomous Agents, pages 39–48, Marina Del Rey, California, February 1997. |
| [DM99] | L. Delcambre and D. Maier. Models for superimposed information. In P.P. Chen, D.W. Embley, J. Kouloumdjian, S.W. Liddle, and J.F. Roddick, editors, Advances in Conceptual Modeling: Proceedings of the Workshop on the World Wide Web and Conceptual Modeling (WWWCM'99), volume LNCS 1727, pages 264–280, Paris, France, November 1999. Springer Verlag. |
| [DMRA97] | L.M.L. Delcambre, D. Maier, R. Reddy, and L. Anderson. Structured maps: Modeling explicit semantics over a universe of information. International Journal on Digital Libraries, 1(1):20–35, April 1997. |
| [ECJ+98] | D.W. Embley, D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith, S.W. Liddle, and D.W. Quass. A conceptual-modeling approach to extracting data from the Web. In Proceedings of the 17th International Conference on Conceptual Modeling (ER'98), pages 78–91, Singapore, November 1998. |
| [Emb80] | D.W. Embley. Programming with data frames for everyday data items. In Proceedings of the 1980 National Computer Conference, pages 301–305, Anaheim, California, May 1980. |
| [Fel98] | C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts, 1998. |
| [Fre98] | D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of AAAI/IAAI, pages 517–523, 1998. |
| [GHR97] | A. Gupta, V. Harinarayan, and A. Rajaraman. Virtual database technology. SIGMOD Record, 26(4):57–61, December 1997. |
| [HGMC+97] | J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting semistructured information from the Web. In Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997. |
| [Hom00] | Home Page for BYU Data Extraction Group, 2000. URL: http://www.deg.byu.edu. |
| [KWD97] | N. Kushmerick, D.S. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proceedings of the 1997 International Joint Conference on Artificial Intelligence, pages 729–735, 1997. |
| [LCF+94] | W. Lehnert, C. Cardie, D. Fisher, J. McCarthy, E. Riloff, and S. Soderland. Evaluating an information extraction system. Journal of Integrated Computer-Aided Engineering, 1(6), 1994. |
| [MAG+97] | J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. SIGMOD Record, 26(3):54–66, September 1997. |
| [MD99] | D. Maier and L. Delcambre. Superimposed information for the Internet. In S. Cluet and T. Milo, editors, Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB'99), Philadelphia, Pennsylvania, June 1999. |
| [MMK98] | I. Muslea, S. Minton, and C. Knoblock. STALKER: Learning extraction rules for semistructured, Web-based information sources. In Proceedings of AAAI'98: Workshop on AI and Information Integration, Madison, Wisconsin, July 1998. |
| [NM00] | U.Y. Nahm and R.J. Mooney. A mutually beneficial integration of data mining and information extraction. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI-00), Austin, Texas, 2000. Submitted. |
| [SA99] | A. Sahuguet and F. Azavant. Looking at the Web through XML glasses. In Proceedings of the Fourth International Conference on Cooperative Systems (CoopIS'99), Edinburgh, Scotland, UK, September 1999. |
| [SL97] | D. Smith and M. Lopez. Information extraction for semi-structured documents. In Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997. |
| [Sod97] | S. Soderland. Learning to extract text-based information from the World Wide Web. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 251–254, Newport Beach, California, August 1997. |
| [SP90] | G. Shafer and J. Pearl, editors. Readings in Uncertain Reasoning. Morgan Kaufmann, Los Altos, California, 1990. |