PI: David W. Embley, Computer Science Department
Co-PI: Douglas M. Campbell (deceased)
Co-PI: Stephen W. Liddle, School of Accountancy and Information Systems
Co-PI: Deryle G. Lonsdale, Linguistics Department
Co-PI: Yiu Kai (Dennis) Ng, Computer Science Department
Brigham Young University
Prof. David W. Embley
Computer Science Department
Brigham Young University
Provo, UT 84602
Phone: 801-422-6470
Fax: 801-378-7775
Email: embley@cs.byu.edu
Ontology-based data and information extraction and integration; source/target matching and mapping.
With ever-growing volumes of data in widely varying formats, there is a need to sift and funnel information to users to meet their own specific requirements. This project addresses the challenge of finding, extracting, and delivering appropriate data by developing a versatile framework that is target-based (i.e., based on a user's description of the desired information) and document-independent (i.e., robust, not failing whenever documents change or when new documents of interest are encountered). A combination of document-related clues regarding textual content as well as geometric and organizational layout enables processing across various document formats. Developers and users specify areas of interest via descriptive ontologies (i.e., declarations of information types and concept relationships). These ontologies facilitate reformulating, matching, and merging retrieved information. The result of these efforts will be a comprehensive infrastructure to expertly extract, automatically organize, and succinctly summarize critical information in a queryable personalized view. An online repository will contain research results, downloadable software (including source code), and a Web interface enabling user access to the various tools and engines developed. Potentially, this technology can be embedded in personal agents; leveraged in customized search, filtering, and extraction tools; and used to provide tailored views of data via integration, organization, and summarization.
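As a hedged illustration of what a descriptive ontology declares, the sketch below models a miniature car-ads domain (one of our test applications) as object sets with regular-expression recognizers plus relationship sets. All names and patterns here are invented for this example and are not the project's actual ontology language.

```python
import re

# Hypothetical miniature application ontology for car ads. The object-set
# names, recognizers, and relationship sets are illustrative stand-ins.
CAR_ADS_ONTOLOGY = {
    "object_sets": {
        "Year": re.compile(r"\b(?:19[7-9]\d|20\d\d)\b"),
        "Price": re.compile(r"\$\d{1,3}(?:,\d{3})*"),
        "Mileage": re.compile(r"\b\d{1,3}[kK]\b"),
    },
    "relationship_sets": [
        ("Car", "has", "Year"),
        ("Car", "costs", "Price"),
        ("Car", "driven", "Mileage"),
    ],
}

def recognize(text):
    """Classify atomic data values by running each object set's recognizer."""
    return {name: rx.findall(text)
            for name, rx in CAR_ADS_ONTOLOGY["object_sets"].items()}

ad = "1998 Honda Accord, 85k miles, $6,500, one owner."
```

The relationship sets are what later allow matched values to be assembled into a record for a single car rather than left as isolated fragments.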
Within the area of wrapper generation, our work on ontology-based information extraction has become widely known. Our unique contribution is to produce resilient application-specific wrappers that do not "break" when a Web site changes and that apply, without alteration, to new Web sites for the application that come online. Our recall and precision results are as good as, and usually better than, the results obtained by others working in the area.
We believe that our contribution to schema matching will also have an impact. Currently, we are just beginning to publish some of these results. In comparison with others, we provide two contributions: (1) Our automatically generated direct schema-element matching procedures yield higher accuracy than any others we have seen. (2) We are able to produce some automatically generated indirect schema-element matches.
Our contributions include the following:
We presented an approach for recognizing which multiple-record Web documents apply to an ontology. Once an application ontology is created, we can train a machine-learning algorithm over a triple of heuristics (density, expected-values, grouping) to produce a decision tree that accurately recognizes multiple-record documents for the ontology. In the tests we conducted, F-measures were above 95%, with recall and precision above 90%, for both of our applications.
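A minimal sketch of the idea follows, with toy recognizers and invented thresholds standing in for the learned decision tree; it is not the project's code.

```python
import re

# Toy recognizers for a two-concept ontology; real applications use richer
# recognizer sets. All thresholds below are invented for illustration.
RECOGNIZERS = {
    "Year": re.compile(r"\b19[7-9]\d\b"),
    "Price": re.compile(r"\$\d[\d,]*"),
}

def heuristics(text, expected_per_record=2):
    hits = [m for rx in RECOGNIZERS.values() for m in rx.finditer(text)]
    # density: fraction of characters covered by recognized values
    density = sum(len(m.group()) for m in hits) / max(len(text), 1)
    # expected-values: fraction of the ontology's object sets seen in the text
    expected = sum(1 for rx in RECOGNIZERS.values()
                   if rx.search(text)) / len(RECOGNIZERS)
    # grouping: recognized values relative to the size of one expected record
    grouping = len(hits) / expected_per_record
    return density, expected, grouping

def is_multiple_record(text):
    # Hand-built stand-in for the trained decision tree.
    d, e, g = heuristics(text)
    if d < 0.02:   # too little recognized content
        return False
    if e < 0.5:    # too few of the ontology's concepts appear
        return False
    return g >= 2  # enough values for at least two records

ads = "1995 Civic $4,000. 1998 Accord $6,500. 1997 Camry $5,900."
memo = "Meeting moved to Tuesday; please bring the quarterly report."
```

Running the classifier accepts the classified-ads page and rejects the memo, mirroring the intended multiple-record/non-applicable split.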
We presented a framework for discovering direct matches between sets of source and target attributes. In our framework, multiple facets each contribute evidence, which we combine to produce a final set of matches. The facets considered include terminological relationships such as synonyms and hypernyms, data-value characteristics such as variances and string lengths, expected values as declared by target regular-expression recognizers, and structural characteristics based on scheme graphs.
The results are encouraging and show that the multifaceted approach to exploiting metadata for attribute matching has promise. When we used all four facets for our car-ads tests we obtained recall and precision results above 90%. Recall and precision dropped off when we reduced the number of facets to either single independent facets or to our terminological facet followed by our structural test.
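The facet-combination idea can be sketched as follows. Only two facets are shown (terminological and a crude data-value characteristic), and the facet implementations, synonym set, and weights are invented placeholders, not our framework's actual procedures.

```python
# Hedged sketch of multifaceted attribute matching over invented data.

def terminological(a, b, synonyms):
    # 1.0 if the attribute names match directly or via a declared synonym pair
    return 1.0 if a == b or (a, b) in synonyms or (b, a) in synonyms else 0.0

def data_value(vals_a, vals_b):
    # compare average string lengths of sample values, one simple
    # data-value characteristic
    avg = lambda vs: sum(len(v) for v in vs) / len(vs)
    la, lb = avg(vals_a), avg(vals_b)
    return min(la, lb) / max(la, lb)

def combined(a, b, vals_a, vals_b, synonyms, w_term=0.6, w_data=0.4):
    # each facet contributes evidence; the evidence is combined into one score
    return (w_term * terminological(a, b, synonyms)
            + w_data * data_value(vals_a, vals_b))

SYNONYMS = {("Cost", "Price")}
match_score = combined("Cost", "Price",
                       ["$4,000", "$6,500"], ["$5,900"], SYNONYMS)
```

A pair supported by several facets (here, a synonym match plus similar value lengths) scores well above a pair supported by only one, which is the intuition behind combining facets rather than relying on any single one.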
We described our domain-independent approach for automatically retrieving the data behind a given Web form. We have prototyped a synergistic tool that brings the user into the process when an automatic decision is hard to make. We use a two-phase approach to gathering data: first we sample the responses from the Web site of interest, and then, if necessary, we methodically try all possible queries (until either we believe we have arrived at a fixpoint of retrieved data, or we have reached some other stopping threshold, or we have exhausted all possible queries).
We have created a prototype (mostly in Java, but also using JavaScript, PHP, and Perl) to test our ideas, and the initial results are encouraging. We have been successful with a number of Web sites, and we are continuing to study ways to improve our tool.
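The second (exhaustive) phase can be sketched as a loop over all field-value combinations with the stopping conditions described above. The form interface and data below are invented stand-ins for a real Web form, and the patience-based fixpoint test is one simple way to realize the "believe we have arrived at a fixpoint" condition.

```python
from itertools import product

# Invented stand-in for a Web form: `query` maps one filled-in form
# (a tuple of field-value assignments) to the records the site returns.
FAKE_SITE = {
    ("make=honda", "year=1998"): {"rec1", "rec2"},
    ("make=toyota", "year=1998"): {"rec3"},
    ("make=honda", "year=1997"): {"rec2"},
    ("make=toyota", "year=1997"): set(),
}

def query(assignment):
    return FAKE_SITE.get(assignment, set())

def retrieve_all(fields, max_queries=100, fixpoint_patience=10):
    """Methodically try all field-value combinations until we believe we
    have reached a fixpoint of retrieved data, hit another stopping
    threshold, or exhaust the query space."""
    results = set()
    stale = 0
    for n, assignment in enumerate(product(*fields)):
        if n >= max_queries:
            break                      # other stopping threshold
        new = query(assignment) - results
        if new:
            results |= new
            stale = 0
        else:
            stale += 1
            if stale >= fixpoint_patience:
                break                  # assume a fixpoint of retrieved data
    return results

fields = [("make=honda", "make=toyota"), ("year=1998", "year=1997")]
```

In the synergistic tool, a human would be consulted at the points where this sketch silently applies a threshold.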
We described an algorithmic process for automatically identifying and extracting records found in microfilmed genealogical tables. Our table-processing algorithm accepts as input an XML file describing the individual cells of a genealogical table taken from microfilm, and it produces SQL statements that insert the coordinates of the table's value cells into a database. Two key features drive the algorithm: (1) geometric layout and (2) label matching with respect to a given genealogical ontology. The algorithm operates in three steps: (1) it extracts features from the cells described in the XML input file; (2) it applies correlation rules that relate and update the collected evidence; and (3) it produces records for human users to verify, along with the SQL insert statements that enter the coordinates of table cells into a relational database. On our test corpus of microfilm tables, the algorithm achieved a precision of 93%, a recall of 92%, and an accuracy of 92% on the database fields it populated.
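A hedged sketch of the input-to-output shape of this pipeline follows, with an invented XML schema, label set, and table name; the real algorithm's correlation rules and geometric reasoning are far richer than the single header-column rule used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell descriptions for one small genealogical table.
XML = """<table>
  <cell row="0" col="0" text="Name"/>
  <cell row="0" col="1" text="Birth Date"/>
  <cell row="1" col="0" text="John Smith"/>
  <cell row="1" col="1" text="12 Mar 1861"/>
</table>"""

LABELS = {"Name", "Birth Date"}  # stand-in genealogical ontology labels

def cells_to_sql(xml_text):
    root = ET.fromstring(xml_text)
    cells = [(int(c.get("row")), int(c.get("col")), c.get("text"))
             for c in root.iter("cell")]
    # label matching: header cells whose text appears in the ontology lexicon
    header = {col: text for row, col, text in cells
              if row == 0 and text in LABELS}
    stmts = []
    for row, col, text in cells:
        # geometric layout: value cells lie below their labels
        if row > 0 and col in header:
            stmts.append(
                "INSERT INTO value_cell (row, col, label, text) "
                f"VALUES ({row}, {col}, '{header[col]}', '{text}');")
    return stmts
```

Each emitted statement pairs a value cell's coordinates with its matched label, which is the record structure a human verifier would then confirm.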
Our objective for TIDIE is to extract information from data-rich, semistructured documents and structure the information with respect to a given target description. Our concept of a semistructured document encompasses the notion of semistructured data, which [ABS00] defines as being "schemaless" but "self-describing" and representable by a variant of OEM (the Object Exchange Model [CGMH+94, ACC+97, MAG+97]). Starting with this notion of data semistructuredness, we enlarge it to include any document where self-descriptive clues have two properties: (1) they are sufficient to match attributes and values and (2) they are sufficient to allow these attribute-value pairs to be assembled into meaningful chunks of information representable by OEM. Semistructured documents run from the high end, where attribute-value pairs and their organization are given, to the low end, where the clues are subtle and depend on a high degree of human understanding to assemble and organize attribute-value pairs. In TIDIE we exploit these human-understandable, self-descriptive clues to classify atomic data values and to organize molecular record structures. Further, we seek to exploit these clues in a document-independent way, so that our techniques apply robustly over the full range of semistructured documents.
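As a hedged illustration of the OEM-style target, the sketch below assembles extracted attribute-value pairs into one molecular record; the labels are invented for this example, and nested (label, value) pairs stand in for OEM's labeled-graph representation.

```python
# A molecular record as nested (label, value) pairs: a value is either an
# atomic string or a list of child pairs.
record = ("car-ad", [
    ("Year", "1998"),
    ("Make", "Honda"),
    ("Price", "$6,500"),
])

def oem_to_text(node, indent=0):
    """Render an OEM-style node as indented text."""
    label, value = node
    pad = "  " * indent
    if isinstance(value, list):  # molecular structure: recurse into children
        return pad + label + "\n" + "\n".join(
            oem_to_text(child, indent + 1) for child in value)
    return f"{pad}{label}: {value}"  # atomic attribute-value pair
```

The point of the representation is that the same structure serves both low-end documents (where the pairs were hard to find) and high-end documents (where they were given), once extraction has done its work.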
We classify the particular self-descriptive clues we wish to exploit in TIDIE as being linguistic, geometric, ontological, and metatextual.
TIDIE project assumptions make our task tractable, but we are careful that our assumptions do not unreasonably diminish the range of applicability. We assume that the target descriptions are ontologically narrow and that the documents we process are data rich and semistructured. These three notions defy a precise definition, but are bounded as follows.
Target descriptions are ontologically narrow if the conceptual-model instance describing objects, relationships, and constraints is "small." "Small" means that the conceptual model has a half dozen to a few dozen object sets (attributes), about the same number (or a few more) relationship sets (connections among the attributes), and several dozen constraints. The conceptual models we used in our initial experiments were ontologically narrow.
Documents are data rich if they contain "many" attribute-value pairs. "Many" means that we can populate at least a few (say a half dozen or more) attribute-value pairs and their relationships in an ontologically narrow target description.
Documents are semistructured if they contain sufficient linguistic, geometric, ontological, and metatextual clues to allow human readers to extract atomic attribute-value pairs and organize them into molecular record structures. Documents that are chaotic, ambiguous, or contain literary imagery are outside the scope of TIDIE.
Future directions:
As a "vision of possibility," we can see that the technology we are developing can be embedded in personal agents; in customized search, filtering, and extraction tools; and in tools that provide individually tailored views—integrated, organized, and summarized to meet individual or organizational needs.
Our work differs fundamentally from the approach others have taken, basically because we provide a document-independent target description. The most common approach to information extraction from the Web has been through page-specific wrappers, written by hand [CGMH+94, AM97, GHR97] or written using a variety of techniques, including hand-written with the aid of a toolkit [SA99], hand-coded specialized grammars [ACC+97], wrapper generators based on HTML and other formatting information [AK97, HGMC+97], page grammars [AMM97], landmark grammars [MMK98], concept definition frames [SL97], or some form of supervised learning [Ade98, AK97, DEW97, KWD97, Sod97, Fre98, CDF+98]. A disadvantage of these wrapper-generation techniques is the work required to create the initial wrapper (a disadvantage we also share in the sense that we have to create a target description), and the rework required to update the wrapper when the source document changes (a disadvantage we do not share).
The approach of [SL97], which uses "concept definition frames," and that of [CDF+98], which uses "an ontology describing classes and relations," are closest to ours. Our notion of a "data frame" [Emb80] is similar to a "concept definition frame" in [SL97], but embodies a richer description of the data to be recognized and extracted. Our notion of an "ontology" is similar to an "ontology" in [CDF+98], but goes much further in describing the application of interest. The work reported in [Bri98] is also similar to ours in the sense that it is robust with respect to source-document changes. The technique in [Bri98], which extracts author/title pairs, requires very little supervision for the machine-learning approach it takes and need not be altered for new pages or when pages change. This approach, however, appears to be limited to very small, tightly coupled application domains such as the author/title pairs for which it was used.
Another approach that has been used for information extraction is natural language processing (NLP) [LCF+94, CL96, Sod97]. NLP approaches involve tokenization, part-of-speech and sense tagging, construction of syntactic and semantic structures and relationships, and production of a coherent framework for extracted information fragments. Our work does not attempt to understand text in the deep NLP sense; consequently it does not depend on sentential elements (as deep NLP approaches do), which are often missing from Web pages of classified ads and from partially formatted data found in forms and census records.
Our approach uses a specific target description, but we are not the only ones who have suggested target descriptions. With a somewhat similar objective in mind, [DMRA97, MD99, DM99] present Structured Maps as a modeling construct imposed over Web information sources. Similar to our target description, a semantic model provides a scheme over a domain of interest, which is then populated with information elements from the Web. In another effort with a similar objective, [AMM97] introduces a data model that describes the scheme for a user view over information on the Web, along with a set of languages for synthesizing the scheme for a particular application and for managing and restructuring data with respect to the scheme. Our work differs from these other efforts because they do not attempt to populate their model instances automatically, populating them instead by hand or with the aid of tools.
| [ABS00] | S. Abiteboul, P. Buneman, and D. Suciu. Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann Publishers, San Francisco, California, 2000. |
| [ACC+97] | S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, and J. Simeon. Querying documents in object databases. International Journal on Digital Libraries, 1(1):5–19, April 1997. |
| [Ade98] | B. Adelberg. NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 283–294, Seattle, Washington, June 1998. |
| [AK97] | N. Ashish and C. Knoblock. Semi-automatic wrapper generation for Internet information sources. In Proceedings of CoopIS'97, 1997. |
| [AM97] | P. Atzeni and G. Mecca. Cut and paste. In Proceedings of the 16th ACM PODS, pages 144–153, May 1997. |
| [AMM97] | P. Atzeni, G. Mecca, and P. Merialdo. To weave the Web. In Proceedings of the Twenty-third International Conference on Very Large Data Bases, pages 206–215, Athens, Greece, August 1997. |
| [Bri98] | S. Brin. Extracting patterns and relations from the World Wide Web. In Proceedings of the WebDB Workshop (at EDBT'98), 1998. |
| [CDF+98] | M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), pages 509–516, Madison, Wisconsin, July 1998. |
| [CGMH+94] | S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom. The TSIMMIS project: Integration of heterogeneous information sources. In IPSJ Conference, pages 7–18, Tokyo, Japan, October 1994. |
| [CL96] | J. Cowie and W. Lehnert. Information extraction. Communications of the ACM, 39(1):80–91, January 1996. |
| [DEW97] | R.B. Doorenbos, O. Etzioni, and D.S. Weld. A scalable comparison-shopping agent for the World-Wide Web. In Proceedings of the First International Conference on Autonomous Agents, pages 39–48, Marina Del Rey, California, February 1997. |
| [DM99] | L. Delcambre and D. Maier. Models for superimposed information. In P.P. Chen, D.W. Embley, J. Kouloumdjian, S.W. Liddle, and J.F. Roddick, editors, Advances in Conceptual Modeling: Proceedings of the Workshop on the World Wide Web and Conceptual Modeling (WWWCM'99), volume LNCS 1727, pages 264–280, Paris, France, November 1999. Springer Verlag. |
| [DMRA97] | L.M.L. Delcambre, D. Maier, R. Reddy, and L. Anderson. Structured maps: Modeling explicit semantics over a universe of information. International Journal on Digital Libraries, 1(1):20–35, April 1997. |
| [ECJ+98] | D.W. Embley, D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith, S.W. Liddle, and D.W. Quass. A conceptual-modeling approach to extracting data from the Web. In Proceedings of the 17th International Conference on Conceptual Modeling (ER'98), pages 78–91, Singapore, November 1998. |
| [Emb80] | D.W. Embley. Programming with data frames for everyday data items. In Proceedings of the 1980 National Computer Conference, pages 301–305, Anaheim, California, May 1980. |
| [Fel98] | C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts, 1998. |
| [Fre98] | D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In Proceedings of AAAI/IAAI, pages 517–523, 1998. |
| [GHR97] | A. Gupta, V. Harinarayan, and A. Rajaraman. Virtual database technology. SIGMOD Record, 26(4):57–61, December 1997. |
| [HGMC+97] | J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting semistructured information from the Web. In Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997. |
| [Hom00] | Home Page for BYU Data Extraction Group, 2000. URL: http://www.deg.byu.edu. |
| [KWD97] | N. Kushmerick, D.S. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proceedings of the 1997 International Joint Conference on Artificial Intelligence, pages 729–735, 1997. |
| [LCF+94] | W. Lehnert, C. Cardie, D. Fisher, J. McCarthy, E. Riloff, and S. Soderland. Evaluating an information extraction system. Journal of Integrated Computer-Aided Engineering, 1(6), 1994. |
| [MAG+97] | J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database management system for semistructured data. SIGMOD Record, 26(3):54–66, September 1997. |
| [MD99] | D. Maier and L. Delcambre. Superimposed information for the Internet. In S. Cluet and T. Milo, editors, Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB'99), Philadelphia, Pennsylvania, June 1999. |
| [MMK98] | I. Muslea, S. Minton, and C. Knoblock. STALKER: Learning extraction rules for semistructured, Web-based information sources. In Proceedings of AAAI'98: Workshop on AI and Information Integration, Madison, Wisconsin, July 1998. |
| [NM00] | U.Y. Nahm and R.J. Mooney. A mutually beneficial integration of data mining and information extraction. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI-00), Austin, Texas, 2000. Submitted. |
| [SA99] | A. Sahuguet and F. Azavant. Looking at the Web through XML glasses. In Proceedings of the Fourth International Conference on Cooperative Systems (CoopIS'99), Edinburgh, Scotland, UK, September 1999. |
| [SL97] | D. Smith and M. Lopez. Information extraction for semi-structured documents. In Proceedings of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997. |
| [Sod97] | S. Soderland. Learning to extract text-based information from the World Wide Web. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 251–254, Newport Beach, California, August 1997. |
| [SP90] | G. Shafer and J. Pearl, editors. Readings in Uncertain Reasoning. Morgan Kaufmann, Los Altos, California, 1990. |