Extracting and Structuring Web Data


Table of Contents

Extracting and Structuring Web Data

GOAL: Query the Web like we query a database

PROBLEM: The Web is not structured like a database

Making the Web Look Like a Database

Automatic Wrapper Generation

Application Ontology: Object-Relationship Model Instance

Application Ontology: Data Frames

Ontology Parser

Record Extractor

Record Extractor: High Fan-Out Heuristic

Record Extractor: Record-Separator Heuristics

Record Extractor: Consensus Heuristic

Record Extractor: Results

Constant/Keyword Recognizer


Keyword Proximity

Subsumed/Overlapping Constants

Functional Relationships

Nonfunctional Relationships

First Occurrence without Constraint Violation

Database-Instance Generator

Recall & Precision

Results: Car Ads

Car Ads: Comments

Results: Computer Job Ads

Obituaries (A More Demanding Application)

Obituary Ontology

Data Frames: Lexicons & Specializations

Keyword Heuristics: Singleton Items

Keyword Heuristics: Multiple Items

Results: Obituaries


Author: David W. Embley

Home Page: http://osm7.cs.byu.edu/CS751R/CS751R.html