Record-Boundary Discovery in Web Documents


Table of Contents

Record-Boundary Discovery (Larger Goal: Information Extraction)

Desired Objective: Query the Web Like a Database

Approach and Limitations: Automatic Ontology-Based Wrapper Generation

Application Ontology: Object-Relationship Model Instance

Application Ontology: Data Frames

Ontology Parser

Record Extractor

Record Extractor: High Fan-Out Heuristic

Record Extractor: Record-Separator Heuristics

IT: Identifiable “HTML separator” Tags

HT: Highest-Count Tags

SD: Standard Deviation

OM: Ontological Match

RP: Repeating-Tag Patterns

Record Extractor: Consensus Heuristic

Record Extractor: Example Consensus Heuristic

Record Extractor: Results

Constant/Keyword Recognizer

Database Instance Generator

Recall & Precision

Results: Car Ads

Car Ads: Comments

Results: Computer Job Ads

Results: Obituaries



