Data Crawling by Link Semantics
- Jiuming Huang (v-jihuan @ microsoft . com)
- Haixun Wang (haixunw @ microsoft . com)
- Information is hidden in user interaction. A growing number of web sites support exploratory search. The underlying data has many features, and users explore the data by forming queries through web interaction. A detailed page that shows a record in the query result, however, contains only a subset of those features. We find that, on average, over 80% of features are embedded in user interaction, while fewer than 40% appear on the detailed pages. The challenge is thus to recover the missing features so that web crawling can find a more complete set of features for the data.
Fig. 1. Detailed page of a product on zappos.com
Fig. 2. Attribute/values of the product (those in red are missing in the detailed page.)
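The idea of recovering missing features from user interaction can be illustrated with a minimal sketch. It assumes the crawler has already recorded, for each attribute/value link a user can click, which entities appear in the resulting listing. The function name `merge_features`, the entity id `shoe-1`, and the sample attributes are hypothetical and are not taken from the project's own code or data.

```python
from collections import defaultdict

def merge_features(detail_pages, facet_listings):
    """Combine attributes shown on detailed pages with those implied by links.

    detail_pages:   {entity_id: {attribute: value}} scraped from detailed pages
    facet_listings: {(attribute, value): [entity_id, ...]} listing the entities
                    reached by following the link for that attribute/value
    """
    merged = defaultdict(dict)
    for entity, attrs in detail_pages.items():
        merged[entity].update(attrs)
    for (attribute, value), entities in facet_listings.items():
        for entity in entities:
            # Keep the detailed-page value if one exists; otherwise
            # fill the gap with the value implied by the link.
            merged[entity].setdefault(attribute, value)
    return dict(merged)

# Hypothetical data: the detailed page lacks "heel height",
# but the entity is listed under the "heel height: flat" link.
detail_pages = {"shoe-1": {"brand": "Acme", "color": "black"}}
facet_listings = {
    ("heel height", "flat"): ["shoe-1"],
    ("color", "black"): ["shoe-1"],
}
record = merge_features(detail_pages, facet_listings)["shoe-1"]
print(record)
```

This sketch stores a single value per attribute for simplicity; an attribute with several values per entity would need a set of values instead.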
- Traditional crawling is wasteful. Traditional crawling focuses on individual web pages, or on how web sites present their data, instead of on the data itself. In other words, we are crawling the presentation instead of the data that drives the presentation. When data semantics is ignored, we may waste a lot of effort on pages that contain no new information. The number of such pages can be orders of magnitude larger than the number of useful ones.
$200 & Above --> Handbags:
Handbags --> $200 & Above:
Rice --> Stews:
Stews --> Rice:
Figure 3: Different URLs for equivalent queries
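One way to avoid re-crawling equivalent queries is to key each query by the set of selections it contains rather than by its URL. The sketch below assumes the selections along a navigation path can be extracted from the links; the class name `QueryDeduper` and its method are hypothetical, and this is an illustration rather than the authors' implementation.

```python
class QueryDeduper:
    """Track exploratory queries by their selections, ignoring click order."""

    def __init__(self):
        self._seen = set()

    def should_crawl(self, selections):
        # A frozenset is order-independent, so "Rice then Stews" and
        # "Stews then Rice" map to the same key.
        key = frozenset(selections)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

deduper = QueryDeduper()
first = deduper.should_crawl(["Handbags", "$200 & Above"])
second = deduper.should_crawl(["$200 & Above", "Handbags"])
print(first, second)
```

Here the second call returns `False` because both paths select the same two values, so the result page would carry no new information.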
- We conducted a comprehensive survey of 443 highly ranked web sites in 20 major domains (e.g., shopping, travel, real estate, etc.). Figure 4 shows that 127 out of the 443 web sites (28.7%) use exploratory search. In particular, shopping web sites are the most aggressive in adopting the exploratory model: 97 out of the top 133 e-commerce web sites (72.9%) use exploratory search.
A crawler that ignores the exploratory search semantics and focuses on the final page of each entity will miss a lot of information about the entity. Our survey showed that among the 127 web sites that use exploratory search, only 29 (23%) put complete information about each entity on the entity page. The large majority (77%) provide incomplete information.
Fig. 4. Survey of 443 highly ranked web sites
- Technical report: Please refer to our publications.
Data sets and result statistics:
- We selected 6 web sites from the 107 exploratory sites as our experimental datasets. The following table shows the statistics of the results for the 6 web sites. For each dataset, we list the number of discovered entities, attributes, and entity-attribute pairs.
| Website | Type | Subcategory | # of uncovered entities | # of uncovered attributes | # of uncovered entity-attribute pairs |
|---|---|---|---|---|---|
| homes.com | Real Estate | Subset of New York | 32 | 34 | 336 |
| food.com | Food | Entire Site | 434,854 | 476 | 5,542,792 |
| menupage.com | Restaurant | American New | 560 | 87 | 57,475 |
| amazon.com | Shopping | Shoes | 248,995 | 156 | 1,077,444 |
| shopping.yahoo.com | Shopping | Electronics | 115,748 | 1,165 | 649,522 |
| zappos.com | Shopping | Entire Site | 124,839 | 1,951 | 2,208,052 |
What the results look like:
- Menupage (508 objects, 5000 object/attribute pairs)
- Zappos (237 objects, 5000 object/attribute pairs)
- Amazon (1762 objects, 5000 object/attribute pairs)
Download all the results:
- Zappos (11 MB compressed, 216 MB uncompressed)
- Amazon (22 MB compressed, 447 MB uncompressed)