Data Crawling by Link Semantics, WebQ


  • Jiuming Huang (v-jihuan @ microsoft . com)
  • Haixun Wang (haixunw @ microsoft . com)

The problem:

  • Information is hidden in user interaction. A growing number of web sites support exploratory search. The underlying data has many features, and users explore it by forming queries through web interaction. However, a detailed page that shows a single record from the query result contains only a subset of those features. We find that, on average, over 80% of the features are embedded in user interaction, while fewer than 40% appear on the detailed pages. The challenge is thus to recover the missing features so that web crawling can find a more complete set of features for the data.

Fig. 1. Detailed page of a product on

Fig. 2. Attribute/value pairs of the product (those in red are missing from the detailed page).


  • We conducted a comprehensive survey of 443 highly ranked web sites in 20 major domains (e.g., shopping, travel, and real estate). Figure 4 shows that 127 of the 443 web sites (28.7%) use exploratory search. Shopping web sites are the most aggressive in adopting the exploratory model: 97 of the top 133 e-commerce web sites (72.9%) use exploratory search.

    A crawler that ignores the exploratory search semantics and focuses only on the final page of each entity will miss much of the information about that entity. Our survey showed that among the 127 web sites that use exploratory search, only 29 (23%) put complete information about each entity on the entity page. A large majority (77%) contain incomplete information.

    Fig. 4. Survey of 443 highly ranked web sites
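To make the gap between entity pages and exploratory links concrete, here is a minimal Python sketch. It is not the method used in this project; it only illustrates the general idea that following a refinement link such as "color = black" implies that attribute/value pair for every entity in the result list. The function names, the input dictionaries, and the sample shoe data are all hypothetical, and the sketch assumes the pages have already been fetched and parsed.

```python
from collections import defaultdict


def merge_features(detail_pages, facet_results):
    """Combine attributes from detail pages with those implied by facet links.

    detail_pages:  entity id -> {attribute: value} scraped from the entity page
                   (hypothetical, already-parsed input).
    facet_results: (attribute, value) -> list of entity ids returned when that
                   refinement link is followed (hypothetical input).
    Returns entity id -> {attribute: set of values}.
    """
    merged = defaultdict(lambda: defaultdict(set))
    for entity, attrs in detail_pages.items():
        for attribute, value in attrs.items():
            merged[entity][attribute].add(value)
    for (attribute, value), entities in facet_results.items():
        for entity in entities:
            # Every entity listed under a facet link carries that pair.
            merged[entity][attribute].add(value)
    return {e: {a: set(v) for a, v in attrs.items()} for e, attrs in merged.items()}


def missing_from_detail(detail_pages, merged):
    """List, per entity, the attributes that never appear on its detail page."""
    missing = {}
    for entity, attrs in merged.items():
        on_page = detail_pages.get(entity, {})
        gaps = sorted(a for a in attrs if a not in on_page)
        if gaps:
            missing[entity] = gaps
    return missing


# Made-up sample data for illustration only.
detail_pages = {
    "shoe-1": {"brand": "Acme"},
    "shoe-2": {"brand": "Acme", "color": "black"},
}
facet_results = {
    ("color", "black"): ["shoe-1", "shoe-2"],
    ("width", "wide"): ["shoe-1"],
}

merged = merge_features(detail_pages, facet_results)
print(missing_from_detail(detail_pages, merged))
```

In this toy input, "shoe-1" gains a color and a width that its detail page never states, which is the kind of feature a crawler limited to entity pages would miss.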


Our solution:

Data sets and result statistics:

  • We selected 6 web sites from the 107 exploratory sites as our experimental datasets. The following table shows the result statistics for the 6 web sites. For each dataset, we list the number of discovered entities, attributes, and entity-attribute pairs.
    Type         Subcategory         # of uncovered entities  # of uncovered attributes  # of uncovered entity-attributes
    Real Estate  Subset of New York                       32                         34                               336
    Food         Entire Site                         434,854                        476                         5,542,792
    Restaurant   American New                            560                         87                            57,475
    Shopping     Shoes                               248,995                        156                         1,077,444
    Shopping     Electronics                         115,748                      1,165                           649,522
    Shopping     Entire Site                         124,839                      1,951                         2,208,052
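One way to read these counts is as attribute density, that is, the number of entity-attribute pairs per entity. The short Python sketch below derives that ratio from the figures reported in the table; the `rows` list and the `pairs_per_entity` helper are names introduced here for illustration, and the ratio is a simple derived quantity rather than a statistic reported by the project.

```python
# (type, subcategory, entities, attributes, entity-attribute pairs),
# copied from the table above.
rows = [
    ("Real Estate", "Subset of New York", 32, 34, 336),
    ("Food", "Entire Site", 434_854, 476, 5_542_792),
    ("Restaurant", "American New", 560, 87, 57_475),
    ("Shopping", "Shoes", 248_995, 156, 1_077_444),
    ("Shopping", "Electronics", 115_748, 1_165, 649_522),
    ("Shopping", "Entire Site", 124_839, 1_951, 2_208_052),
]


def pairs_per_entity(entities, pairs):
    """Average number of attribute values recovered per entity."""
    return pairs / entities


for site_type, subcategory, entities, attributes, pairs in rows:
    density = pairs_per_entity(entities, pairs)
    print(f"{site_type:<12} {subcategory:<20} {density:8.1f}")
```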

What the results look like:

  • Menupage (508 objects, 5000 object/attribute pairs)
  • Zappos (237 objects, 5000 object/attribute pairs)
  • Amazon (1762 objects, 5000 object/attribute pairs)

Download all the results:

  • Zappos (11 MB compressed, 216 MB uncompressed)
  • Amazon (22 MB compressed, 447 MB uncompressed)
