InSite Live

InSite is a tool for visualizing the structure of a Web site. It helps visitors search and browse the site by identifying its subsites and displaying the topics they cover, so that users can find pages of interest. It also lets Web site administrators learn how users interact with their site and how to improve its organization.

Link Structure Graph (LSG)

The Link Structure Graph model provides a new representation of the Web hyperlink structure based on link blocks. It captures the organization of links at the page level and the overall link structure of the site. The graph includes several types of link blocks:

 Link blocks
  • Structural link blocks: blocks repeated across pages, typically navigation menus
  • Content link blocks: blocks often grouped by topic association and unlikely to be repeated across pages
  • Isolated links: links found in the body of the text

 

Algorithm for LSG Generation

  • Step 1 – Page layout analysis

Parse the HTML Document Object Model (DOM) structure of each individual page.
At each DOM level, look for lists of consecutive links. Each such list is a candidate link block.
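
Step 1 can be illustrated with a minimal sketch that uses only the Python standard library. It is not the InSite implementation. The names `Node`, `DomBuilder`, `find_link_blocks` and the `min_links` threshold are hypothetical, and the rule that a child counts as a link item when its subtree holds exactly one link is an assumption made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "area", "base", "col", "wbr"}


class Node:
    """A very small DOM node: tag name, attributes, parent and children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []

    def path(self):
        # DOM path from the root, e.g. "document/html/body/ul"
        parts, node = [], self
        while node is not None:
            parts.append(node.tag)
            node = node.parent
        return "/".join(reversed(parts))

    def links(self):
        # All link targets found in this subtree, in document order
        found = [self.attrs["href"]] if self.tag == "a" and self.attrs.get("href") else []
        for child in self.children:
            found.extend(child.links())
        return found


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Non-blank text becomes a "#text" node so that it interrupts a run of links
        if data.strip():
            self.current.children.append(Node("#text", parent=self.current))


def find_link_blocks(html, min_links=2):
    """Return (dom_path, targets) for each run of consecutive link items."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    blocks = []

    def visit(node):
        run = []
        for child in node.children + [None]:  # None is a sentinel that flushes the last run
            child_links = child.links() if child is not None else []
            if len(child_links) == 1:  # assumption: one link in the subtree = one link item
                run.append(child_links[0])
                continue
            if len(run) >= min_links:
                blocks.append((node.path(), tuple(run)))
            run = []
        for child in node.children:
            visit(child)

    visit(builder.root)
    return blocks
```

Under these assumptions, a `<ul>` whose items each wrap one link yields a single candidate block, while a lone link inside a paragraph of text yields none, which roughly matches the distinction between link blocks and isolated links.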

  • Step 2 – Link block classification

Compare candidate blocks across pages for similarity, using each block's DOM path and its set of target pages.
Classify the link blocks into s-nodes (structural link blocks) and c-nodes (content link blocks) based on their reusability across pages.
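
One way to picture Step 2 is the sketch below. The source does not specify the similarity measure or the reuse threshold, so both are assumptions here: blocks match when their DOM paths are equal and the Jaccard similarity of their target sets reaches `min_similarity`, and a block seen on at least `min_pages` pages is treated as an s-node. The function and parameter names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify_blocks(page_blocks, min_similarity=0.8, min_pages=2):
    """Group similar blocks across pages and label each group.

    page_blocks maps a page URL to a list of (dom_path, targets) pairs.
    Returns a list of dicts with keys: path, targets, pages, kind.
    """
    groups = []
    for page, blocks in page_blocks.items():
        for path, targets in blocks:
            targets = frozenset(targets)
            for group in groups:
                same_path = group["path"] == path
                if same_path and jaccard(group["targets"], targets) >= min_similarity:
                    group["pages"].add(page)  # same block reused on another page
                    break
            else:
                groups.append({"path": path, "targets": targets, "pages": {page}})

    for group in groups:
        # Assumed rule: reuse on several pages marks a structural block
        group["kind"] = "s-node" if len(group["pages"]) >= min_pages else "c-node"
    return groups
```

With this rule, a navigation menu that appears on two pages becomes one s-node, and a topic list that appears on a single page becomes a c-node.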

 

  • Step 3 – LSG graph generation

Connect nodes A and B with an edge if any of the target pages of block A contain block B.
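
The edge rule of Step 3 can be sketched as follows. The data layout is an assumption: each node records `targets` (the pages its links point to) and `pages` (the pages on which the block appears). The name `build_lsg` is hypothetical. The rule as stated also produces self-loops, for example for a menu that links to pages carrying the same menu, and this sketch keeps them.

```python
def build_lsg(nodes):
    """Build directed LSG edges from link-block nodes.

    nodes maps a node id to {"targets": set of URLs, "pages": set of URLs}.
    Returns {node_id: set of successor node ids}.
    """
    edges = {node_id: set() for node_id in nodes}
    for a, block_a in nodes.items():
        for b, block_b in nodes.items():
            # Edge A -> B if some target page of block A contains block B
            if block_a["targets"] & block_b["pages"]:
                edges[a].add(b)
    return edges
```

This pairwise comparison is quadratic in the number of nodes. For a large site, an index from page URL to the blocks on that page would avoid comparing every pair.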

 Figure: LSG edges

 

Web Site Structure Analysis

 

Identification of Subsites 

Subsites consist of collections of Web pages within a larger site.

 

Pages from a subsite often share a common template and the same navigation mechanism.

Subsites can be identified by decomposing the LSG into Strongly Connected Components (SCC) of s-nodes.
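
The decomposition can be sketched with Tarjan's algorithm, one standard way to compute strongly connected components. The source does not say which SCC algorithm InSite uses. The function names and the `kinds` mapping (node id to "s-node" or "c-node") are hypothetical, and whether a component with a single s-node should count as a subsite is left open here.

```python
def strongly_connected_components(edges):
    """Tarjan's algorithm. edges maps a node to an iterable of successors."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in edges.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in edges:
        if node not in index:
            visit(node)
    return components


def find_subsites(edges, kinds):
    """Restrict the LSG to s-nodes, then return its SCCs as candidate subsites."""
    s_edges = {
        node: [succ for succ in successors if kinds.get(succ) == "s-node"]
        for node, successors in edges.items()
        if kinds.get(node) == "s-node"
    }
    return strongly_connected_components(s_edges)
```

The recursive form is fine for small graphs. A site with many thousands of link blocks would need an iterative variant to stay within Python's recursion limit.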

 

Contact Us

Integrated Systems Group

Microsoft Research Ltd

7 JJ Thomson Avenue

Cambridge, CB3 0FB, UK

 

+44 (0) 1223 479 700 (Tel.)

+44 (0) 1223 479 999 (Fax)