Eduarda Mendes Rodrigues, Natasa Milic-Frayling, Martin Hicks, and Gavin Smyth
The standard Web graph representation fails to capture the topic association and functional grouping of links, and their occurrence across the pages of a site, which limits its applicability and usefulness. In this paper we introduce a novel method for representing the hypertext organization of Web sites in the form of Link Structure Graphs (LSGs). An LSG captures both the organization of links at the page level and the overall hyperlink structure of the collection of pages. It comprises vertices that correspond to link blocks of several types and edges that describe the reuse of such blocks across pages. Link blocks are identified approximately by analyzing the HTML Document Object Model (DOM), and blocks are further differentiated into types based on the recurrence of block elements across pages. The method yields a compact representation of all the hyperlinks on a site and enables novel analysis of site organization. Our approach is supported by the findings of an exploratory user study that reveals how users generally perceive hyperlink structure. We apply the algorithm to a sample of Web sites and discuss their link structure properties. Furthermore, we demonstrate that selective crawling strategies can be applied to generate key elements of the LSG incrementally, which further broadens the scope of LSG applicability.
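To make the idea of link blocks and their reuse concrete, the following is a minimal Python sketch, not the algorithm of the paper. Everything specific in it is an assumption made for illustration: the set of container tags treated as block boundaries (`BLOCK_TAGS`), the identification of a block with the set of its link targets, the two type labels ("shared" for a block that recurs on several pages, "local" otherwise), and the edge rule (block A points to block B when a link in A targets a page that contains B). The abstract does not specify these details, and the names `LinkBlockParser` and `build_lsg` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: these container elements delimit link blocks.
BLOCK_TAGS = {"ul", "ol", "nav", "div", "table"}


class LinkBlockParser(HTMLParser):
    """Groups the links of one page by their innermost open container."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, hrefs) for each container still open
        self.blocks = []  # completed blocks, each a frozenset of hrefs

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append((tag, []))
        elif tag == "a":
            href = dict(attrs).get("href")
            # Links outside any container are ignored in this sketch.
            if href and self.stack:
                self.stack[-1][1].append(href)

    def handle_endtag(self, tag):
        # Assumes well-formed nesting of the container tags.
        if tag in BLOCK_TAGS and self.stack and self.stack[-1][0] == tag:
            _, hrefs = self.stack.pop()
            if hrefs:
                self.blocks.append(frozenset(hrefs))


def build_lsg(pages):
    """Build a toy link structure graph.

    pages maps a page URL to its HTML; hrefs are assumed to use the
    same normalized form as the keys of pages.
    Returns (block_type, edges).
    """
    pages_by_block = defaultdict(set)
    for url, html in pages.items():
        parser = LinkBlockParser()
        parser.feed(html)
        parser.close()
        for block in parser.blocks:
            pages_by_block[block].add(url)

    # Assumed typing rule: recurrence across pages separates block types.
    block_type = {
        block: "shared" if len(owners) > 1 else "local"
        for block, owners in pages_by_block.items()
    }

    # Assumed edge rule: A -> B if a link in A targets a page containing B.
    edges = set()
    for a in pages_by_block:
        for b, owners in pages_by_block.items():
            if a != b and owners & a:
                edges.add((a, b))
    return block_type, edges
```

Because a block that recurs on many pages becomes a single vertex, a navigation menu repeated across a site is stored once, which is the sense in which such a representation is more compact than a page-level Web graph.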