Characterizing Web Content, User Interests, and Search Behavior by Reading Level and Topic

WSDM 2012: Proceedings of the 5th ACM International Conference on Web Search and Data Mining

Published by ACM

A user’s expertise, or ability to understand a document on a given topic, is an important aspect of that document’s relevance. However, this aspect has not been well explored in information retrieval systems, particularly those at Web scale, where the great diversity of content, users, and tasks presents an especially challenging search problem. To help improve our modeling and understanding of this diversity, we apply automatic text classifiers, based on reading difficulty and topic prediction, to estimate a novel type of profile for important entities in Web search: users, websites, and queries. These profiles capture topic and reading level distributions, which we then use in conjunction with search log data to characterize and compare different entities.
We find that reading level and topic distributions provide an important new representation of Web content and user interests, and that using both together is more effective than using either one separately. In particular, we find that: 1) the reading level of Web content and the diversity of visitors to a website can vary greatly by topic; 2) the degree to which a user’s profile matches a site’s profile is closely correlated with the user’s preference for that website in search results; and 3) site or URL profiles can be used to predict ‘expertness’, that is, whether a given site or URL is oriented toward expert or non-expert users. Our findings provide strong evidence in favor of jointly incorporating reading level and topic distribution metadata into a variety of critical tasks in Web information systems.
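As a rough illustration of the profile idea described above, the following minimal sketch builds a joint (reading level, topic) distribution for a user and for a site, then scores how closely the two match. The abstract does not specify how profiles are compared, so the use of Jensen-Shannon divergence here is an assumption, as are the function names `build_profile` and `js_divergence`, the grade-level labels, and the topic names; the labels stand in for classifier output and are not real data.

```python
import math
from collections import Counter


def build_profile(labels):
    """Normalize (reading_level, topic) label counts into a joint distribution.

    `labels` is a list of (reading_level, topic) pairs, e.g. hypothetical
    classifier predictions for the pages a user clicked or a site hosts.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two sparse distributions.

    Returns 0 for identical distributions and 1 for distributions with no
    shared support. This is an assumed match measure, not the paper's.
    """
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of supports.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); m[k] > 0 wherever a[k] > 0, so the ratio is defined.
        return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical labels: (grade level, topic) for each page.
user_pages = [(5, "health"), (5, "health"), (8, "science"), (5, "sports")]
site_pages = [(12, "health"), (12, "health"), (11, "science")]

user_profile = build_profile(user_pages)
site_profile = build_profile(site_pages)

# Lower divergence means the user and site profiles are closer.
mismatch = js_divergence(user_profile, site_profile)
```

Keying the profile on the (reading level, topic) pair rather than on each attribute separately reflects the abstract's claim that using both together is more informative than either alone: in this toy example the user and the site share topics but no joint cells, so they are treated as maximally dissimilar.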