Mining the Web for IP Address Geolocations

MSR-TR-2007-139

In this paper, we observe that many Web pages contain geolocation information (address, zip code, and telephone area code), and that many of these geolocation items are directly related to the locations of the IP addresses that host the Web pages. We then design Structon, a system that mines Web pages for IP address geolocations. In Structon, we first extract geolocation information from every crawled Web page. We then devise a series of information-clustering, false-information filtering, error-correction, and location-inference algorithms to map IP addresses to geolocations. We have run our algorithms on a set of 74M Chinese Web pages, from which we are able to identify the geolocations of 8.2M IP addresses, including addresses of client hosts as well as Web servers. We have verified our results against an IP address location table from a major Chinese ISP; the verification shows that the accuracy of Structon is 94.4% at the province level.
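The general idea of turning noisy per-page location items into an IP-to-location table can be illustrated with a minimal sketch. This is not the Structon algorithm itself: the abstract does not specify the clustering, filtering, or inference steps, so the code below assumes a simple majority vote over /24 prefixes. The function names `vote_locations` and `lookup`, the `min_share` threshold, and the /24 grouping are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
import ipaddress


def vote_locations(observations, min_share=0.5):
    """Map each /24 prefix to its majority location.

    observations: iterable of (ip_string, location) pairs, one pair per
    geolocation item extracted from a page hosted at that IP address.
    The /24 grouping and the min_share threshold are assumptions of this
    sketch, not details taken from the paper.
    """
    by_prefix = defaultdict(Counter)
    for ip, location in observations:
        # Collapse the address to its enclosing /24 network.
        prefix = ipaddress.ip_network(f"{ip}/24", strict=False)
        by_prefix[prefix][location] += 1

    result = {}
    for prefix, counts in by_prefix.items():
        location, hits = counts.most_common(1)[0]
        # Crude false-information filter: keep a prefix only when one
        # location holds a strict majority of the extracted items.
        if hits / sum(counts.values()) > min_share:
            result[prefix] = location
    return result


def lookup(ip, prefix_locations):
    """Infer a location for any IP in a voted prefix, or None if unknown."""
    prefix = ipaddress.ip_network(f"{ip}/24", strict=False)
    return prefix_locations.get(prefix)
```

Under these assumptions, an address that hosts no Web pages (for example a client host) can still receive a location when other addresses in its prefix agree, while a prefix with conflicting evidence is left unmapped rather than guessed.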