An IP-Intelligence Framework

Overview:

We explore host network-level properties, in particular host IP address properties, to derive rich context about a communication between a client and a service. Examples of such properties include whether the host has a dynamically allocated IP address, whether the host is a large proxy server with many users behind it, and whether the host has been associated with previously identified malicious activities. Such information can be used to improve service security and to help service providers better understand user requirements. We derive these properties automatically from large service logs.

Projects:

Introduction:

UDmap: Usage-based Dynamic IP-address Map

We developed a novel method, called UDmap, to identify dynamically assigned IP addresses and analyze their dynamics. UDmap is fully automatic and relies only on application-level server logs that are already available today.
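To make the idea of inferring dynamic addresses from login logs concrete, here is a minimal sketch. It is not the UDmap algorithm itself: the /24 block granularity, the "same user seen on several addresses in one block" heuristic, and the `min_roaming_users` threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
import ipaddress


def candidate_dynamic_blocks(logins, min_roaming_users=2):
    """Flag /24 blocks in which several users appear on more than one address.

    logins: iterable of (user_id, ip_string) pairs taken from a login log.
    Returns a set of block strings such as "10.0.0.0/24".
    Heuristic and threshold are illustrative assumptions, not UDmap's method.
    """
    # (block, user) -> set of distinct addresses the user logged in from
    ips_per_user = defaultdict(set)
    for user, ip in logins:
        block = ipaddress.ip_network(f"{ip}/24", strict=False)
        ips_per_user[(block, user)].add(ip)

    # Count, per block, the users who were seen on two or more addresses.
    roaming_users = defaultdict(int)
    for (block, _user), ips in ips_per_user.items():
        if len(ips) > 1:
            roaming_users[block] += 1

    return {str(block) for block, n in roaming_users.items()
            if n >= min_roaming_users}


if __name__ == "__main__":
    sample = [
        ("alice", "10.0.0.1"), ("alice", "10.0.0.7"),
        ("bob", "10.0.0.7"), ("bob", "10.0.0.9"),
        ("carol", "192.168.1.5"), ("carol", "192.168.1.5"),
    ]
    print(candidate_dynamic_blocks(sample))
```

A production system would need to separate address reassignment from proxies and travelling users, which this sketch does not attempt.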

We applied UDmap to a month-long Hotmail user-login trace and identified more than 102 million dynamic IP addresses. By correlating the inferred dynamic IP addresses with three consecutive months of Hotmail's email server logs, we established that 97% of mail servers set up on dynamic IPs sent out solely spam emails and were likely controlled by zombies. Moreover, these mail servers sent out a large amount of spam, accounting for over 42% of all spam emails to Hotmail. These results highlight the importance of accurately identifying dynamic IP addresses for spam filtering, and we expect similar benefits for phishing site identification and botnet detection.

                    

Top 10 ASes with the most dynamic IP addresses

HostTracker: De-anonymizing the Internet Using Unreliable IDs

Today’s Internet is open and anonymous. While it permits free traffic from any host, attackers that generate malicious traffic cannot typically be held accountable. We develop a system called HostTracker that tracks dynamic bindings between hosts and IP addresses by leveraging application-level data with unreliable IDs. Using a month-long user login trace from a large email provider, we show that HostTracker can attribute most of the activities reliably to the responsible hosts, despite the existence of dynamic IP addresses, proxies, and NATs. With this information, we are able to analyze the host population, to conduct forensic analysis, and also to blacklist malicious hosts dynamically.

Automatically inferring host-IP bindings
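The notion of a host-IP binding can be illustrated with a small sketch. It assumes login events of the form (timestamp, user ID, IP address) and treats the user ID as a stand-in for the host, which is a strong simplification. The function names `build_bindings` and `who_used` are hypothetical, and the real system must also cope with proxies, NATs, and conflicting IDs, none of which is handled here.

```python
from collections import defaultdict


def build_bindings(events):
    """Group login events into per-IP time windows.

    events: iterable of (timestamp, user_id, ip) tuples.
    Returns {ip: [(start, end, user_id), ...]} ordered by time. Consecutive
    events from the same user on the same IP extend the current window; an
    event from a different user opens a new one.
    """
    by_ip = defaultdict(list)
    for ts, user, ip in sorted(events):
        windows = by_ip[ip]
        if windows and windows[-1][2] == user:
            start, _end, _user = windows[-1]
            windows[-1] = (start, ts, user)   # extend the open window
        else:
            windows.append((ts, ts, user))    # a new binding begins
    return dict(by_ip)


def who_used(bindings, ip, ts):
    """Return the user IDs whose binding window on `ip` covers time `ts`."""
    return [user for start, end, user in bindings.get(ip, [])
            if start <= ts <= end]


if __name__ == "__main__":
    log = [
        (1, "u1", "1.1.1.1"), (5, "u1", "1.1.1.1"),
        (9, "u2", "1.1.1.1"), (12, "u2", "1.1.1.1"),
        (3, "u3", "2.2.2.2"),
    ]
    bindings = build_bindings(log)
    print(who_used(bindings, "1.1.1.1", 4))
```

A lookup like `who_used` is the kind of query a forensic analysis would issue: given malicious traffic from an address at a given time, which binding was active.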

AutoRE: Signature-based Spamming Botnet Detection

We developed AutoRE, a spam signature generation framework that detects and characterizes spamming botnets by leveraging both spam payload and spam server traffic properties. AutoRE does not require pre-classified training data or whitelists. Moreover, it outputs high-quality regular expression signatures that can detect botnet spam with a low false positive rate.
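As an illustration of what a regular expression signature might look like, the sketch below generalizes a group of similar URLs into one pattern. This is not AutoRE's algorithm: the prefix/suffix generalization, the function name `generalize`, and the `.example` URLs are assumptions made purely for the example.

```python
import os
import re


def generalize(urls):
    """Build one regex covering `urls` by keeping their shared prefix and
    suffix and replacing the varying middle with a bounded character class.

    Illustrative only; a real signature generator would also weigh how
    bursty and how widely distributed the matching traffic is.
    """
    prefix = os.path.commonprefix(urls)
    # Common suffix of what remains after the prefix, found by reversing.
    rests = [u[len(prefix):] for u in urls]
    suffix = os.path.commonprefix([r[::-1] for r in rests])[::-1]
    middles = [r[:len(r) - len(suffix)] for r in rests]

    # Use a narrow class when every varying part is ASCII alphanumeric.
    if all(re.fullmatch(r"[A-Za-z0-9]*", m) for m in middles):
        char_class = "[A-Za-z0-9]"
    else:
        char_class = "."
    lengths = [len(m) for m in middles]
    pattern = "%s%s{%d,%d}%s" % (
        re.escape(prefix), char_class, min(lengths), max(lengths),
        re.escape(suffix),
    )
    return re.compile(pattern)


if __name__ == "__main__":
    spam_urls = [
        "http://deal.example/p/ab12/offer.html",
        "http://deal.example/p/x9q7z/offer.html",
        "http://deal.example/p/77k/offer.html",
    ]
    signature = generalize(spam_urls)
    print(signature.pattern)
```

Bounding the length and character class of the varying part is one simple way to keep such a signature specific enough to limit false positives.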

Our in-depth analysis of the identified botnets revealed several interesting findings regarding the degree of email obfuscation, properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic. We believe these observations are useful for designing botnet detection schemes.

                        

Distribution of the botnet hosts around the globe

BotGraph: Large-scale Spamming Botnet Detection

Network security applications often require analyzing huge volumes of data to identify abnormal patterns or activities. The emergence of cloud-computing models opens up new opportunities to address this challenge by leveraging the power of parallel computing.

We design and implement a novel system, called BotGraph, to detect a new type of botnet spamming attack targeting major Web email providers. BotGraph uncovers the correlations among botnet activities by constructing large user-user graphs and looking for tightly connected subgraph components. This enables us to identify stealthy botnet users that are hard to detect when viewed in isolation. To deal with the huge data volume, we implement BotGraph as a distributed application on a computer cluster. We believe both our graph-based approach and our implementation are generally applicable to a wide class of security applications that analyze large datasets.

                       

An example login graph
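The graph-based idea can be sketched on a single machine as follows. The edge rule (two users are linked when they share at least `min_shared_ips` login addresses), both thresholds, and the function name `suspicious_groups` are assumptions for illustration. The sketch uses plain connected components as a stand-in for "tightly connected" subgraphs and omits the distributed implementation entirely.

```python
from collections import defaultdict
from itertools import combinations


def suspicious_groups(logins, min_shared_ips=2, min_group_size=3):
    """Link users that share login IPs and return large connected groups.

    logins: iterable of (user_id, ip) pairs.
    Returns a sorted list of sorted user-ID lists.
    Pair counting is quadratic per IP, so this only suits small inputs.
    """
    users_by_ip = defaultdict(set)
    for user, ip in logins:
        users_by_ip[ip].add(user)

    # Edge weight = number of distinct IPs on which both users appeared.
    shared = defaultdict(int)
    for users in users_by_ip.values():
        for a, b in combinations(sorted(users), 2):
            shared[(a, b)] += 1

    # Union-find over users joined by sufficiently heavy edges.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), weight in shared.items():
        if weight >= min_shared_ips:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for user in list(parent):
        groups[find(user)].add(user)
    return sorted(sorted(g) for g in groups.values()
                  if len(g) >= min_group_size)


if __name__ == "__main__":
    sample = [
        ("b1", "ip1"), ("b1", "ip2"),
        ("b2", "ip1"), ("b2", "ip2"), ("b2", "ip3"),
        ("b3", "ip2"), ("b3", "ip3"),
        ("n1", "ip1"), ("n2", "ip9"),
    ]
    print(suspicious_groups(sample))
```

In the sample, `b1` and `b3` share only one address, yet they end up in the same group through `b2`. This is the sense in which accounts that look unremarkable in isolation become visible once their correlations are graphed.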

Publications