A lecture titled "Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs" was part of UC San Diego's Center for Networked Systems' (CNS) recent Research Review. Read a full account of the research review here (written by Calit2's Tiffany Fox).
A number of lexical features and IP-based characteristics are relevant for detecting malicious URLs, spam, phishing and other exploits, and for trying to predict which Web sites are malicious. "There are various characteristics associated with these sites," said Justin Ma, the UC San Diego computer science grad student who gave the presentation. "The question is: How do you relate these properties of the URLs to the maliciousness of the Web sites?" (FYI, Justin Ma is third from the right in the photo above.)
Ma is part of a team (which includes computer science professors Stefan Savage and Geoff Voelker) that drew malicious URLs from those submitted to "phish tanks" by online users and compared them with benign URLs from online directories that had been previously vetted for validity. Using a probabilistic linear model called "logistic regression" as a classifier, the team reduced a set of 30,000 URL features down to 4,000 features for model analysis. They discovered that certain "red flags" indicate malicious intent, including:
1) suspicious ownership of the site
2) where the site is hosted geographically
3) the registration date of the site
4) what kind of connection the server is using
5) the presence of certain URL extensions.
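To make the approach concrete, here is a minimal sketch of training a logistic-regression classifier on lexical URL features. This is an illustration only, not the researchers' actual pipeline: the example URLs, the tiny feature dimension, and the hand-rolled gradient descent are all assumptions made for brevity.

```python
import math
import re
import zlib

def url_features(url, dim=256):
    # Hash lexical tokens (split on non-alphanumerics) into a fixed-size vector.
    vec = [0.0] * dim
    for tok in re.split(r"[^a-z0-9]+", url.lower()):
        if tok:
            vec[zlib.crc32(tok.encode()) % dim] += 1.0
    return vec

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def train(data, dim=256, lr=0.1, epochs=200):
    # data: list of (url, label) pairs, label 1 = malicious, 0 = benign.
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for url, y in data:
            x = url_features(url, dim)
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y  # gradient of the log-loss for this example
            b -= lr * g
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w, b

def predict(w, b, url):
    # Returns an estimated probability that the URL is malicious.
    x = url_features(url, len(w))
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Toy, made-up training set (hypothetical URLs): 1 = malicious, 0 = benign.
data = [
    ("http://bankofamerica.com/login", 0),
    ("http://google.com/search", 0),
    ("http://bankofamerica.com.cz.rnl/login", 1),
    ("http://paypal.com.verify-account.ru/update", 1),
]
w, b = train(data)
```

In the actual research, the feature set was far larger (tens of thousands of features, including the host-based signals listed above), but the core idea is the same: each URL becomes a feature vector, and the linear model learns which features correlate with maliciousness.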
The extension".com," for example, tends to signify a malicious Web site when it is found in the middle of a URL (i.e. "bankofamerica.com" is probably fine, but "bankofamerica.com.cz.rnl" should raise some eyebrows). Ultimately, the researchers would like to create a URL reputation service that will allow users to query URLs via a database to determine their validity.