Google Online Security Blog: March 2010

March 30, 2010

Posted by Neel Mehta, Security Teamdiscussedsuspicious account activity alerts

March 30, 2010

We use a sample of the data that our system generates to train the classifier that lies at the core of our automatic system using a machine learning algorithm. Coming up with good labels (phishing/not phishing) for this data is tricky because we can’t label each of the millions of pages ourselves. Instead, we use our published phishing page list, largely generated by our classifier, to assign labels for the training data.

You might be wondering if this system is going to lead to situations where the classifier makes a mistake, puts that mistake on our list, and then uses the list to learn to make more mistakes. Fortunately, the chain doesn’t make it that far. Our classifier only makes a relatively small number of mistakes, which we can correct manually when you report them to us. Our learning algorithms can handle a few temporary errors in the training labels, and the overall learning process remains stable.

How well does this work?

Of the millions of webpages that our scanners analyze for phishing, we successfully identify 9 out of 10 phishing pages. Our classification system only incorrectly flags a non-phishing site as a phishing site about 1 in 10,000 times, which is significantly better than similar systems. In our experience, these “false positive” sites are usually built to distribute spam or may be involved with other suspicious activity. While phishers are constantly changing their strategies, we find that they do not change them enough to reliably escape our system. Our experiments showed that our classification system remained effective for over a month without retraining.

If you are a webmaster and would like more information about how to keep your site from looking like a phishing site, please check out our post on the Webmaster Central Blog. If you find that your site has been added to our phishing page list ("Reported Web Forgery!") by mistake, please report the error to us.

Posted by Colin Whittaker, Anti-Phishing Team phishingSafeBrowsing APIpaper17th Annual Network and Distributed System Security SymposiumWhat we look for

Our system analyzes a number of webpage features to help make a verdict about whether a site is a phishing site. Starting with a page’s URL, we look to see if there is anything unusual about the host, such as whether the hostname is unusually long or whether the URL uses an IP address to specify the host. We also look to see if the URL contains any phrases like “banking” or “login” that might indicate that the page is trying to steal information.

We don’t just look at the URL, though. After all, a perfectly legitimate site could certainly use words like “banking” or “login.” We collect a snapshot of the page’s content to examine it closely for phishing behavior. For example, we check to see if the page has a password field or whether most of the links point to a common phishing target, as both of these characteristics can be a sign of phishing. Additionally, we pick out some of the most characteristic terms that show up on a page (as defined by their TF-IDF scores), and look for terms like “password” or “PIN number,” which also may indicate that the page is intended for phishing.

We also check the page’s hosting information to find out which networks host the page and where the page’s servers are located geographically. If a site purporting to be an American bank runs its servers in a different country and is hosted on a local residential ISP’s network, we have a strong signal that the site is bad.

Finally, we check the page’s PageRank to see if the page is popular or not, and we check the spam reputation of the page’s domain. We discovered in our research findings that almost all phishing pages are found on domains that almost exclusively send spam. You can observe this trend in the CCDF graph of the spam reputation scores for phishing pages as compared to the graph of other, non-phishing pages.

How we learn to recognize phishing pages
We use a sample of the data that our system generates to train the classifier that lies at the core of our automatic system using a machine learning algorithm. Coming up with good labels (phishing/not phishing) for this data is tricky because we can’t label each of the millions of pages ourselves. Instead, we use our published phishing page list, largely generated by our classifier, to assign labels for the training data.

You might be wondering if this system is going to lead to situations where the classifier makes a mistake, puts that mistake on our list, and then uses the list to learn to make more mistakes. Fortunately, the chain doesn’t make it that far. Our classifier only makes a relatively small number of mistakes, which we can correct manually when you report them to us. Our learning algorithms can handle a few temporary errors in the training labels, and the overall learning process remains stable.

How well does this work?

Of the millions of webpages that our scanners analyze for phishing, we successfully identify 9 out of 10 phishing pages. Our classification system only incorrectly flags a non-phishing site as a phishing site about 1 in 10,000 times, which is significantly better than similar systems. In our experience, these “false positive” sites are usually built to distribute spam or may be involved with other suspicious activity. While phishers are constantly changing their strategies, we find that they do not change them enough to reliably escape our system. Our experiments showed that our classification system remained effective for over a month without retraining.

If you are a webmaster and would like more information about how to keep your site from looking like a phishing site, please check out our post on the Webmaster Central Blog. If you find that your site has been added to our phishing page list ("Reported Web Forgery!") by mistake, please report the error to us.

Detecting suspicious account activity

March 24, 2010

Gmail BlogPosted by Pavni Diwanji, Engineering Directorremote sign out and information about recent account activity

IP addressGmail privacy policy

keeping your information secureGoogle Online Security Blog

Meet skipfish, our automated web security scanner

March 19, 2010

Posted by Michal ZalewskiratproxyBrowser Security Handbookimprove the security of third-party browsersskipfish

High speed: written in pure C, with highly optimized HTTP handling and a minimal CPU footprint, the tool easily achieves 2000 requests per second with responsive targets.

Ease of use: the tool features heuristics to support a variety of quirky web frameworks and mixed-technology sites, with automatic learning capabilities, on-the-fly wordlist creation, and form autocompletion.

Cutting-edge security logic: we incorporated high quality, low false positive, differential security checks capable of spotting a range of subtle flaws, including blind injection vectors.

ratproxyskipfishthis pageavailable here

Federal Support for Federated Login

March 3, 2010

Posted by Eric Sachs, Senior Product Manager, Google Securitydiscussed the progressOpenIDcreation of a pilot initiativeOpen Identity Exchange (OIX)today namedready to accept

Security Blog

The chilling effects of malware

Phishing phree

Detecting suspicious account activity

Meet skipfish, our automated web security scanner

Federal Support for Federated Login

Labels

Archive

Feed