At the core of our automatic system is a classifier, which we train with a machine learning algorithm on a sample of the data that the system generates. Coming up with good labels (phishing/not phishing) for this data is tricky because we can’t label each of the millions of pages ourselves. Instead, we use our published phishing page list, largely generated by our classifier, to assign labels for the training data.

You might be wondering if this system is going to lead to situations where the classifier makes a mistake, puts that mistake on our list, and then uses the list to learn to make more mistakes. Fortunately, the chain doesn’t make it that far. Our classifier only makes a relatively small number of mistakes, which we can correct manually when you report them to us. Our learning algorithms can handle a few temporary errors in the training labels, and the overall learning process remains stable.
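
To make the labeling step concrete, here is a minimal sketch of how labels could be derived from a published list and then adjusted for reported errors. It is only an illustration: the function name, the data structures, and the example URLs are all hypothetical, and our production pipeline is not shown here.

```python
def label_training_pages(sample_urls, published_list, confirmed_mistakes):
    """Label each sampled URL as phishing (True) or not phishing (False).

    Assumption for this sketch: a page is labeled phishing if it appears
    on the published list, unless a manual review has confirmed that the
    listing was a mistake.
    """
    listed = set(published_list)
    corrected = set(confirmed_mistakes)
    return {url: (url in listed and url not in corrected) for url in sample_urls}


# Hypothetical example data, not real entries from the list.
sample = ["bank-login.example", "news.example", "shop.example"]
published = ["bank-login.example", "shop.example"]
mistakes = ["shop.example"]  # reported as an error and corrected by hand

labels = label_training_pages(sample, published, mistakes)
```

In this sketch, `shop.example` ends up labeled not phishing even though it is on the list, because the correction is applied before the labels reach training. Any mistakes that have not been reported yet stay in the labels as a small amount of noise.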

How well does this work?

Our scanners analyze millions of webpages for phishing, and we successfully identify 9 out of 10 of the phishing pages among them. Our classification system incorrectly flags a non-phishing site as a phishing site only about 1 in 10,000 times, which is significantly better than similar systems. In our experience, these “false positive” sites are usually built to distribute spam or may be involved with other suspicious activity. While phishers are constantly changing their strategies, we find that they do not change them enough to reliably escape our system. Our experiments showed that our classification system remained effective for over a month without retraining.
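
To get a feel for what these two rates mean together, the arithmetic can be sketched as follows. The 9 in 10 and 1 in 10,000 figures come from the paragraph above; the page counts in the example are made up purely for illustration, and the function name is hypothetical.

```python
def expected_outcomes(phishing_pages, legit_pages,
                      detection_rate=0.9, false_positive_rate=1e-4):
    """Estimate classifier outcomes for a given mix of pages.

    detection_rate: share of phishing pages that get flagged (9 in 10).
    false_positive_rate: share of non-phishing pages flagged by mistake
    (about 1 in 10,000).
    """
    caught = phishing_pages * detection_rate
    missed = phishing_pages - caught
    false_alarms = legit_pages * false_positive_rate
    flagged = caught + false_alarms
    # Share of flagged pages that really are phishing.
    precision = caught / flagged if flagged else 0.0
    return {
        "caught": caught,
        "missed": missed,
        "false_alarms": false_alarms,
        "precision": precision,
    }


# Hypothetical volumes: 1,000 phishing pages among 1,000,000 other pages.
result = expected_outcomes(phishing_pages=1_000, legit_pages=1_000_000)
```

With these assumed volumes, roughly 900 phishing pages would be caught, 100 would be missed, and about 100 non-phishing pages would be flagged in error. The share of flagged pages that are truly phishing depends heavily on how rare phishing is in the pages being scanned, which is why the false positive rate has to be so low.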

If you are a webmaster and would like more information about how to keep your site from looking like a phishing site, please check out our post on the Webmaster Central Blog. If you find that your site has been added to our phishing page list ("Reported Web Forgery!") by mistake, please report the error to us.