Detecting Phishing on the “Edge”
Last week I talked about how building software in 2004 differed from building it in 2018, in the context of an anti-phishing startup I co-founded back then. This week I want to dive deeper into how we used machine learning.
According to Craig Martell, head of a data science group at LinkedIn, the goal of machine learning is to “build systems that can operate under uncertainty, that can make predictions under uncertainty, so that humans don’t have to.” In 2004, AOL users were increasingly the targets of phishing attacks. No human could review all the links sent to AOL users, let alone the millions of links included in email messages marked as spam each day. AOL needed a system that could make predictions about the riskiness of these URLs so that humans didn’t have to.
Martell described how artificial intelligence really works: combining labelled data with an algorithm yields a function. That function accepts a feature vector as input and returns a prediction (a class). We followed this approach in building our MVP (before the term MVP was coined). My co-founder and I started by labelling data, classifying thousands of URLs by hand. This was our training set. I built a URL tokenizer that broke URLs into chunks and decorated them with additional metadata, then fed the tokens into a simple statistical model. Based on this model, our prediction function would accept the tokens of a new URL as input (the feature vector) and return a prediction (the probability that the URL was a phishing URL). As mentioned previously, this first approach was very fast and performed reasonably well. At the time, the most effective phishing attacks used URLs that looked like real URLs, and PayPal was the most common target. The presence of the token “path:webscr” in a URL that was not connected to the PayPal.com domain (i.e., that lacked the token “domain:paypal.com”) was a strong indicator of phishiness.
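A minimal sketch of that pipeline, assuming a decorated-token scheme like the one above and a naive-Bayes-style scorer. The token counts are invented for illustration; they are not our actual training data:

```python
import math
from urllib.parse import urlparse

def tokenize(url):
    """Break a URL into decorated tokens: the domain plus each path segment."""
    parsed = urlparse(url)
    tokens = [f"domain:{parsed.hostname}"]
    tokens += [f"path:{seg}" for seg in parsed.path.split("/") if seg]
    return tokens

# Toy per-token counts from a hand-labelled training set (illustrative only).
phish_counts = {"path:webscr": 40, "domain:paypal.com": 1}
ham_counts   = {"path:webscr": 1,  "domain:paypal.com": 50}

def phish_probability(url, prior=0.5):
    """Score a new URL: probability it is phishing, via per-token log-odds."""
    log_odds = math.log(prior / (1 - prior))
    for tok in tokenize(url):
        p = phish_counts.get(tok, 0) + 1  # Laplace smoothing for unseen tokens
        h = ham_counts.get(tok, 0) + 1
        log_odds += math.log(p / h)
    return 1 / (1 + math.exp(-log_odds))
```

With these toy counts, a URL containing “path:webscr” on an unrelated domain scores high, while one on paypal.com itself scores low, matching the indicator described above.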
We experimented with other approaches, including crawling potential phishing sites, in an attempt to raise our detection rates. While crawling worked, it was still reactive, requiring a user to first report a URL as spam. The turnaround time to detect phishing URLs was still far too long. That led us to explore moving to the “edge” and doing detection in-browser.
Jeff Welser from IBM described the emerging need for machine learning inference (prediction) to happen on the edge. A self-driving car can’t wait for a prediction from the cloud; it must make decisions in real time. This was exactly what we were trying to achieve with the client-side phishing detector: protecting users in real time. We experimented with a number of approaches. The most effective analyzed content signatures of phrases and images on a given page and matched them against a local database of signatures (think antivirus signatures, but for content). If phishing target content was detected on a site that was requesting user data via a form, we would alert the user. We licensed this technology to AOL for inclusion in their OpenRide browser. Unfortunately, soon after we executed the license, AOL killed the OpenRide project, so our client-side solution was never deployed.
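A rough sketch of this kind of in-browser check, assuming the local database holds SHA-1 hashes of known target phrases. The phrases, page parsing, and form detection here are simplified assumptions for illustration, not the licensed implementation:

```python
import hashlib
import re

# Local signature database: hashes of phrases known to appear on pages
# that phishing kits clone (phrases here are illustrative).
SIGNATURES = {
    hashlib.sha1(p.encode()).hexdigest()
    for p in ["verify your paypal account", "your account has been limited"]
}

def phrases(html):
    """Strip tags, lowercase, and split the page text into rough phrases."""
    text = re.sub(r"<[^>]+>", " ", html).lower()
    return [p.strip() for p in re.split(r"[.!?\n]", text) if p.strip()]

def should_alert(html):
    """Alert only when known target content co-occurs with a data-entry form."""
    has_form = "<form" in html.lower()
    matched = any(
        hashlib.sha1(p.encode()).hexdigest() in SIGNATURES
        for p in phrases(html)
    )
    return has_form and matched
```

Requiring both a signature match and a form keeps false positives down: a news article quoting a phishing email matches the content but has no form, so no alert fires.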