How 2018 Software is Different From 2004

During her presentation to our class on Friday, Jetlore’s COO Montse Medina shared the story of how Jetlore got started, and specifically how software in 2018 is different from 2008. Her startup story reminded me of my own.

In 2004, I co-founded CollectiveTrust Solutions to develop anti-phishing technologies. I was coming off of five years building the NetCaptor web browser, and some of my experiences there pushed me towards trying to keep end-users safe from phishing attacks.

Like Jetlore, our first product had its origins when a customer gave us data and asked us to help them solve a problem. My co-founder Blake Hayward was at a conference when a representative from America Online stood on top of a chair and issued an open call to potential vendors to help them detect phishing sites. AOL would share millions of URLs from emails reported as spam and vendors would tell them which ones were fraudulent. We were involved in a “bake-off” with some of the largest security companies in the world.

There was no cloud in 2004. I had leased dedicated servers on various occasions to manage downloads and run ad systems for NetCaptor, but those were fairly expensive. Inspired by recent advances in spam detection algorithms (Paul Graham published his seminal “A Plan for Spam” in August 2002), I built a simple statistical classifier in Python that tokenized and classified URLs using hand-curated examples. Every day AOL would deliver a file with several million URLs that had been caught by their spam detection algorithms. We’d run it through the classifier and send back the highest-risk URLs.
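To give a feel for what a hand-coded classifier of this kind involves, here is a minimal sketch. It is not the original code: it assumes a naive Bayes model with add-one smoothing, a tokenizer that splits URLs on anything that is not a letter or digit, and made-up training URLs on reserved example domains. The names UrlClassifier, phishing_score, and top_risky are hypothetical.

```python
import heapq
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return TOKEN_RE.findall(url.lower())


class UrlClassifier:
    """Naive Bayes over URL tokens (an assumed model, for illustration only)."""

    def __init__(self):
        self.counts = {"phish": Counter(), "legit": Counter()}
        self.docs = Counter()

    def train(self, url, label):
        self.counts[label].update(tokenize(url))
        self.docs[label] += 1

    def phishing_score(self, url):
        """Return log-odds that the URL is phishing; higher means riskier."""
        vocab = set(self.counts["phish"]) | set(self.counts["legit"])
        # Add-one smoothing, with one extra slot reserved for unseen tokens.
        totals = {
            label: sum(c.values()) + len(vocab) + 1
            for label, c in self.counts.items()
        }
        score = math.log((self.docs["phish"] + 1) / (self.docs["legit"] + 1))
        for token in tokenize(url):
            p = (self.counts["phish"][token] + 1) / totals["phish"]
            q = (self.counts["legit"][token] + 1) / totals["legit"]
            score += math.log(p / q)
        return score

    def top_risky(self, urls, n):
        """Return the n highest-scoring URLs, riskiest first."""
        return heapq.nlargest(n, urls, key=self.phishing_score)


# Hand-curated examples (invented for this sketch).
clf = UrlClassifier()
for u in [
    "http://bank.example.com.secure-login.example.net/verify/account",
    "http://192.0.2.10/signin/update/bank",
    "http://secure-update.example.org/bank/login/verify",
]:
    clf.train(u, "phish")
for u in [
    "http://www.example.com/news/sports",
    "http://shop.example.com/products/shoes",
    "http://www.example.org/blog/2004/python",
]:
    clf.train(u, "legit")
```

Calling clf.top_risky(daily_urls, 1000) on the day's list would then give the shortlist to send back. Scoring in log-odds rather than probabilities keeps long URLs from underflowing.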

The classifier ran on a low-end generic PC running Linux in the corner of my office at home. I had a fast internet connection. A daily cron job would connect to AOL’s FTP server, download that day’s file, run it through the classifier, and send the results to AOL. My co-founder and I would review the results and spot-check the URLs we had missed to add new training examples.

We didn’t win the bake-off outright, but we did well enough to stay in the game for close to two years. I also developed client-side (browser plugin) software that used content-matching to detect and warn users about phishing sites. We licensed that software to AOL in 2006 and soon after sold the company to MarkMonitor.

If I were building this MVP today, I would take a very different approach than I did in 2004. There would be no little white PC clone in the corner. Among current cloud providers, I have the most familiarity with Amazon’s AWS offerings, so I’d start there.

First, I’d use S3 for blob storage and archiving the data from AOL. S3 has integrated support for server-side encryption for security.

I would still build the initial classifier in Python, but this time, I’d use a library like scikit-learn instead of hand-coding the classifier. I would probably need to write a custom plugin for scikit-learn to persist token info and would lean towards using DynamoDB for that.
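As a rough illustration of how little code the library version needs, the sketch below wires a token-count vectorizer into a multinomial naive Bayes model with scikit-learn. The choice of model, the token pattern, the helper names, and the training URLs are all assumptions made for the example, and the DynamoDB persistence piece is left out.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def build_url_classifier():
    """Tokenize URLs on non-alphanumeric characters, then fit naive Bayes."""
    return make_pipeline(
        CountVectorizer(token_pattern=r"[A-Za-z0-9]+", lowercase=True),
        MultinomialNB(),
    )


# Invented training data on reserved example domains: 1 = phishing, 0 = legitimate.
TRAIN_URLS = [
    "http://bank.example.com.secure-login.example.net/verify/account",
    "http://192.0.2.10/signin/update/bank",
    "http://secure-update.example.org/bank/login/verify",
    "http://www.example.com/news/sports",
    "http://shop.example.com/products/shoes",
    "http://www.example.org/blog/2004/python",
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]

model = build_url_classifier()
model.fit(TRAIN_URLS, TRAIN_LABELS)


def phishing_probability(fitted_model, urls):
    """Return P(phishing) per URL; column 1 corresponds to label 1."""
    return fitted_model.predict_proba(urls)[:, 1]
```

Sorting the day's URLs by phishing_probability and keeping the top slice would reproduce the daily shortlist. Swapping in a different estimator is a one-line change, which is most of the appeal over a hand-rolled classifier.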

I would use an AWS Lambda scheduled function to do the daily data pull and push that data into S3. Another Lambda function would listen for S3 events and process the incoming file automatically, generating the list of riskiest URLs and delivering them to us and AOL.
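The S3-triggered half of that pipeline might look something like the sketch below. It assumes the standard S3 notification event shape and the boto3 client that the Lambda Python runtime provides. The RESULTS_BUCKET variable, the results/ key prefix, and the rank_urls placeholder (standing in for the classifier) are hypothetical.

```python
import os
import urllib.parse

# Hypothetical setting: writing results to a separate bucket keeps this
# function from re-triggering itself on its own output.
RESULTS_BUCKET = os.environ.get("RESULTS_BUCKET", "example-results-bucket")


def s3_objects_from_event(event):
    """Return (bucket, key) pairs from an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded, with spaces as '+'.
        key = urllib.parse.unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def rank_urls(urls, limit=1000):
    """Placeholder for the classifier step; here it only truncates the list."""
    return urls[:limit]


def handler(event, context):
    """Lambda entry point: read each new file, rank it, store the results."""
    import boto3  # imported here so the helpers above can be tested without AWS

    s3 = boto3.client("s3")
    for bucket, key in s3_objects_from_event(event):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        urls = body.decode("utf-8").splitlines()
        risky = rank_urls(urls)
        s3.put_object(
            Bucket=RESULTS_BUCKET,
            Key="results/" + key,
            Body="\n".join(risky).encode("utf-8"),
            ServerSideEncryption="AES256",  # S3-managed server-side encryption
        )
```

Reading the whole file into memory is fine for a sketch, but a file with several million URLs would be better streamed line by line or split into chunks.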

After the MVP phase, I would develop a more robust workflow using some other AWS technologies. Lambda has some limitations (a 5-minute execution limit, etc.), so some functionality might need to move to a worker-queue system running on EC2. AWS also offers more advanced model-building tools, and it would be worth investigating whether Comprehend (NLP and text analytics) would be valuable for extracting additional insights from our phishing URL corpus.

I appreciated Medina’s admonishment that not every technology should be a business. She reminded us that tech is easy, but business is hard. Like Medina at Jetlore, we should have started with a better handle on the business opportunity. And like Jetlore, we were very, very lucky to be successful.