Trackbacks and Pingbacks and Blog Spam, Oh My!

A.K.A. Let's see how difficult it is to write a script that programmatically solves one of the most popular WordPress Captcha plugins. [tl;dr: It's not very hard at all.]

WordPress Captcha Plugin

So, I stepped away from this blog for a brief period of time and, upon return, noticed spam comments piling up across many of my posts. Blog spam is an ongoing epidemic, but one that I didn't put much thought into when creating this blog. In fact, I did nothing more than install the #1 ranked plugin when searching "WordPress Captcha" on Google. This plugin, developed by BestWebSoft, calls on users to complete rudimentary math problems before registering or submitting comments. At first glance, I applauded its simplicity when compared to the much more infuriating image-deciphering ploys common with captcha services.

The jargon used in the title of this post relates to some of the common techniques spammers use to falsely inflate the reach and popularity of their often malicious or ad-ridden content. Put simply, the more times that a specific URL or keyword is referenced by external blogs and websites, the higher it will appear in search results and directories. Google's PageRank algorithm, for instance, is described as being "based on things like the number of links leading to that page. Pages with higher PageRank are more likely to appear at the top of Google search results." Search engine giants are constantly tweaking their site indexing formulas to thwart would-be spammers, but spammers are likewise continuously evolving their own methods of gaming the system.

Example Blog Spam

A sample of blog spam submitted to my website.

As you can see in the typical representation of blog spam shown above, the content is mostly gibberish and often includes a fishy URL and select keywords that the spammers are trying to drive traffic toward. The links themselves consist mostly of affiliate links, malware or ad-ridden dummy content, and fabricated promises of free goodies for completing illegitimate surveys. All of these spam messages were trickling through my site even though I had activated and successfully tested the WordPress Captcha plugin. How, I wondered, were they bypassing the system?

Sure, it is possible for spammers to manually enter captcha data, but doing so severely hinders their otherwise automated output. Many applications have been designed specifically to assist with bypassing captcha blocks. JDownloader, as an example, implements many advanced methods to circumvent common captcha services on popular download sites, and when solving is not possible it presents a simple dialog of the captcha image for the user to manually solve. These same tool-assisted captcha solvers can be harnessed by blog spammers to crank out tens of thousands of gibberish messages on short notice to weakly protected blogs. Perhaps not surprisingly, the most accurate means of solving captcha puzzles is via cheap outsourced human labor, as detailed in this gripping case study by UC San Diego. According to the paper (published in 2010), sleazy sites in the underbelly of the web were paying $0.50 to $1 per 1,000 captchas solved, a job I certainly do not envy.

Enough with the Introduction, Let's See the Code!

As a challenge to myself, I wanted to explore the technical feasibility of writing my own custom script from scratch to auto-solve the captcha service that had let more than a dozen spam messages slip through in a few days' time. The stats of this particular captcha plugin are impressive: approximately 20,000 new installs per week, 300,000+ active installs, and over 2.5 million installs overall. It has received constant praise, currently sitting at a 4.5/5 star rating on the official WordPress Plugin site. The author also provides a brief video demonstration of its use.

With all settings enabled, this captcha system displays a random, elementary-level math problem that may be either addition, subtraction, multiplication, or division. One of the three values in each problem is randomly chosen as the unknown, and the user is expected to fill in the blank and complete the equation to prove they are a human. To further deter automated solvers, some of the numbers are written out as words, while others are presented as digits. With that understanding, let's get to it!

To create my proof-of-concept captcha solver, I chose to write the script using GreaseMonkey, which supports fast JavaScript development and simplified real-time testing (not to mention it is incredibly useful for locally enhancing third-party sites). To further speed up the process and simplify some of the syntax, I also make sparing use of jQuery, although that is certainly not mandatory for this small exercise. I will include ample comments in the code for explanation purposes.

Getting Started: Header and Main Variable Declaration
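A minimal sketch of what such a header and set of declarations might look like. Only the names wordBase and wordTens come from this post; the @include pattern, the jQuery URL and version, and the variable name equation are placeholders chosen for illustration.

```javascript
// ==UserScript==
// @name        WordPress Captcha Solver (proof of concept)
// @namespace   example
// @include     https://example.com/*
// @require     https://code.jquery.com/jquery-3.7.1.min.js
// @version     1
// @grant       none
// ==/UserScript==
// NOTE: the @include domain and @require URL above are illustrative assumptions.

// Written numbers from zero to nineteen, mapped to their integer values.
var wordBase = {
  "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
  "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
  "ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
  "fourteen": 14, "fifteen": 15, "sixteen": 16,
  "seventeen": 17, "eighteen": 18, "nineteen": 19
};

// Tens-place words mapped to their tens digit ("thirty" -> 3).
// The captcha never exceeds 99, so nothing larger is needed.
var wordTens = {
  "twenty": 2, "thirty": 3, "forty": 4, "fifty": 5,
  "sixty": 6, "seventy": 7, "eighty": 8, "ninety": 9
};

// Temporarily holds the individual components of the captcha equation
// while they are processed and solved (hypothetical name).
var equation = [];
```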

The initial code block above achieves the following:

  1. Defines the GreaseMonkey-specific metadata, as required by all GreaseMonkey scripts. In between the UserScript tags, I establish which domain(s) the script should run on, specify any required external dependencies (jQuery in this case), and supply a few other bits of information.
  2. Creates key-value arrays to associate the written numbers with their integer counterparts. The target captcha system never exceeds 99, so going beyond that is not necessary here.
  3. Defines a global variable to temporarily store the individual components of the captcha equation for processing and solving.

The next code block will be the heart of the script, but with a couple of referenced functions yet to be defined:
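As a rough sketch of how this main block could be structured: the parsing is pulled into a hypothetical helper, splitEquation(), so it can be exercised outside the browser, and the jQuery part only runs when jQuery is present. The selectors (.cptch_block, input[name='cptch_number']), the "?" placeholder, and the sample form values are assumptions about the page markup, not something confirmed here; wordToNumber() and getAnswer() are referenced but not defined in this snippet.

```javascript
// Clean the raw captcha text and split it into
// [left, operator, right, "=", result]. The blank must arrive as "?".
// toNumber converts a written number ("seven") into an integer.
function splitEquation(rawText, toNumber) {
  var cleaned = String(rawText)
    .toLowerCase()
    .replace(/\u00d7/g, "*")   // multiplication sign
    .replace(/\u2212/g, "-")   // typographic minus sign
    .replace(/\u00f7/g, "/")   // division sign
    .replace(/\s+/g, " ")      // collapse runs of whitespace
    .trim();

  // Operators must be surrounded by spaces, so a hyphenated word
  // such as "twenty-one" is not mistaken for a subtraction.
  var match = /^(.+?) ([+\-*\/]) (.+?) = (.+)$/.exec(cleaned);
  if (!match) return null;

  function convert(token) {
    if (token === "?") return token;                     // the unknown
    if (/^\d+$/.test(token)) return parseInt(token, 10); // already digits
    return toNumber(token);                              // written word
  }
  return [convert(match[1]), match[2], convert(match[3]), "=", convert(match[4])];
}

// Browser-only part; skipped entirely when jQuery is not loaded.
if (typeof jQuery !== "undefined") {
  jQuery(function ($) {
    var $block = $(".cptch_block").first();       // assumed selector
    if ($block.length === 0) return;              // no captcha on this page

    // Mark the missing piece: swap the answer input for a "?" in a copy.
    var $copy = $block.clone();
    $copy.find("input[name='cptch_number']")      // assumed field name
         .replaceWith(document.createTextNode(" ? "));

    var parts = splitEquation($copy.text(), wordToNumber);
    if (!parts) return;

    // Populate the comment form (placeholder values) and the answer.
    $("#author").val("Test User");
    $("#email").val("test@example.com");
    $("#comment").val("Test comment");
    $("input[name='cptch_number']").val(getAnswer(parts));
  });
}
```

If the real captcha block contains extra text beyond the equation itself, the selector would need to be narrowed before the text is handed to splitEquation().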

The code block above performs the following tasks:

  1. Waits for the document to load and then checks to see if the target captcha element exists on the page before proceeding.
  2. Marks the missing piece of the equation for future reference.
  3. Converts and cleans the full captcha equation and separates each component into an array.
  4. Iterates through the components and converts any written words into numbers (wordToNumber() is not yet implemented).
  5. Automatically populates the various input fields, including the answer field of the captcha (getAnswer() is not yet implemented).

There are still two undefined functions referenced in the code above. The wordToNumber() function will take in a string and convert it to an integer, as follows:
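One way such a function might look, as a sketch rather than the exact original. It assumes compound numbers arrive as a tens word followed by a ones word, separated by a space or a hyphen; the lookup tables are rebuilt here in compact form only so the snippet runs on its own.

```javascript
// Lookup tables equivalent to wordBase[] and wordTens[] from the top of
// the script, rebuilt compactly so this snippet is self-contained.
var wordBase = {};
("zero one two three four five six seven eight nine ten eleven twelve " +
 "thirteen fourteen fifteen sixteen seventeen eighteen nineteen")
  .split(" ")
  .forEach(function (word, i) { wordBase[word] = i; });

var wordTens = {};
"twenty thirty forty fifty sixty seventy eighty ninety"
  .split(" ")
  .forEach(function (word, i) { wordTens[word] = i + 2; });

// Convert a written number between zero and ninety-nine to an integer.
// Returns NaN when the text is not recognised.
function wordToNumber(text) {
  var word = String(text).toLowerCase().trim();

  // First loop: zero through nineteen are a direct match.
  for (var base in wordBase) {
    if (word === base) return wordBase[base];
  }

  // Second loop: find the tens word, then look at what follows it.
  for (var tens in wordTens) {
    if (word.indexOf(tens) === 0) {
      var rest = word.slice(tens.length).replace(/^[\s-]+/, "");
      // Exactly 20, 30, 40, ...: multiply the tens digit by 10.
      if (rest === "") return wordTens[tens] * 10;
      // Otherwise the remainder must be a ones word (one to nine).
      if (Object.prototype.hasOwnProperty.call(wordBase, rest) &&
          wordBase[rest] >= 1 && wordBase[rest] <= 9) {
        return wordTens[tens] * 10 + wordBase[rest];
      }
    }
  }
  return NaN;
}
```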

This function gets around the random written-word obstacle of the captcha plugin.

Here we compare the written-word representation of a number, passed from the main code block, to the associative arrays defined at the top of the script. If the string is "twelve", for instance, the first loop in this function finds a match in the wordBase[] array and returns the integer 12. If the value is greater than 19, we check against the tens-place array, wordTens[], instead, so that "thirty" is replaced with the digit 3 and the word for the ones place, if any, supplies the second digit. If the number is an exact multiple of ten (20, 30, 40, etc.), the tens digit is multiplied by 10 so that the correct value is still returned.

Finally, the getAnswer() function takes the known values and the operator and calculates the missing value:
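A sketch of how getAnswer() could work, assuming the equation has already been split into a five-element array of the form [left, operator, right, "=", result], with the string "?" standing in for the blank. That array layout and the "?" marker are illustrative choices, not necessarily what the original script used.

```javascript
// Solve a captcha equation given as [left, operator, right, "=", result],
// where exactly one of left, right, or result is the string "?".
// Returns NaN if the operator or layout is not recognised.
function getAnswer(parts) {
  var a = parts[0], op = parts[1], b = parts[2], c = parts[4];

  // a op b = ?  -> apply the operator directly.
  if (c === "?") {
    switch (op) {
      case "+": return a + b;
      case "-": return a - b;
      case "*": return a * b;
      case "/": return a / b;
    }
  }

  // ? op b = c  -> apply the inverse operator to the result.
  if (a === "?") {
    switch (op) {
      case "+": return c - b;
      case "-": return c + b;
      case "*": return c / b;
      case "/": return c * b;
    }
  }

  // a op ? = c  -> subtraction and division are not symmetric,
  // so the operands swap roles compared with the case above.
  if (b === "?") {
    switch (op) {
      case "+": return c - a;
      case "-": return a - c;
      case "*": return c / a;
      case "/": return a / c;
    }
  }

  return NaN;
}
```

For instance, getAnswer([7, "*", "?", "=", 42]) evaluates to 6, and getAnswer(["?", "-", 3, "=", 5]) evaluates to 8.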

The final piece of the script, above.

There is some redundancy in a few places of the final function that could be consolidated or rewritten, but it gets the job done. There are some nice JavaScript equation solver libraries floating around, but they would certainly be overkill for this small and targeted captcha-solving script. In the function above, the position within the equation that the missing value belongs to is first determined, then the answer is calculated based on the known values and the given operator. The form could also easily be auto-submitted once populated by calling $("#submit-comment").click();.

Programmatically Solving WordPress Captcha

Success! The script instantly solves every captcha equation as soon as the page loads.

In conclusion, it is evident that this particular captcha system is quite easy to solve in an automated manner. More dependable and less intrusive solutions, which do not require solving captcha-like puzzles but instead rely on comment analysis and blacklist databases, are widely available. I am currently test-driving the popular hosted anti-spam solution Akismet in conjunction with the Bad Behavior plugin and will report the results at a later date.
