A.K.A. Let's see how difficult it is to write a script that programmatically solves one of the most popular WordPress Captcha plugins. [tl;dr: It's not very hard at all.]
So, I stepped away from this blog for a brief period of time and, upon return, noticed an aggregation of spam comments throughout many of my posts. Blog spam is an ongoing epidemic but one that I didn't put much consideration into when creating this blog. In fact, I did nothing more than install the #1 ranked plugin when searching "WordPress Captcha" on Google. This plugin, developed by BestWebSoft, calls on users to complete rudimentary math problems before registering or submitting comments. At first glance, I applauded its simplicity when compared to the much more infuriating image deciphering ploys common with captcha services.
The jargon used in the title of this post relates to some of the common techniques spammers use to falsely inflate the reach and popularity of their often malicious or ad-ridden content. In its simplest form, the more times that a specific URL or keyword is referenced by external blogs and websites, the higher it will appear in search results and directories. Google's PageRank algorithm, for instance, is described as being "based on things like the number of links leading to that page. Pages with higher PageRank are more likely to appear at the top of Google search results." Search engine giants are constantly tweaking their site indexing formulas to thwart would-be spammers, but spammers are likewise continuously evolving their own methods of gaming the system.
As you can see in the typical representation of blog spam shown above, the content is mostly gibberish and often includes a fishy URL and select terminology that the spammers are trying to drive traffic toward. The links themselves consist mostly of affiliate links, malware or ad-ridden dummy content, and fabricated promises of free goodies for completing illegitimate surveys. All of these spam messages were trickling through my site even though I had activated and successfully tested the WordPress Captcha plugin. How, I wondered, were they bypassing the system?
Sure, it is possible for spammers to manually enter captcha data, but doing so severely hinders their otherwise automated output. Many applications have been designed to purposely assist with bypassing captcha blocks—jDownloader, as an example, implements many advanced methods to circumvent common captcha services on popular download sites, and when solving is not possible it presents a simple dialog of the captcha image for the user to manually solve. These same tool-assisted captcha solvers can be harnessed by blog spammers to crank out tens of thousands of gibberish messages on short notice to weakly protected blogs. Perhaps not surprisingly, the most accurate means of solving captcha puzzles is via cheap outsourced human labor, as detailed in this gripping Case Study by UC San Diego. According to the paper (published in 2010), sleazy sites in the underbelly of the web were paying $0.50 to $1 per 1,000 captchas that the person solved, a job I certainly do not envy.
Enough with the Introduction, Let's See the Code!
As a challenge to myself, I wanted to explore the technical feasibility of writing my own custom script from scratch to auto-solve the captcha service that had let more than a dozen spam messages slip through in a few days' time. The stats of this particular captcha plugin are impressive: approximately 20,000 new installs per week, 300,000+ active installs, and over 2.5 million installs overall. It has received constant praise, currently sitting at a 4.5/5 star rating on the official WordPress Plugin site. The author includes a brief video demonstration of its use.
With all settings enabled, this captcha system displays a random, elementary-level math problem that may be addition, subtraction, multiplication, or division. One of the three values in each problem is randomly chosen as the unknown, and the user is expected to fill in the blank and complete the equation to prove they are human. To further deter automated solvers, some of the numbers are written out as words, while others are presented as digits. With that understanding, let's get to it!
To create my proof-of-concept captcha solver, I chose to write the script using GreaseMonkey, which supports fast JavaScript development and simplified real-time testing (not to mention it is incredibly useful for locally enhancing third-party sites). To further speed up the process and simplify some of the syntax, I also make sparing use of jQuery, although that is certainly not mandatory for this small exercise. I will include ample comments in the code for explanation purposes.
Getting Started: Header and Main Variable Declaration
    // ==UserScript==
    // @name        Wordpress Captcha Plugin Solver Proof-of-Concept
    // @author      Matt Pilz
    // @namespace   mattpilz.com
    // @description Created to illustrate automatic solving of Captcha equations
    // @include     http://mattpilz.com/*
    // @require     https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js
    // @released    2015-03-20
    // @version     1
    // @grant       none
    // ==/UserScript==

    // Ensure jQuery won't conflict with other scripts
    this.$ = this.jQuery = jQuery.noConflict(true);

    // Store key-value pairs of written numbers to literal numbers for later processing
    var wordBase = {
        'zero': 0,
        'one': 1,
        'two': 2,
        'three': 3,
        'four': 4,
        'five': 5,
        'six': 6,
        'seven': 7,
        'eight': 8,
        'nine': 9,
        'ten': 10,
        'eleven': 11,
        'twelve': 12,
        'thirteen': 13,
        'fourteen': 14,
        'fifteen': 15,
        'sixteen': 16,
        'seventeen': 17,
        'eighteen': 18,
        'nineteen': 19
    };

    var wordTens = {
        'twenty': 2,
        'thirty': 3,
        'forty': 4,
        'fifty': 5,
        'sixty': 6,
        'seventy': 7,
        'eighty': 8,
        'ninety': 9
    };

    var captchaComponent; // Array for storing individual captcha components
The initial code block above achieves the following:
- Defines the GreaseMonkey-specific metadata, as required by all GreaseMonkey scripts. In between the UserScript tags, I establish which domain(s) to include for running the script, specify any required external dependencies (jQuery in this case), and provide other random bits of information.
- Creates key-value lookup objects to associate the written numbers with their integer counterparts. The target captcha system never exceeds 99, so going beyond that is not necessary here.
- Defines a global variable to temporarily store the individual components of the captcha equation for processing and solving.
The next code block is the heart of the script, but it references a couple of functions that are yet to be defined:
    // On page load
    $(document).ready(function() {
        // Exit code block if captcha doesn't exist on page
        if (!$("#cptch_input").length) return;

        // Insert a hidden underscore before input field to mark the missing piece
        $("#cptch_input").before('<span style="display: none;">_</span>');

        // Retrieve full captcha string
        var captchaString = $.trim($(".cptch_block").text().replace(/\s{2,}/g, ' '));

        // Split the string to get the individual components of captcha equation
        captchaComponent = captchaString.split(" ");

        // Loop through equation number components (every other entry is a number)
        for (var i = 0; i < captchaComponent.length; i += 2) {
            // Store current component value
            var captchaVal = captchaComponent[i];

            // If the current value is not the missing number, prepare it
            if (captchaVal != "_") {
                // Number is written as a word, convert back to integer
                if (isNaN(parseInt(captchaVal, 10))) captchaComponent[i] = wordToNumber(captchaVal);

                // Make certain final value is an integer and not a string
                captchaComponent[i] = parseInt(captchaComponent[i], 10);
            }
        }

        // Sometimes the resultant value will be two written words (i.e., thirty six), merge accordingly
        if (captchaComponent.length == 6) {
            // Merge the two words into one and store in appropriate index
            captchaComponent[4] = parseInt(captchaComponent[4].toString() + wordToNumber(captchaComponent[5]), 10);

            // Lop off the end index that is no longer needed
            captchaComponent.splice(-1, 1);
        }

        // Finally, we can automatically solve the equation and fill in the answer field
        $("#cptch_input").val(getAnswer());

        // We can auto-populate the other fields for good measure, as a real spambot would
        $("#author").val("Random Guy");
        $("#email").val("randomemail@example.com");
        $("#url").val("http://phishyurl.scam/");
        $("#comment").val("Lorem ipsum and the likes...");
    });
The code block above performs the following tasks:
- Waits for the document to load and then checks to see if the target captcha element exists on the page before proceeding.
- Marks the missing piece of the equation for future reference.
- Cleans up the full captcha string and separates each component into an array.
- Iterates through the components and converts written words into numbers. wordToNumber() is not yet implemented.
- Merges a result that is written as two words (e.g., "thirty six") into a single number.
- Automatically populates the various input fields, including the answer field of the captcha. getAnswer() is not yet implemented.
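The clean-and-split step can also be tried outside the browser. What follows is only a minimal sketch: it assumes the captcha text has already been read from the page with the blank marked by an underscore, and both the helper name `splitCaptcha` and the sample strings are made up for illustration (they are not part of the plugin or of the script above). It also collapses every run of whitespace, which is slightly more aggressive than the `\s{2,}` pattern used in the script.

```javascript
// Hypothetical helper: collapse whitespace runs and split the captcha text
// into its space-separated components.
function splitCaptcha(rawText) {
  return rawText.replace(/\s+/g, ' ').trim().split(' ');
}

// Sample inputs invented for illustration only.
var simple = splitCaptcha('  7   ×  _  =   35 ');
var twoWord = splitCaptcha('five × seven = thirty   five');

console.log(simple.length);   // number of components in the simple case
console.log(twoWord.length);  // one extra component when the result is two words
```

The second sample shows why the main block checks for six components: a result spelled as two words adds one entry to the array, which then has to be merged back into a single number.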
There are still two undefined functions referenced in the code above. The wordToNumber() function takes in a string and converts it to an integer, as follows:
    function wordToNumber(val) {
        // Make sure value is lowercase (strings are immutable, so reassign)
        val = val.toLowerCase();

        // First loop through word bases (0-19) and convert value if found
        for (var wVal in wordBase) {
            if (wVal == val) return wordBase[wVal];
        }

        // If still here, loop through ten bases (20-90) and convert value if found
        for (var wTen in wordTens) {
            // For a two-word result, return only the tens digit so the ones digit can be appended later
            if (wTen == val) return (captchaComponent.length == 6) ? wordTens[wTen] : wordTens[wTen] * 10;
        }

        // Execution will never get here in this test case
        return val;
    }
This function gets around the random written-word obstacle of the captcha plugin.
Here we compare the written-word representation of a number, passed from the main code block, to the lookup tables defined at the top of the script. If the string is "twelve", for instance, then the first loop in this function will find a match in wordBase and return the integer value 12. If the passed value is greater than 19, then we check against the tens-place table, wordTens, instead. When the result is written as two words (six components in total, such as "thirty six"), only the tens digit is returned, so "thirty" becomes 3 and the main block appends the ones digit to form 36. In the event that the word number is exactly 20, 30, 40, etc., the digit is multiplied by 10 to still return the correct value.
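As an alternative to joining digits as strings, the same conversion can be done arithmetically in a single helper. This is a sketch rather than the script's own method: `wordsToInt`, `ONES`, and `TENS` are hypothetical names, the tables are stored as arrays so that the index encodes the value, and the helper makes no attempt to reject odd word orders.

```javascript
// Same words as the script's lookup tables, stored as arrays.
// ONES[i] has the value i; TENS[i] has the value (i + 2) * 10.
var ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
            'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
            'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen'];
var TENS = ['twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy',
            'eighty', 'ninety'];

// Hypothetical helper: convert "twelve", "thirty" or "thirty six" to an integer.
// Returns NaN if any word is not recognized.
function wordsToInt(text) {
  var words = text.toLowerCase().trim().split(/[\s-]+/);
  var total = 0;
  for (var i = 0; i < words.length; i++) {
    var onesIndex = ONES.indexOf(words[i]);
    var tensIndex = TENS.indexOf(words[i]);
    if (onesIndex !== -1) {
      total += onesIndex;              // 0-19 map directly to their index
    } else if (tensIndex !== -1) {
      total += (tensIndex + 2) * 10;   // 'twenty' is index 0, so add 2
    } else {
      return NaN;
    }
  }
  return total;
}

console.log(wordsToInt('twelve'));
console.log(wordsToInt('thirty six'));
```

Because the tens word contributes its full value (30 rather than the digit 3), the caller no longer needs to know whether a second word follows.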
Finally, the getAnswer() function will be defined to parse the known values and operator and make the correct calculation as a result:
    function getAnswer() {
        var val1, val2, answer; // Temporary variables for housing known values and calculated answer
        var operator = captchaComponent[1]; // Original mathematic operator from equation

        // First number is missing (i.e., "_ + 5 = 7")
        if (captchaComponent[0] == "_") {
            val1 = captchaComponent[2];
            val2 = captchaComponent[4];

            // Calculate final answer by applying the inverse operation to the result
            switch (operator) {
                case "−": answer = val1 + val2; break;
                case "+": answer = val2 - val1; break;
                case "×": answer = val2 / val1; break;
                case "/": answer = val2 * val1; break;
            }
        }

        // Second number is missing (i.e., "5 + _ = 7")
        else if (captchaComponent[2] == "_") {
            val1 = captchaComponent[0];
            val2 = captchaComponent[4];

            // Calculate final answer based on known values and operator
            switch (operator) {
                case "−": answer = val1 - val2; break;
                case "+": answer = val2 - val1; break;
                case "×": answer = val2 / val1; break;
                case "/": answer = val1 / val2; break; // "10 / _ = 2" gives 10 / 2
            }
        }

        // Third number is missing (i.e., "5 + 2 = _")
        else {
            val1 = captchaComponent[0];
            val2 = captchaComponent[2];

            // Calculate final answer based on known values and operator
            switch (operator) {
                case "−": answer = val1 - val2; break;
                case "+": answer = val1 + val2; break;
                case "×": answer = val1 * val2; break;
                case "/": answer = val1 / val2; break;
            }
        }

        return answer;
    }
The final piece of the script, above.
There is some redundancy in a few places of the final function that could be consolidated or rewritten, but it gets the job done. There are some nice JavaScript equation solver libraries floating around, but they would be overkill for this small and targeted captcha solving script. In the function above, the position of the missing value within the equation is determined first, then the answer is calculated based on the known values and the given operator. The form could easily be auto-submitted once populated by calling $("#submit-comment").click(); as well.
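One way the redundancy might be consolidated is with a pair of small lookup tables instead of three switch statements. The sketch below is an assumption-laden rewrite, not the script's own code: `solve`, `APPLY`, and `INVERSE` are hypothetical names, and the function takes the component array as an argument (numbers already converted, "_" marking the blank) rather than reading the global variable.

```javascript
// Forward operations, keyed by the operator symbols the script matches on.
var APPLY = {
  '+': function (a, b) { return a + b; },
  '−': function (a, b) { return a - b; },
  '×': function (a, b) { return a * b; },
  '/': function (a, b) { return a / b; }
};

// Inverse operator used to undo an operation: "_ op b = c" gives "c inv b".
var INVERSE = { '+': '−', '−': '+', '×': '/', '/': '×' };

// Hypothetical consolidated solver.
// parts = [left, operator, right, '=', result], with '_' marking the blank.
function solve(parts) {
  var left = parts[0], op = parts[1], right = parts[2], result = parts[4];

  // "a op b = _": apply the operator directly
  if (result === '_') return APPLY[op](left, right);

  // "_ op b = c": undo the operator using the result and the right operand
  if (left === '_') return APPLY[INVERSE[op]](result, right);

  // "a op _ = c": addition and multiplication commute, so undo as above
  if (op === '+' || op === '×') return APPLY[INVERSE[op]](result, left);

  // Subtraction and division do not commute: a − _ = c gives a − c, a / _ = c gives a / c
  return APPLY[op](left, result);
}

console.log(solve(['_', '+', 5, '=', 7]));
console.log(solve([10, '/', '_', '=', 2]));
```

Keeping the non-commutative cases separate makes the asymmetry explicit: a missing right operand under subtraction or division is found by reapplying the same operator to the left operand and the result.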
In conclusion, it is evident that this particular captcha system is quite easy to solve in an automated manner. More dependable and less intrusive solutions, which do not require solving captcha-like puzzles but instead rely on comment analysis and blacklist databases, are widely available. I am currently test-driving the popular hosted anti-spam solution Akismet in conjunction with the Bad Behavior plugin and will report the results at a later date.