Cатсн²² (in)sесuяitу / ChrisJohnRiley

Because we're damned if we do, and we're damned if we don't!

Tag Archives: CAPTCHA

Python OCR… or how to break CAPTCHAs

After my little stint writing the scr.im PoC script, a few people on Twitter reminded me of a blog post that Andreas Riancho from Bonsai-sec wrote back in February. Andreas (the creator of the excellent W3AF tool) wrote a short Python script to take a CAPTCHA image and perform an OCR on it. As a geek, this piqued my interest, but the one problem I had with it was that the script relied on the pytesser Python library, which is Windows only!

There were a few issues with that.

  1. It’s Windows only and I prefer to avoid Windows unless there’s no other choice
  2. The project only ever reached version 0.0.1
  3. The project has been abandoned since May 2007

So, not wanting to give up on something that looked fun, and also useful, I started a search for an alternative. I quickly found that the pytesser Python library is a wrapper around the tesseract-ocr project, and that there had been some work on another Python library called Python-Tesseract that looks like it does the job (and isn’t platform dependent).

After installing tesseract-ocr (apt-get install tesseract-ocr on Backtrack) I downloaded the Python-tesseract files and modified the script from Andreas Riancho a little (the actual changes to make things work are minimal). I also changed a few things to get the script to reasonably accurately decode scr.im captcha images.


# [PoC] tesseract OCR script - tuned for scr.im captcha
# Chris John Riley
# blog.c22.cc
# contact [AT] c22 [DOT] cc
# 12/10/2010
# Version: 1.0
# Changelog
# 0.1> Initial version taken from Andreas Riancho's \
#      example script (bonsai-sec.com)
# 1.0> Altered to use Python-tesseract, tuned image \
#      manipulation for scr.im specific captchas

from PIL import Image

img = Image.open('captcha.jpg') # Your image here!
img = img.convert("RGBA")

pixdata = img.load()

# Make the letters bolder for easier recognition

for y in xrange(img.size[1]):
 for x in xrange(img.size[0]):
 if pixdata[x, y][0] < 90:
 pixdata[x, y] = (0, 0, 0, 255)

for y in xrange(img.size[1]):
 for x in xrange(img.size[0]):
 if pixdata[x, y][1] < 136:
 pixdata[x, y] = (0, 0, 0, 255)

for y in xrange(img.size[1]):
 for x in xrange(img.size[0]):
 if pixdata[x, y][2] > 0:
 pixdata[x, y] = (255, 255, 255, 255)

img.save("input-black.gif", "GIF")

#   Make the image bigger (needed for OCR)
im_orig = Image.open('input-black.gif')
big = im_orig.resize((1000, 500), Image.NEAREST)

ext = ".tif"
big.save("input-NEAREST" + ext)

#   Perform OCR using tesseract-ocr library
from tesseract import image_to_string
image = Image.open('input-NEAREST.tif')
print image_to_string(image)

A majority of this code is preparation, the actual OCR job is performed in the final lines using the image_to_string call. Simple isn’t it!

The above script is tuned to the scr.im captcha image. As can be seen by the below examples:

As you can see, after running it through some filters (thanks Andreas), the CAPTCHA becomes a lot clearer, and significantly easier to OCR. Even in this case however, tesseract-ocr sometimes returns the value as W6BHP instead of W68HP. Still, that’s an easy mistake to make… and I’m sure with more tweaking, the preparation could be perfected!

So, next time somebody says “we implemented a CAPTCHA to prevent scripted attacks“, you can take it with a pinch of salt!


  • [PoC] scr.im.tesseract.py script –> here
  • Breaking Weak CAPTCHA in 26 Lines of Code –> bonsai-sec.com
  • Pytesser –> here
  • Tesseract-OCR –> here
  • Python-Tesseract –> here

The secrets of scr.im

Update [12/10/2010]: Since writing this back in October 2009, I’ve written a Proof of Concept Python tool to exploit the vulnerabilities discussed. More information can be found here.

A few days back I was alerted to a new website that was offering a new way to hide your email address online. It sounded interesting so I headed over to the scr.im website and had a quick poke around. As you can guess, I’m a bit suspicious of these kind of services and web application security is what I do (at least recently). So there were a few things that really jumped out at me straight off in reference to the sites use of CAPTCHA.

scr.imNow I’m pretty sure a big portion of people reading this are already saying WTF just by looking at the above screenshot. This is not the way to use CAPTCHA. For a visitor to the site this is nice and quick, however for a script this is just a matter of playing the odds. The correct CAPTCHA has to be one of the 9 possible links displayed, certainly much better odds than attempting to crack the CAPTCHA itself. If a link is selected and it’s incorrect, the CAPTCHA screen can be reloaded (with new CAPTCHA options of course) and you can try again. Sure for a human it’s tedious to do this, it took me 11 tries to get through by simply always selecting the first CAPTCHA option (top left corner, Y9VJJ in the above screenshot). However for a scripted attack, this is a no brainer and shouldn’t take more than a few seconds. The slowest part would be the page reload. There doesn’t seem to be any timeout, lockout or any other such protections to prevent this kind of attack. Still the page reload is the bottleneck here.

burp_scrimThe above solution was too messy for me, and I hate nothing more than having to click buttons constantly, it bores me. So how can we do the same thing, but work more effectively, with less overhead. By taking a look at the scrim.js JavaScript you can see how the CAPTCHA buttons translate directly to POST requests. Although the value of the CAPTCHA is obfuscated, the code is simple enough to understand if you use BURP Suite to capture and examine the requests and responses. The Burp screenshot (LEFT) shows the POST request that sends the CAPTCHA together with a token number and some other information. Initially I thought this token value was used as a replay prevention function, however by capturing all 9 possible POST requests from a simple scr.im page (using Burp naturally), I found that you could send each request in sequence to the remote server until the correct CAPTCHA is found and the desired email address is returned.

As with most web applications the POST request isn’t strictly enforced on the server side. As such you can easily change this to a single GET request (e.g. http://scr.im/test?captcha=46UU8&action=view&token=e6f0006403ab0c714034f25bc571aa93&ajax=y) which makes things a little easier when scripting a solution.

The screenshots below show the responses from the scr.im server (both negative and positive).


scrim negative response

scrim negative response



scrim positive response

scrim positive response



As most of the heavy work is done on the client-side using JavaScript (obfuscation of the CAPTCHA value, etc..), I don’t think it would take much for a good scripter (that rules me out most likely) to script up something that could quite simply go through and harvest addresses from the site. Normally this wouldn’t surprise me much as spammers are harvesting emails all the time from various sources. However scr.im is placing itself as a way to “… protect your email address before sharing it, so only real people will use it …” That’s a problem, as most people will blindly think that the service must be secure.

Don’t get me wrong here, the idea is sound, and I’d really like to use something like this myself under certain circumstances (not for my primary accounts, but for some things certainly). So what can be done to fix things up .:

  • Add protection to prevent multiple options from being selected from the same CAPTCHA options (one-time token) *
    • Resolves: The possibility to select ALL 9 CAPTCHA options and scraping the responses
    • Issue: Doesn’t stop an attacker reloading and trying again
  • Add protection to stop more than X number of requests per URL, per second/minute *
    • Resolves: Brute-Force style attacks, automated attacks (to some extent)
    • Issue: Could cause DoS against peoples links
  • If the wrong CAPTCHA is selected X times, revert to traditional CAPTCHA entry *
    • Resolves: Brute-Force style attacks, automated attacks
    • Issues: Same issues as traditional CAPTCHA, breakable

* Won’t prevent all possible attacks if implemented alone

By mixing and matching the above solutions the attack vectors could easily be reduced, but never truly removed completely. However with the use of the 3×3 grid CAPTCHA there is always that 1 in 9 chance of the attacker choosing the correct CAPTCHA on the first try. Given these odds a single attacker could (in theory) harvest 11.11% of email addresses on the first attempt purely by guessing (even with the above implemented restrictions). Depending on how/what protections are implemented, the attacker could then come back 8 more times (within minutes, hours, days) to scrape the remaining addresses. There will certainly be email addresses not scraped, but for a spammer, this isn’t a big issue.

With this in mind, the only workable solution is to alter the way CAPTCHA protection is implemented to go with the more traditional “typing” option. It’s tried and tested, and although it’s not perfect, it does resolve a number of the issues faced by scr.im currently.

Disclaimer: I’m not a developer, and I have the utmost respect for those who take time to put up services like this online. Hopefully people can learn from what I’ve explained here, so we don’t all make the same mistakes next time.