Cатсн²² (in)sесuяitу / ChrisJohnRiley

Because we're damned if we do, and we're damned if we don't!

Python OCR… or how to break CAPTCHAs

After my little stint writing the scr.im PoC script, a few people on Twitter reminded me of a blog post that Andrés Riancho from Bonsai-sec wrote back in February. Andrés (the creator of the excellent w3af tool) wrote a short Python script that takes a CAPTCHA image and runs OCR on it. As a geek, this piqued my interest, but the one problem I had was that the script relies on the pytesser Python library, which is Windows only!

There were a few issues with pytesser:

  1. It’s Windows only, and I prefer to avoid Windows unless there’s no other choice
  2. The project only ever reached version 0.0.1
  3. The project has been abandoned since May 2007

So, not wanting to give up on something that looked fun, and also useful, I started a search for an alternative. I quickly found that the pytesser Python library is a wrapper around the tesseract-ocr project, and that there had been some work on another Python library called Python-Tesseract that looks like it does the job (and isn’t platform dependent).

After installing tesseract-ocr (apt-get install tesseract-ocr on BackTrack), I downloaded the Python-tesseract files and modified the script from Andrés Riancho a little (the actual changes needed to make things work are minimal). I also changed a few things so that the script decodes scr.im CAPTCHA images reasonably accurately.

#!/usr/bin/python

# [PoC] tesseract OCR script - tuned for scr.im captcha
#
# Chris John Riley
# blog.c22.cc
# contact [AT] c22 [DOT] cc
# 12/10/2010
# Version: 1.0
#
# Changelog
# 0.1> Initial version taken from Andrés Riancho's \
#      example script (bonsai-sec.com)
# 1.0> Altered to use Python-tesseract, tuned image \
#      manipulation for scr.im specific captchas
#

from PIL import Image

img = Image.open('captcha.jpg') # Your image here!
img = img.convert("RGBA")

pixdata = img.load()

# Make the letters bolder for easier recognition

for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y][0] < 90:
            pixdata[x, y] = (0, 0, 0, 255)

for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y][1] < 136:
            pixdata[x, y] = (0, 0, 0, 255)

for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y][2] > 0:
            pixdata[x, y] = (255, 255, 255, 255)

img.save("input-black.gif", "GIF")

#   Make the image bigger (needed for OCR)
im_orig = Image.open('input-black.gif')
big = im_orig.resize((1000, 500), Image.NEAREST)

ext = ".tif"
big.save("input-NEAREST" + ext)

#   Perform OCR using tesseract-ocr library
from tesseract import image_to_string
image = Image.open('input-NEAREST.tif')
print image_to_string(image)

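Most of this code is preparation: the three loops push each pixel towards pure black or pure white, and the resize enlarges the result for tesseract. The actual OCR job is performed in the final lines using the image_to_string call. Simple, isn’t it!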
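To make the three filter passes easier to follow, here is a minimal sketch of the same per-pixel rule applied to plain RGBA tuples, with no PIL involved. The thresholds (90, 136, 0) are the ones from the script; the names binarize_pixel and binarize and the sample pixels are hypothetical and only there for illustration. Because each pass only reads and writes a single pixel, running the three checks one after another on each pixel should give the same result as three full passes over the image.

```python
BLACK = (0, 0, 0, 255)
WHITE = (255, 255, 255, 255)


def binarize_pixel(pixel):
    """Apply the script's three threshold checks, in order, to one RGBA tuple."""
    if pixel[0] < 90:    # pass 1: low red value -> black
        pixel = BLACK
    if pixel[1] < 136:   # pass 2: low green value -> black
        pixel = BLACK
    if pixel[2] > 0:     # pass 3: any blue left -> white
        pixel = WHITE
    return pixel


def binarize(pixels):
    """Run binarize_pixel over a flat list of RGBA tuples."""
    return [binarize_pixel(p) for p in pixels]


# Made-up sample pixels: dark grey, orange-brown, light grey, pure yellow
sample = [
    (30, 30, 30, 255),
    (200, 100, 50, 255),
    (230, 230, 230, 255),
    (255, 255, 0, 255),
]
result = binarize(sample)
print(result)
```

Note the last sample: a pixel with high red and green but zero blue slips through all three checks unchanged, so the filter is not a strict black/white threshold for every possible colour.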

The PoC script is tuned to scr.im CAPTCHA images, as the examples below show:

As you can see, after running it through some filters (thanks Andrés), the CAPTCHA becomes a lot clearer and significantly easier to OCR. Even so, tesseract-ocr sometimes returns W6BHP instead of W68HP. Still, that’s an easy mistake to make… and I’m sure that with more tweaking, the preparation could be perfected!
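One possible tweak on the output side, rather than the image preparation, is to expand the OCR result into a short list of look-alike guesses that could be tried in turn. The sketch below is only an assumption about how that might look and is not part of the PoC script: the CONFUSABLE table and the candidates function are hypothetical, and the table would need extending for whatever characters a given CAPTCHA font tends to confuse.

```python
from itertools import product

# Hypothetical table of look-alike characters; each key maps to the
# characters it could stand for (itself first).
CONFUSABLE = {
    "B": "B8",
    "8": "8B",
    "O": "O0",
    "0": "0O",
}


def candidates(text):
    """Return every variant of text with look-alike characters swapped in."""
    options = [CONFUSABLE.get(ch, ch) for ch in text]
    return ["".join(combo) for combo in product(*options)]


guesses = candidates("W6BHP")
print(guesses)
```

For the misread above, this yields both W6BHP and W68HP. The number of guesses doubles with every ambiguous character, so this only makes sense for short strings with a small table.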

So, next time somebody says “we implemented a CAPTCHA to prevent scripted attacks”, you can take it with a pinch of salt!

Links:

  • [PoC] scr.im.tesseract.py script –> here
  • Breaking Weak CAPTCHA in 26 Lines of Code –> bonsai-sec.com
  • Pytesser –> here
  • Tesseract-OCR –> here
  • Python-Tesseract –> here