Cатсн²² (in)sесuяitу / ChrisJohnRiley

Because we're damned if we do, and we're damned if we don't!

Python OCR… or how to break CAPTCHAs

Posted by ChrisJohnRiley on October 12, 2010

After my little stint writing the scr.im PoC script, a few people on Twitter reminded me of a blog post that Andreas Riancho from Bonsai-sec wrote back in February. Andreas (the creator of the excellent W3AF tool) wrote a short Python script to take a CAPTCHA image and perform OCR on it. This piqued my geek interest, but the script relies on the pytesser Python library.

There were a few issues with pytesser:

It’s Windows only and I prefer to avoid Windows unless there’s no other choice
The project only ever reached version 0.0.1
The project has been abandoned since May 2007

So, not wanting to give up on something that looked fun, and also useful, I started a search for an alternative. I quickly found that the pytesser Python library is a wrapper around the tesseract-ocr project, and that there had been some work on another Python library called Python-Tesseract that looks like it does the job (and isn’t platform dependent).
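To make the "wrapper" idea concrete, here is a minimal sketch of what such a library does under the hood. It assumes the tesseract binary is on the PATH and uses its basic command-line form, tesseract <image> <outputbase>, which writes the recognised text to <outputbase>.txt. The function names are hypothetical and are not the API of pytesser or Python-Tesseract.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base):
    """Return the argument list for the tesseract CLI (hypothetical helper)."""
    # tesseract appends ".txt" to output_base on its own
    return ["tesseract", image_path, output_base]


def ocr_file(image_path):
    """Run tesseract on an image file and return the recognised text."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        output_base = os.path.join(tmp_dir, "ocr_result")
        # check=True raises CalledProcessError if tesseract exits with an error
        subprocess.run(build_tesseract_command(image_path, output_base), check=True)
        with open(output_base + ".txt") as handle:
            return handle.read().strip()
```

Calling ocr_file('input-NEAREST.tif') would then return whatever text tesseract recognises in that file. A real wrapper adds language options and error handling on top of this, but the core is just a subprocess call and a file read.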

After installing tesseract-ocr (apt-get install tesseract-ocr on Backtrack), I downloaded the Python-tesseract files and modified the script from Andreas Riancho a little (the actual changes needed to make things work are minimal). I also changed a few things so that the script decodes scr.im CAPTCHA images with reasonable accuracy.

#!/usr/bin/python

# [PoC] tesseract OCR script - tuned for scr.im captcha
#
# Chris John Riley
# blog.c22.cc
# contact [AT] c22 [DOT] cc
# 12/10/2010
# Version: 1.0
#
# Changelog
# 0.1> Initial version taken from Andreas Riancho's \
#      example script (bonsai-sec.com)
# 1.0> Altered to use Python-tesseract, tuned image \
#      manipulation for scr.im specific captchas
#

from PIL import Image

img = Image.open('captcha.jpg') # Your image here!
img = img.convert("RGBA")

pixdata = img.load()

# Make the letters bolder for easier recognition

for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y][0] < 90:
            pixdata[x, y] = (0, 0, 0, 255)

for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y][1] < 136:
            pixdata[x, y] = (0, 0, 0, 255)

for y in xrange(img.size[1]):
    for x in xrange(img.size[0]):
        if pixdata[x, y][2] > 0:
            pixdata[x, y] = (255, 255, 255, 255)

img.save("input-black.gif", "GIF")

#   Make the image bigger (needed for OCR)
im_orig = Image.open('input-black.gif')
big = im_orig.resize((1000, 500), Image.NEAREST)

ext = ".tif"
big.save("input-NEAREST" + ext)

#   Perform OCR using tesseract-ocr library
from tesseract import image_to_string
image = Image.open('input-NEAREST.tif')
print image_to_string(image)

Most of this code is preparation; the actual OCR job is performed in the final lines by the image_to_string call. Simple, isn't it!
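For anyone wanting to experiment with the thresholds, the three preparation passes boil down to one decision per pixel. The sketch below restates that decision in plain Python on RGBA tuples, with no imaging library needed. The names clean_pixel and clean_image and the parameter names are hypothetical; the limits 90 and 136 are the ones from the script.

```python
BLACK = (0, 0, 0, 255)
WHITE = (255, 255, 255, 255)


def clean_pixel(pixel, red_limit=90, green_limit=136):
    """Apply the script's three threshold passes to a single RGBA pixel."""
    red, green, blue, _alpha = pixel
    # Passes 1 and 2: anything dark in the red or green channel becomes black
    if red < red_limit or green < green_limit:
        return BLACK
    # Pass 3: any remaining pixel with some blue in it becomes white
    if blue > 0:
        return WHITE
    # Bright pixels with no blue at all are left untouched by the script
    return pixel


def clean_image(rows):
    """Clean a whole image given as a list of rows of RGBA tuples."""
    return [[clean_pixel(pixel) for pixel in row] for row in rows]


sample = [
    [(40, 40, 40, 255), (200, 200, 200, 255)],
    [(120, 100, 90, 255), (255, 255, 0, 255)],
]
cleaned = clean_image(sample)
```

Here the dark grey pixel and the one with a low green value end up black, the light grey one ends up white, and the pure yellow pixel (no blue) passes through unchanged. Changing red_limit and green_limit is the obvious place to start when tuning for a different CAPTCHA.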

The script is tuned to the scr.im CAPTCHA image, as the examples below show:

As you can see, after running it through some filters (thanks Andreas), the CAPTCHA becomes a lot clearer and significantly easier to OCR. Even in this case, however, tesseract-ocr sometimes returns the value as W6BHP instead of W68HP. Still, that's an easy mistake to make… and I'm sure that with more tweaking, the preparation could be perfected!
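Tweaking the preparation is easier with a number to watch. The following is a minimal sketch, assuming you have a handful of CAPTCHA images whose correct values you have typed in by hand; it scores an OCR result character by character and lists where it went wrong. Both function names are hypothetical.

```python
def char_accuracy(ocr_text, expected):
    """Fraction of character positions where the OCR output is correct."""
    if not expected:
        return 0.0
    matches = sum(1 for got, want in zip(ocr_text, expected) if got == want)
    # Divide by the longer length so extra or missing characters count as errors
    return matches / max(len(ocr_text), len(expected))


def mismatches(ocr_text, expected):
    """List (position, got, wanted) for every character that differs."""
    return [
        (index, got, want)
        for index, (got, want) in enumerate(zip(ocr_text, expected))
        if got != want
    ]


score = char_accuracy("W6BHP", "W68HP")   # 4 of 5 characters correct
errors = mismatches("W6BHP", "W68HP")     # the B that should have been an 8
```

For the example from this post the score is 0.8, and the only mismatch is at position 2, where B was read instead of 8. Averaging the score over a set of labelled images gives a simple way to tell whether a change to the thresholds or the resize step actually helped.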

So, next time somebody says “we implemented a CAPTCHA to prevent scripted attacks”, you can take it with a pinch of salt!

Links:

[PoC] scr.im.tesseract.py script –> here
Breaking Weak CAPTCHA in 26 Lines of Code –> bonsai-sec.com
Pytesser –> here
Tesseract-OCR –> here
Python-Tesseract –> here


7 responses to “Python OCR… or how to break CAPTCHAs”

3ToGo May 10, 2012 at 17:28
Using python tesseract will be easier
http://code.google.com/p/python-tesseract/

