OCR accuracy can drop sharply when a document image is poorly prepared, so it helps to do some preprocessing before we feed the image to the OCR engine.
Here are the steps I perform before running OCR.
Step 1: Proper dimensions
I am using Tesseract, so it's better to store the image at 300 DPI. If your image contains more than about 300 words, it's better to make its dimensions around 2500 x 2500 pixels.
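from PIL import Image

image = Image.open(filename)
image = image.convert(mode='L')  # grayscale
length_x, width_y = image.size  # (width, height) in pixels
factor = max(1, float(2500.0 / length_x))  # upscale only if narrower than 2500 px
size = int(factor * length_x), int(factor * width_y)
image = image.resize(size, Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
image.save("image.png", dpi=(300, 300))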
First of all, convert the image to grayscale for a better result, because Tesseract performs best on high-contrast, binary-like images.
Then I check whether the image width is less than 2500 pixels and, if so, upscale the image by the factor shown in the code. Wider images are left at their original size.
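To make the resize rule easier to see on its own, here is a minimal sketch of the size calculation as a plain function. The name `upscale_size` and the `min_width` parameter are my own, purely for illustration, and I use `round()` instead of `int()` so that floating-point error cannot cost a pixel.

```python
def upscale_size(width, height, min_width=2500):
    """Return (new_width, new_height) so that the width is at least min_width.

    Hypothetical helper: images that are already wide enough keep their size,
    which is what max(1, 2500.0 / length_x) does in the snippet above.
    """
    factor = max(1.0, min_width / width)
    return round(factor * width), round(factor * height)


print(upscale_size(1000, 800))   # (2500, 2000)
print(upscale_size(3000, 2000))  # (3000, 2000)
```

The resulting tuple can be passed straight to `image.resize(...)`.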
Step 2: Threshold and Denoise
For OCR, the image should be thresholded (binarized) to get a good result. The following example calculates a local threshold for each pixel from its neighborhood instead of using one fixed value for the whole image. There are 4 methods (generic, mean, median, gaussian) you can play with.
# Threshold
import cv2
from skimage.filters import threshold_local
image = cv2.imread(filename)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
T = threshold_local(gray, 15, offset=6, method="gaussian")  # generic, mean, median, gaussian
thresh = (gray > T).astype("uint8") * 255
thresh = ~thresh  # invert so the text is white on a black background
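If you are curious what a local threshold does under the hood, here is a rough NumPy-only sketch of the "mean" variant: each pixel is compared with the average of its neighborhood minus an offset. This is my own simplified illustration, not the scikit-image implementation, and the function name `local_mean_threshold` is hypothetical. It assumes an odd `block_size` and a 2-D grayscale array.

```python
import numpy as np


def local_mean_threshold(gray, block_size=15, offset=6):
    """Binarize a 2-D uint8 array against the mean of each pixel's neighborhood.

    Simplified sketch (assumes block_size is odd); border handling may differ
    from what a library implementation does.
    """
    k = block_size
    pad = k // 2
    # Reflect-pad so pixels near the border also get a full k x k window.
    padded = np.pad(gray.astype(np.float64), pad, mode="reflect")
    # Integral image with a leading row and column of zeros.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    # Sum of every k x k window from four corner lookups.
    window_sum = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
                  - integral[k:k + h, :w] + integral[:h, :w])
    threshold = window_sum / (k * k) - offset
    return (gray > threshold).astype(np.uint8) * 255


# Synthetic "page": light background with a small dark blob standing in for ink.
page = np.full((20, 20), 200, dtype=np.uint8)
page[8:12, 8:12] = 20
binary = local_mean_threshold(page)
print(binary[10, 10], binary[0, 0])  # 0 255
```

Here the ink comes out black (0) on white (255). Inverting the result, as in the snippet above, gives white text on black, which is what the connected-component step expects.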
When your document has extra noise around the text, it can throw off the whole OCR prediction. So it’s better to give the OCR a clean image. Here we do exactly that by filtering out small dotted regions.
import numpy as np

# Erode then dilate (morphological opening). Note: a 1x1 kernel leaves the image unchanged; try (2, 2) for a visible effect.
kernel = np.ones((1, 1), np.uint8)
ero = cv2.erode(thresh, kernel, iterations=1)
img_dilation = cv2.dilate(ero, kernel, iterations=1)

# Remove noise
nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(img_dilation, None, None, None, 8, cv2.CV_32S)
sizes = stats[1:, -1]  # CC_STAT_AREA of each component, skipping the background (label 0)
final = np.zeros(labels.shape, np.uint8)
for i in range(0, nlabels - 1):
    if sizes[i] >= 10:  # filter small dotted regions
        final[labels == i + 1] = 255
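The loop above scans the whole label image once per component, which can get slow on noisy pages with thousands of specks. As a possible alternative, here is a small vectorized sketch that uses a lookup table instead. The helper name `keep_large_components` is my own, and it assumes `areas` includes the background at index 0, so with OpenCV you would pass `stats[:, cv2.CC_STAT_AREA]` rather than `stats[1:, -1]`.

```python
import numpy as np


def keep_large_components(labels, areas, min_area=10):
    """Return a uint8 mask: 255 where the pixel's component has area >= min_area.

    Hypothetical helper.
    labels: 2-D int array, 0 = background, 1..n = components.
    areas:  1-D array of length n + 1, areas[i] = pixel count of label i.
    """
    keep = np.asarray(areas) >= min_area  # one boolean per label
    keep[0] = False                       # never keep the background
    return keep[labels].astype(np.uint8) * 255


# Tiny example: label 1 covers 4 pixels, label 2 is a single speck.
labels = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 2]])
areas = np.bincount(labels.ravel())  # [7, 4, 1]
mask = keep_large_components(labels, areas, min_area=2)
print(mask.tolist())  # [[0, 255, 255, 0], [0, 255, 255, 0], [0, 0, 0, 0]]
```

The result should match what the loop produces; the only difference is that every pixel is looked up once instead of once per component.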
Step 3: Deskew
The document might be rotated if it was not placed properly while scanning or when you took the photo. This can confuse the OCR system, and you may get no result at all if it is not able to make sense of the image.
Special thanks to Stéphane Brunner, the creator of the deskew library.
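import numpy as np
from skimage import io
from skimage.color import rgb2gray
from skimage.transform import rotate
from deskew import determine_skew

image = io.imread('input.png')
grayscale = rgb2gray(image)
angle = determine_skew(grayscale)  # estimated skew angle in degrees
rotated = rotate(image, angle, resize=True) * 255  # rotate() returns floats in [0, 1]
io.imsave('output.png', rotated.astype(np.uint8))  # 'output.png' is just an example path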