PDF Table Extractor

Abhinav Kumar
3 min readFeb 13, 2021

--

Extract Tabular data from the scanned(non-searchable pdf) documents

Pdf, a portable document format, used widely to exchange information which includes text, audio, video and the almighty Tabular data.

“Look! everywhere I go, you are” by Douglas Ross

So how to handle Tabular Data, one can do is simply copy searchable pdf and paste it to excel. I said okay, but how you segregate data in a different column. Cus all copied data will be paste on the single column. So, one can suggest, now we will use text to column property of excel. I said awesome let's do it.

Nice work mate but where is “table”?

Simply copying is not a solution. We have to use a programming language. So to handle this type of data we have numerous python module that can automate this type of work, like Camelot, Tabula, Pypdf2 etc.

But this will work great if pdf is searchable, what if it's not. In the case, our first aim is obvious, convert it to the searchable pdf. With these 3 simple steps, one can easily attain the required result.

  1. Convert pdf to image:- As you can’t extract text directly from scanned pdf, so you have to take some other route. One comes to mind was using text extraction from the image(officially known as Optical Character Recognition), cus scanned document is basically an image. So, to convert pdf to image there is a python module pdf2image. It will convert the PDF document to an image of any format.
  2. Convert Image to number:- To apply any machine or deep learning model to an image we first have to convert it to a number. And whenever we deal with an image there is a popular python module OPENCV. It will convert image to NumPy array, a collection of the number on which we will apply our next step.
  3. Convert numbers to pdf:- The Last step was an important one. In this step, we will apply OCR. So for this, there is a python module pytesseract. It is a wrapper for Google’s Tesseract-OCR Engine. You have to install Tesseract first for this, Tesseract v5 is the deep learning model for text recognition.

Voilà!! you got searchable pdf.

Sometimes after all this, using the above-mentioned python package, the result is not up to the mark. In that case, you have to tune up some parameter to get the desired result.

Conclusion

In the end, I learned something new, and I hope you found it useful. Thank you for reading my overview! Please leave comments below with suggestions.

I will leave with some more examples.

--

--

Abhinav Kumar
Abhinav Kumar

Written by Abhinav Kumar

Data enthusiast deeply passionate about power of data. Sifting through datasets and bringing insights to life through visualization is my idea of a good time.

No responses yet