Project information

Summary:

This project combined Pillow, OpenCV, and Pytesseract to build an application that searches a zip file of scanned newspaper pages for a given word and returns the faces found on every page that contains it.
First, the images of the newspaper pages were extracted from the zip file using Python's zipfile module. Each scanned page was then converted to text with Pytesseract, and the faces on each page were identified with OpenCV. A function was defined to take a user-supplied search string and look for it in the OCR text to determine which pages contain it. Finally, for each matching page, the detected faces were pasted together using Pillow functions and displayed.
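
The steps above might be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: every function name, the thumbnail size, the column count, and the choice of OpenCV's bundled frontal-face Haar cascade are assumptions made for the example. The OCR, face-detection, and pasting helpers import their libraries lazily, so the zip handling and text search can be tried without Tesseract or OpenCV installed.

```python
import io
import zipfile


def load_pages(zip_path):
    """Read every file in the archive into memory as {filename: raw bytes}."""
    with zipfile.ZipFile(zip_path) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }


def ocr_page(image_bytes):
    """Convert one scanned page to text (needs Pillow, pytesseract, Tesseract)."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))


def find_faces(image_bytes):
    """Return (x, y, w, h) face boxes from a Haar cascade (needs OpenCV, NumPy)."""
    import cv2
    import numpy as np

    gray = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), cv2.IMREAD_GRAYSCALE)
    # Assumption: the frontal-face cascade shipped with the opencv-python package.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    return [tuple(int(v) for v in box) for box in cascade.detectMultiScale(gray)]


def pages_containing(query, page_text):
    """Names of pages whose OCR text contains the query, ignoring case."""
    needle = query.lower()
    return [name for name, text in page_text.items() if needle in text.lower()]


def contact_sheet(image_bytes, boxes, thumb=(100, 100), columns=5):
    """Crop each face box and paste it onto one Pillow image, `columns` per row."""
    from PIL import Image

    page = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    rows = max(1, -(-len(boxes) // columns))  # ceiling division, at least one row
    sheet = Image.new("RGB", (thumb[0] * columns, thumb[1] * rows))
    for i, (x, y, w, h) in enumerate(boxes):
        face = page.crop((x, y, x + w, y + h)).resize(thumb)
        sheet.paste(face, ((i % columns) * thumb[0], (i // columns) * thumb[1]))
    return sheet


def search(zip_path, query):
    """Show one sheet of faces for every page whose text contains the query."""
    pages = load_pages(zip_path)
    text = {name: ocr_page(data) for name, data in pages.items()}
    for name in pages_containing(query, text):
        contact_sheet(pages[name], find_faces(pages[name])).show()
```

In this sketch a call such as search("newspapers.zip", "Mark") would run the whole pipeline, where the file name and the search word are placeholders. OCR is the slowest step, so running it once per page and reusing the resulting text for every later search is likely worthwhile.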