Which OCR toolset is good and why: A comparative study
Keywords: ABBYY FineReader, Calamari, Google Docs, OCR, Tesseract
Optical Character Recognition (OCR) is a very active research area in many scientific disciplines, such as pattern recognition, natural language processing (NLP), computer vision, biomedical informatics, machine learning, and artificial intelligence. The technology extracts text in an editable format (MS Word/Excel, text files, etc.) from PDF files, scanned or handwritten documents, and images (photographs, advertisements, etc.) for further processing. It has been used in many real-world applications, including banking, education, insurance, finance, healthcare, and keyword-based search in documents. Many OCR toolsets are available in various categories, including open source, proprietary, and online services. This paper provides a comparative study of various OCR toolsets across a variety of parameters.