OCR and Record Linkage
Topics
This post covers the sixteenth lecture in the course: “OCR and Record Linkage.”
Optical Character Recognition (OCR) is central to converting unstructured image data into structured, computable text for economic analyses.
Lecture Video
References Cited in Lecture 16: OCR and Record Linkage
OCR Papers
Carlson, Jacob, Tom Bryan, and Melissa Dell. “Efficient OCR for Building a Diverse Digital History.” arXiv preprint arXiv:2304.02737 (2023).
B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2016).
Graves, Alex, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks.” In Proceedings of the 23rd International Conference on Machine Learning, pp. 369-376. 2006.
Hannun, Awni. “Sequence modeling with CTC.” Distill 2, no. 11 (2017): e8.
Y. Du, Z. Chen, C. Jia, X. Yin, T. Zheng, C. Li, Y. Du, Y.-G. Jiang, SVTR: Scene text recognition with a single visual model, arXiv preprint arXiv:2205.00159 (2022).
M. Li, T. Lv, L. Cui, Y. Lu, D. Florencio, C. Zhang, Z. Li, F. Wei, TrOCR: Transformer-based optical character recognition with pre-trained models, arXiv preprint arXiv:2109.10282 (2021).
G. Song, Y. Liu, X. Wang, Revisiting the sibling head in object detector, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11563–11572 (2020).
Du, Y., Li, C., Guo, R., Cui, C., Liu, W., Zhou, J., … & Ma, Y. (2021). PP-OCRv2: bag of tricks for ultra lightweight OCR system. arXiv preprint arXiv:2109.03144. (PaddleOCR, https://arxiv.org/abs/2109.03144)
Noisy Data and Downstream Tasks
Srivastava, Ankit, Piyush Makhija, and Anuj Gupta. “Noisy Text Data: Achilles’ Heel of BERT.” In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 16-21. 2020.
S. Rijhwani, A. Anastasopoulos, G. Neubig, OCR post correction for endangered language texts, arXiv preprint arXiv:2011.05402 (2020).
Dubey, Shiv Ram. “A decade survey of content based image retrieval using deep learning.” IEEE Transactions on Circuits and Systems for Video Technology 32, no. 5 (2021): 2687-2704.
El-Nouby, Alaaeldin, Natalia Neverova, Ivan Laptev, and Hervé Jégou. “Training vision transformers for image retrieval.” arXiv preprint arXiv:2102.05644 (2021). https://github.com/jhgan00/image-retrieval-transformers
Multimodal Record Linking
Arora, Abhishek, Xinmei Yang, Shao Yu Jheng, and Melissa Dell. “Linking Representations with Multimodal Contrastive Learning.” arXiv preprint arXiv:2304.03464 (2023).
Popular OCR Frameworks
EasyOCR (JaidedAI)
PaddleOCR (PaddlePaddle)
Tesseract (HP, Google, Open Source)
TrOCR (Microsoft)
Google Cloud Vision (Paid API)
Baidu OCR (Paid API)
Image Source: Carlson, Jacob, Tom Bryan, and Melissa Dell. (2023) Efficient OCR for Building a Diverse Digital History.