Unlocking Text from Embedded-Font PDFs: A pytesseract OCR Tutorial

from blog Prahlad Yeri, 11 Nov 2024 | ↗ original

Extracting text from a PDF is usually straightforward when it’s in English and doesn’t have embedded fonts. However, when either assumption fails, basic Python libraries like pdfminer or pdfplumber become difficult to use. Last month, I was tasked with extracting text from a Gujarati-language PDF and importing data fields such as...


Extract text from a PDF

John D. Cook | original ↗

Analyzing 50k fonts using deep neural networks

Home on Erik Bernhardsson | original ↗

Confidential OCR

John D. Cook | original ↗

Eloquent JavaScript's Build System

marijnhaverbeke.nl/blog | original ↗

Cropping texts in Python with 'textwrap.shorten'

Redowan's Reflections | original ↗

Mailbag: Parsing Fields from PDFs—When to Use Machine Learning?

Eugene Yan | original ↗

A look at Go lexer/scanner packages

Fatih Arslan | original ↗

Manipulating text with query expressions in Django

Redowan's Reflections | original ↗

Adding PDFs to Your Webpage without JavaScript

Raymond Camden | original ↗

Sample Origami site for an outdoor trekking company

Jan Miksovsky’s blog | original ↗

More from Prahlad Yeri

The ultimate guide to boosting productivity with the Pomodoro Technique

8 Dec 2024 | original ↗

In a world buzzing with distractions, maintaining focus and productivity can feel like trying to catch a greased pig. Whether it’s the relentless ping of notifications, the lure of social media, or the temptation to procrastinate, staying on task requires a solid strategy. One technique that has stood the test of time for its simplicity and...

Master PDF digital signing with eMudhra and Proxkey in .NET: step-by-step guide

2 Dec 2024 | original ↗

One of my recent projects involved coding a simple .NET app that signs PDFs in bulk using a Proxkey digital signature. Purchased, CA-verified digital signatures like eMudhra and Proxkey typically come with their own custom apps that can digitally sign PDFs. You can even use software like Adobe Reader for this purpose. However, the...

Building a Secure and Efficient MQTT-HTTP Gateway with Node.js

1 Dec 2024 | original ↗

One of my recent projects involved creating a Node.js script that acts as a gateway or middleware, capturing messages from an MQTT broker and relaying (streaming) them on the HTTP side via GET requests. Similarly, it should also handle POST requests from the HTTP side and publish them to the corresponding broker on the respective topic. Why use...

Building a simple customer management system in PHP with MySQL

20 Nov 2024 | original ↗

Creating a customer management system (CMS) is a great way for beginners to learn PHP and MySQL. This hands-on project will guide you through the process of building a simple CMS from scratch, covering database design, CRUD operations, and form handling. What is a customer management system? A customer management system is a tool used to store...

The hidden costs of ERP implementation: beyond the initial price tag

17 Nov 2024 | original ↗

When businesses decide to implement an Enterprise Resource Planning (ERP) system, the focus often zeroes in on the initial price tag. However, the true cost of ERP implementation extends far beyond the upfront investment. Hidden costs can significantly impact the overall budget and success of the project. In this article, we will delve into the...
