Simon Willison on tesseract

2 items tagged “tesseract”

2024

Running OCR against PDFs and images directly in your browser

I attended the Story Discovery At Scale data journalism conference at Stanford this week. One of the perennial hot topics at any journalism conference concerns data extraction: how can we best get data out of PDFs and images?

[... 2263 words]

5:59 pm / 30th March 2024 / datajournalism, ocr, projects, tesseract, aiassistedprogramming

2007

tesseract-ocr. Open source OCR, sponsored by Google. I just sat in on a talk on this at OSCON and the complexity of the problem is pretty incredible. 26th July 2007, 8:23 pm

Simon Willison’s Weblog
