Hacker News

We tried all of the models from OpenAI and Google to extract data from images, and all of them made "mistakes".

The images are tables with 4 columns and 10 rows of numbers, plus metadata above the table in a couple of fields. We had thousands of images already loaded, and when we went back and checked those previously loaded images we found quite a few errors.
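Since the expected shape (4 columns, 10 rows of numbers) is known up front, a structural sanity check can catch many extraction errors automatically. A minimal sketch, not the commenters' actual tooling; the `validate_table` helper and its error messages are hypothetical:

```python
def validate_table(rows, n_rows=10, n_cols=4):
    """Flag structural problems in an extracted numeric table.

    `rows` is a list of rows, each a list of cell strings. Returns a list
    of human-readable error descriptions (empty means the table passed).
    """
    errors = []
    if len(rows) != n_rows:
        errors.append(f"expected {n_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        if len(row) != n_cols:
            errors.append(f"row {i}: expected {n_cols} columns, got {len(row)}")
        for j, cell in enumerate(row):
            try:
                float(cell.replace(",", ""))  # tolerate thousands separators
            except ValueError:
                errors.append(f"row {i}, col {j}: non-numeric value {cell!r}")
    return errors
```

A check like this won't catch a digit misread as another digit, but it cheaply flags dropped rows, merged cells, and OCR garbage before the data is loaded.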



Multimodal LLMs are not up to these tasks, imo. They can describe an image, but they're not great with tables and numbers. On the other hand, using something like Textract to get a text representation of the table and then feeding that into an LLM was a massive success for us.
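For the Textract route, the table comes back as a flat list of `Block` dicts (from `AnalyzeDocument` with the `TABLES` feature): `CELL` blocks carry `RowIndex`/`ColumnIndex` and point at `WORD` children via `Relationships`. A sketch of rebuilding the rows from that response, assuming the standard Textract block shapes (the API call itself, which needs boto3 and AWS credentials, is omitted):

```python
def table_from_blocks(blocks):
    """Rebuild a table (list of rows of cell strings) from Textract Blocks."""
    by_id = {b["Id"]: b for b in blocks}
    rows = {}  # RowIndex -> {ColumnIndex: cell text}
    for b in blocks:
        if b["BlockType"] != "CELL":
            continue
        words = []
        for rel in b.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for cid in rel["Ids"]:
                    child = by_id[cid]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        rows.setdefault(b["RowIndex"], {})[b["ColumnIndex"]] = " ".join(words)
    # Sort by row then column so the output matches the visual layout.
    return [[cells[c] for c in sorted(cells)] for _, cells in sorted(rows.items())]
```

The resulting rows can be serialized (CSV, markdown) and handed to the LLM as plain text, which sidesteps the vision model's number-reading errors entirely.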


LLMs don't offer much value for our use case; almost all of the values are just numbers.


Then you should be using something like Textract or other tooling in that space. Multimodal LLMs are no replacement.


We use OpenCV + Tesseract and EasyOCR.


Curious, did that make you "fall back" to more conservative OCR?

Or what else did you do to correct them?


We already had an OCR solution. We were exploring models in case the information source changes.



