OCR to table: page-text column (page break to cell) + images column (from folder)

Maurycy

Hi and Welcome!

We have several books and documents photographed or scanned.
We want to have them as editable text (and eventually in a database).
The amount is overwhelming and it will take enormous time and manpower.
So we need an in-between step. Here is what we want to do:

Run each text through an OCR program. We correct obvious errors of the OCR engine with a quick page-by-page scroll (make a table where OCR didn't recognize a table, or make a text area where it recognized text as an image, etc.), but the text is by no means proofread (and it often contains gibberish, sometimes on most of the pages).

The result we get is a Word file - let's call it WORD-1: formatted (size, font, color, bold, italics, superscript, subscript, color highlight for errors suspected by the OCR app), with inline pictures, tables, Unicode, etc. The file is divided by page breaks - we have as many Word pages as we had page images.
Often many pages are a mess, with weird paragraph styles, font sizes, etc., but the page breaks are correct. About 50 to 90% of the text of the entire book/document comes out correct, so it's good. When someone works on a specific page, the provisional OCR is already done. They only have to compare the OCR-ed page with the scanned version to find the errors. And when done, they can save the file with what they proofread highlighted in light green.
The problem is that each time one has to navigate through folders to find the image file of the specific page.

How I see we can improve it:
Have a new Word file - let's call it WORD-2 - that contains a table with 2 (or 3, or 4) columns:

  1. Complex formatted text column: one page of WORD-1 per cell (one page below another).
  2. Image column: smaller image files of all pages (one file per page)
  3. Maybe a page number column
  4. Maybe a notes column
A sort of flat-design table, but with very complex content in the cells of the page-text column. I believe I cannot make a real database or Excel spreadsheet with such messy content without spending several lifetimes cleaning it up.
Plus, a Word file can be opened and edited by anyone, even on a phone.
For working in the infinite-scroll "Web Layout" view as opposed to "Print Layout" (if I have the English names right), the size of the table doesn't matter and I think it will work great, as we have tried doing that manually with several books. Doing it in Excel and Access was impossible (the content and size of a cell are limited, and quick scrolling through such cell-pages is impossible). I think there is no better ready-made option for me without paying for the design of a new database environment (this work is a hobby we do in our free time).



I am kindly asking for help with making Word somehow build the WORD-2 file for me, being told only which WORD-1 file and which folder of images to use, and pairing them up as columns of the table.
If it messes up by mis-pairing images with pages, I can work with that by cutting WORD-1 into chunks and pairing each with the appropriate portion of the pictures. I am not stupid, and I am not expecting it to be smarter than me if I messed up the number or ordering of the image files, or if OCR skipped a page or treated one as two - I will deal with that. But manually copying and pasting each page twice into a table is something I would like to have done for me.
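To make the pairing rule concrete, here is a minimal sketch in Python (not a Word macro) of what "pair them up" could mean. It assumes the image names are already in page order and that the page count of WORD-1 is known; the function names `pair_pages_with_images` and `count_mismatch` are made up for illustration.

```python
from itertools import zip_longest


def pair_pages_with_images(page_count, image_names):
    """Return one (page_number, image_name) row per page or image.

    Assumption: image_names is already in page order. A missing side is
    None, so a count mismatch stays visible as an incomplete row at the
    end instead of being silently dropped.
    """
    page_numbers = range(1, page_count + 1)
    return list(zip_longest(page_numbers, image_names))


def count_mismatch(page_count, image_names):
    """Positive: more pages than images. Negative: more images than pages."""
    return page_count - len(image_names)


# Example: three Word pages but only two scans.
for page_number, image_name in pair_pages_with_images(
    3, ["scan_001.jpg", "scan_002.jpg"]
):
    print(page_number, image_name)
```

A non-zero `count_mismatch` would be the signal to cut WORD-1 into chunks and pair each chunk with its own part of the images by hand.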


Can I ask for help with that?
I have zero knowledge (or not much more than that :D) about macros, scripts, programming or other fancy stuff.

Even one of these things (images in a folder to images in a table, one per row; or Word pages to table rows) would help me. But having help with both would be like a dream come true in comparison to the mess we have with these files now (some are OCR-ed and proofread, some are rough OCRs, some are OCR-ed in fragments, some are only images - it's hell when you want instant access to a few of them at the same time and to quote something from them).
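For the "images in a folder, one per row" half, the order of the files is the part most likely to go wrong, because plain alphabetical sorting puts page10 before page2. Below is a rough Python sketch of listing a folder in page order. The list of file extensions and the names `natural_key` and `list_page_images` are assumptions for illustration, not anything Word provides.

```python
import re
from pathlib import Path

# Assumed image formats; adjust to whatever the scans actually use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def natural_key(name):
    """Sort key that compares digit runs as numbers ('p2' before 'p10')."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at odd indexes.
    return [int(part) if index % 2 else part.lower()
            for index, part in enumerate(parts)]


def list_page_images(folder):
    """Return the image file names in a folder, in natural (page) order."""
    names = [path.name for path in Path(folder).iterdir()
             if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES]
    return sorted(names, key=natural_key)


print(sorted(["page10.jpg", "page2.jpg", "page1.jpg"], key=natural_key))
# prints ['page1.jpg', 'page2.jpg', 'page10.jpg']
```

Each name in the resulting list would then become one row of the image column.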

Thanks
Maurycy
 