PDF to Modules (1 Viewer)

dcavaiani

Registered User.
Local time
Today, 14:05
Joined
May 26, 2014
Messages
385
What is THE Best way OR the best product to get a pdf into a text file from where I can process it with VBA?

I do this on another application where I can copy content right off of a web page, paste it into Excel, transfer it to an Access Table, and then process it with the VBA code.

This time, however, I have to start with a .pdf attached to an email. The .pdf could be a supplier's invoice or maybe an Account Statement.
 

theDBguy

I’m here to help
Staff member
Local time
Today, 12:05
Joined
Oct 29, 2018
Messages
21,453
Hi. I don't know about "best," but maybe you could start with PDFtk. Good luck!
 

sxschech

Registered User.
Local time
Today, 12:05
Joined
Mar 2, 2010
Messages
792
I haven't used pdftk to extract text, however, I have used it to extract specific pages which I then used another program (Word) to perform the extract. If the files are small or the time it takes to process the text is acceptable, you may not need pdftk to extract the pages, though only needing the specific pages will speed up the process. It will take a bit of work to set up as you will need to identify if the data are from tables or part of the text. If part of the text, you will need to locate keywords in order to extract the portion of the data you need. If there is no consistency, it will be a challenge to achieve your goal. These samples are only extracting one or two items, but could be expanded to extract more.

To use the code, you will need to have a version of Word capable of opening PDFs



This function extracts the text that contains the key word "Enclosure"
Code:
Function EnclosureText(sFile As String)
'1. To save time Extract specific page: MEMORANDUM (page 1)
'2. Open with Word to locate the Enclosure: text at bottom of page 1.
'3. Return the text
'based on Review test code
'Word Constants
'https://docs.microsoft.com/en-us/previous-versions/office/developer/office-2003/aa211923(v%3Doffice.11)
'Word VBA to find string and then obtain the entire sentence containing that string
'Look at Ron's post 01-12-2005 09:37 am as starting point for setting up the range
'and the find structure.
'Don't really need rest of code since not putting into Excel.
'http://www.vbaexpress.com/forum/archive/index.php/t-1322.html
'20190423
    Dim WordApp             As Object
    Dim worddoc             As Object
    Dim WordRange           As Object
    Dim sFileName           As String
    Dim bAppAlreadyOpen     As Boolean
    Dim stEnclosure         As String
    Dim found               As Integer
    
    bAppAlreadyOpen = True
    'Get an instance of word to work with
    'See if Word is already running
    On Error Resume Next
    Set WordApp = GetObject(, "Word.Application")
    If Err.Number <> 0 Then
        'Launch a new instance of Word
        Err.Clear
        On Error GoTo Error_Handler
        Set WordApp = CreateObject("Word.Application")
        bAppAlreadyOpen = False
    End If
    
    'Determine the Word doc filename without the path or extension
    sFileName = FileNameNoExt(sFile)
    'Open the document
    Set worddoc = WordApp.Documents.Open(sFile)
    Set WordRange = worddoc.Range
    worddoc.ActiveWindow.View.ReadingLayout = False
    WordApp.Application.Visible = True
    
    With WordApp
        'This gets all the text in the document and then function to extract phrase
        'stEnclosure = TrimAll(Mid(WordRange.Text, InStr(WordRange.Text, "Enclosure:") + 10))
        
        'This gets only the sentence containing the word to find.  Using this so less text
        'to go through. --Tested without function to verify picking what we are searching for--
        'Since not sure if can only grab text from the Found to the end of sentence, using the
        'Access mid function for that purpose.
        '20190424
        With WordRange.Find
            .Text = "Enclosure:"
            .Execute
            If .found Then
                WordRange.expand unit:=3 'wdSentence
                stEnclosure = TrimAll(Mid(WordRange.Text, InStr(WordRange.Text, "Enclosure:") + 10))
            End If
        End With
        EnclosureText = stEnclosure
    End With
     
Error_Handler_Exit:
    worddoc.Close False
    Set worddoc = Nothing
    If bAppAlreadyOpen = False Then WordApp.Quit
    Set WordApp = Nothing
    Exit Function

Error_Handler:
     MsgBox "The following error has occured" & vbCrLf & vbCrLf & _
               "Error Number: " & Err.Number & vbCrLf & _
               "Error Source: PrintWordDoc" & vbCrLf & _
               "Error Description: " & Err.Description _
               , vbOKOnly + vbCritical, "An Error has Occured!"
    Resume Error_Handler_Exit
End Function

This is the function to extract a specific page (requires PDFtk be installed)
Code:
Function ExtractEnclosure(FileName As String, Optional pageno As Integer = 1)
'Grab the enclosure text from 1st page or other page as specified
'20190425
    Dim stfile As String
    Dim stExtractPages As String
    Dim stFolderName As String
    Dim stfilename As String
    Dim stPDFName As String
    Dim stEnclosure As String
    Dim retVal As Variant
    Dim yesno
    
    If InStr(Environ("Path"), "pdftk") = 0 Then
        MsgBox "It appears that PDF tool kit is not installed on this computer.  " & _
                "Please refer to user guide for instructions on how to install it. " & _
                "This program is needed in order to merge pdf files.", vbExclamation + vbOKOnly, "Missing Program PDFtk"
        Exit Function
    End If
    stFolderName = FILEPATH(FileName)
    stfilename = FileNameNoExt(FileName)
    stPDFName = stFolderName & stfilename & "_Enc.pdf"
    stExtractPages = Chr(34) & FileName & Chr(34) & " cat " & pageno & " output " & Chr(34) & stPDFName & Chr(34)
    If Len(Nz(stExtractPages, "")) > 0 Then
        retVal = Shell("pdftk " & stExtractPages, vbMaximizedFocus)
        MsgBox "Please wait for pdftk 'BLACK SCREEN' to close and then press OK to continue." & vbCrLf & vbCrLf & "Note:Big files may take longer to merge.  Please be patient", vbOKOnly + vbExclamation, "Files Merged"
        DoEvents
    End If
    ExtractEnclosure = EnclosureText(stPDFName)
End Function

This function shows how to extract text from a table. As noted in the comments, when the pdf is converted to word, the table may look the same, but the alignment may be different. In the table I was working with the col positions didn't match up in the code the way they did visually, so you may need to test out different col or row numbers to obtain the data you seek. Also, in this code there is a regular expression to assist in extracting the specific data, so if you don't need that part, you can comment it out.
Code:
Function ExtractReviewCode(sFile As String)
'This isn't being actively used, only set up to see
'if it could be done...
'1. To save time Extract specific page: EOR Review (page 2)
'2. Open with Word to locate the Engineer Review Code(s) within the table.
'   Visually, appears to be row 5 col 2, but there must be a hidden col as
'   location is row 5 col 3
'3. Gather cell data from the other 2 Engineer Review Codes on the page
'   in case they differ and a decision needs to be made as to which Code to use.
'Learned can convert pdf to word simply by opening the pdf in word from:
'https://social.msdn.microsoft.com/Forums/office/en-US/2d3a5cd3-38c3-4811-b3df-4f60848aec54/can-i-use-vba-to-convert-pdf-files-to-word-docs-or-text-files?forum=worddev
'Open word and the file, since no need to save
'or convert the file after it is opened, close the file without saving.  Only
'issue is that not sure how to by-pass the prompt to convert to Word.  Seems this
'popup doesn't always come to the front, so if not aware of it, may give the
'impression that the program is stuck/hanging/freezing when in fact is waiting
'for user to click on OK
'20190422
'The code in this link indicates a method to suppress the popup message
'https://social.msdn.microsoft.com/Forums/en-US/57f49b29-58cf-45e5-97cc-985fdd9a8c17/open-pdf-in-word-without-conversion-message?forum=isvvba
'On that page, search for:  Graham MayorMVP Sunday, March 13, 2016 9:55 AM (above that is the code)
'20190422
    Dim WordApp             As Object
    Dim worddoc             As Object
    Dim sFileName           As String
    Dim bAppAlreadyOpen     As Boolean
    Dim stReviewCode1       As String
    Dim stReviewCode2       As String
    Dim stReviewCode3       As String
    Dim stComments          As String
    
    bAppAlreadyOpen = True
    'Get an instance of word to work with
    'See if Word is already running
    On Error Resume Next
    Set WordApp = GetObject(, "Word.Application")
    If Err.Number <> 0 Then
        'Launch a new instance of Word
        Err.Clear
        On Error GoTo Error_Handler
        Set WordApp = CreateObject("Word.Application")
        bAppAlreadyOpen = False
    End If
    
    'Determine the Word doc filename without the path or extension
    sFileName = FileNameNoExt(sFile)
    'Open the document
    Set worddoc = WordApp.Documents.Open(sFile)
    worddoc.ActiveWindow.View.ReadingLayout = False
    WordApp.Application.Visible = True
    If worddoc.Tables.Count = 1 Then
        With worddoc.Tables(1)
            stReviewCode1 = .cell(5, 3).Range.Text
            stReviewCode2 = .cell(14, 3).Range.Text
            stReviewCode3 = .cell(23, 3).Range.Text
        End With
        SubmittalReviewCode = RegExData(stReviewCode1, "[a-zA-Z]+") & ", " & RegExData(stReviewCode2, "[a-zA-Z]+") & ", " & RegExData(stReviewCode3, "[a-zA-Z]+")
    End If
Error_Handler_Exit:
    worddoc.Close False
    Set worddoc = Nothing
    If bAppAlreadyOpen = False Then WordApp.Quit
    Set WordApp = Nothing
    Exit Function

Error_Handler:
     MsgBox "The following error has occured" & vbCrLf & vbCrLf & _
               "Error Number: " & Err.Number & vbCrLf & _
               "Error Source: PrintWordDoc" & vbCrLf & _
               "Error Description: " & Err.Description _
               , vbOKOnly + vbCritical, "An Error has Occured!"
    Resume Error_Handler_Exit
End Function


Other functions used by above
Code:
Function FileNameNoExt(strPath As String) As String
 'https://sqlaccxl.wordpress.com/2013/03/06/vba-function-to-extract-file-name-withwithout-file-extension-or-path-only/
    Dim strTemp As String
    strTemp = Mid$(strPath, InStrRev(strPath, "\") + 1)
    FileNameNoExt = Left$(strTemp, InStrRev(strTemp, ".") - 1)
End Function
 
'The following function returns the filename with the extension from the file's full path:
Function FileNameWithExt(strPath As String) As String
    FileNameWithExt = Mid$(strPath, InStrRev(strPath, "\") + 1)
End Function
 
'the following function will get the path only (i.e. the folder) from the file's full path:
Function FILEPATH(strPath As String) As String
    FILEPATH = Left$(strPath, InStrRev(strPath, "\"))
End Function

Public Function RegExData(txt As String, Optional pttrn As String) 'MyNumber(txt)
'https://www.experts-exchange.com/questions/26503365/Regular-Expression-to-extract-decimal-number-from-text-string.html
'Keeps only the pattern from the string
'20180823
    Dim regEx As Object, myMatch As Variant

    Set regEx = CreateObject("vbscript.regexp")
    On Error GoTo notFound
    If regEx.Test(txt) = True Then
        With regEx
            .IgnoreCase = True
            .Multiline = False
            .Global = True
            '.Pattern = "\d+\.?\d*"

            'Match 3 digits; a dash; 3 digits; and an optional non space character
            '[0-9]{3}-[0-9]{3}\S?
            'Input: 2018.02.20 Pay App 903-014 Feb 2018.pdf
            'Result:903-014

            'Input: 2018.02.20 Pay App 903-014A Feb 2018.pdf
            'Result:903-014A
            If Len(pttrn) = 0 Then
                .Pattern = "[0-9]{3}-[0-9]{3}[a-zA-Z]?"
            Else
                .Pattern = pttrn
            End If
        End With

    Set myMatch = regEx.Execute(txt)
        RegExData = myMatch(0)
    End If
    Exit Function
notFound:
    RegExData = "Not Found"
    
End Function
 

dcavaiani

Registered User.
Local time
Today, 14:05
Joined
May 26, 2014
Messages
385
SO far, nothing is looking that good. On the app I have working now, I copy and paste from a web page. I'm wondering if the way to go is to convert the pdf to a "same looking/ formatted" web page from which I can copy and paste as I do in the app I have now ?
 

CJ_London

Super Moderator
Staff member
Local time
Today, 20:05
Joined
Feb 19, 2013
Messages
16,606
depends on the volume but you might want to investigate OCR

MS does have a library that converts an image to text - see these links, but does depend on what version you are using, it moved from MODI (but you might be able to late binding) and is now part of OneNote

https://stackoverflow.com/questions/49480280/ocr-from-access-via-vba
https://support.office.com/en-gb/ar...-onenote-93a70a2f-ebcd-42dc-9f0b-19b09fd775b4
https://www.excelguru.ca/forums/sho...prt-scr-image-based-text-to-real-text-in-cell

so perhaps take a snapshot of the screen?

Have you considered just copy/pasting the bits you need or investigated scraping?

alternatively
I have a situation where I can only get PDF's, I run them through my scanner and select OCR and get a text file - I then parse the text file to get the data and make the odd correction (1's and 0's being interpreted a l,I or O)

From
 

sxschech

Registered User.
Local time
Today, 12:05
Joined
Mar 2, 2010
Messages
792
Another option if you have a very big budget, you could look at Monarch from Altair/Datawatch.
 

deletedT

Guest
Local time
Today, 20:05
Joined
Feb 2, 2019
Messages
1,218
SO far, nothing is looking that good. On the app I have working now, I copy and paste from a web page. I'm wondering if the way to go is to convert the pdf to a "same looking/ formatted" web page from which I can copy and paste as I do in the app I have now ?

I don't know if it helps or not. But it's what I'm doing now.

Our company receives the orders from an online application. We used to save the order page as pdf to have a copy, print the pdf for manufacturing department, and input a lot of items in our Receiption table in an Access dateabase. There are a lot of items to input. OrderNo, DrawingID, DrawingNo, DrawingName,Quantity, Delivery and a lot more. It was so frustrating and time consuming.

I was asked to find a way to make the process quicker.

  • Instead of printing the page, I drag and copy the whole online order page.
  • Paste the copied section into an Excel Sheet
  • Wrote a function in Excel to extract all the data we need and send it to the Receiption table in Access database.

So simply for each order we receive, three steps are done. Copy, Paste And click a button. and everything is done.
The good thing is that I setup a triple monitor environment for the person who normally does this job. She has Excel, Browser and Access on three monitors and doesn't need to Maximize/Minimize the windows.

If you try to be a little more specific about your problem or what is your current steps, maybe someone (or me) can give some advice that can help.
 
Last edited:

Users who are viewing this thread

Top Bottom