Extract PDF data to table

steve21nj

Registered User.
Local time
Yesterday, 18:57
Joined
Sep 11, 2012
Messages
260
Through searching, I have seen a lot of post regarding 3rd party downloads to extract pdf data to a table, has anyone developed coding or a sample database that doesn’t require 3rd software? I found a site that does the opposite, but wondering if it could be reversed engineered.

In the end, I want to import the pdf data to the table, then attach the pdf file automatically.

http://www.vbaexpress.com/forum/showthread.php?21503-Populate-PDF-Form-via-VBA

I have a pdf form that was developed in Adobe Live Cycle.
For sake of example, let’s say my form object names are:
TextField1
TextField2
DateField1
CheckBox1
CheckBox1[3]

Any thoughts?
 
I've been playing with some sample code but having a difficult time with the information I am trying to capture as listed above. My pdf is 2 page, with multiple fields.

Any recommendations?

http://www.eileenslounge.com/viewtopic.php?f=30&t=5907


Code:
[FONT=Courier New]Public Function ReadAcrobatDocument(strFileName As String) As String[/FONT]
[FONT=Courier New]'Note: A Reference to the Adobe Library must be set in Tools|References![/FONT]
[FONT=Courier New]Dim AcroApp As CAcroApp, AcroAVDoc As CAcroAVDoc, AcroPDDoc As CAcroPDDoc[/FONT]
[FONT=Courier New]Dim AcroHiliteList As CAcroHiliteList, AcroTextSelect As CAcroPDTextSelect[/FONT]
[FONT=Courier New]Dim PageNumber, PageContent, Content, i, j[/FONT]
[FONT=Courier New]Set AcroApp = CreateObject("AcroExch.App")[/FONT]
[FONT=Courier New]Set AcroAVDoc = CreateObject("AcroExch.AVDoc")[/FONT]
[FONT=Courier New]If AcroAVDoc.Open(strFileName, vbNull) <> True Then Exit Function[/FONT]
[FONT=Courier New]' The following While-Wend loop shouldn't be necessary but timing issues may occur.[/FONT]
[FONT=Courier New]While AcroAVDoc Is Nothing[/FONT]
[FONT=Courier New]Set AcroAVDoc = AcroApp.GetActiveDoc[/FONT]
[FONT=Courier New]Wend[/FONT]
[FONT=Courier New]Set AcroPDDoc = AcroAVDoc.GetPDDoc[/FONT]
[FONT=Courier New]For i = 0 To AcroPDDoc.GetNumPages - 1[/FONT]
[FONT=Courier New]Set PageNumber = AcroPDDoc.AcquirePage(i)[/FONT]
[FONT=Courier New]Set PageContent = CreateObject("AcroExch.HiliteList")[/FONT]
[FONT=Courier New]If PageContent.Add(0, 9000) <> True Then Exit Function[/FONT]
[FONT=Courier New]Set AcroTextSelect = PageNumber.CreatePageHilite(PageContent)[/FONT]
[FONT=Courier New]' The next line is needed to avoid errors with protected PDFs that can't be read[/FONT]
[FONT=Courier New]On Error Resume Next[/FONT]
[FONT=Courier New]For j = 0 To AcroTextSelect.GetNumText - 1[/FONT]
[FONT=Courier New]  Content = Content & AcroTextSelect.GetText(j)[/FONT]
[FONT=Courier New]Next j[/FONT]
[FONT=Courier New]Next i[/FONT]
[FONT=Courier New]ReadAcrobatDocument = Content[/FONT]
[FONT=Courier New]AcroAVDoc.Close True[/FONT]
[FONT=Courier New]AcroApp.Exit[/FONT]
[FONT=Courier New]Set AcroAVDoc = Nothing: Set AcroApp = Nothing[/FONT]
[FONT=Courier New]End Function[/FONT]

call the function

Code:
[FONT=Courier New]Sub Demo()[/FONT]
[FONT=Courier New]Dim strPDF As String, strTmp As String, i As Integer[/FONT]
[FONT=Courier New]' The next ten lines and the last line in this sub can help if[/FONT]
[FONT=Courier New]' you get "ActiveX component can't create object" errors even[/FONT]
[FONT=Courier New]' though a Reference to Acrobat is set in Tools|References.[/FONT]
[FONT=Courier New]Dim bTask As Boolean[/FONT]
[FONT=Courier New]bTask = True[/FONT]
[FONT=Courier New]If Tasks.Exists(Name:="Adobe Acrobat Professional") = False Then[/FONT]
[FONT=Courier New]bTask = False[/FONT]
[FONT=Courier New]Dim AdobePath As String, WshShell As Object[/FONT]
[FONT=Courier New]Set WshShell = CreateObject("Wscript.shell")[/FONT]
[FONT=Courier New]AdobePath = WshShell.RegRead("HKEY_CLASSES_ROOT\acrobat\shell\open\command\")[/FONT]
[FONT=Courier New]AdobePath = Trim(Left(AdobePath, InStr(AdobePath, "/") - 1))[/FONT]
[FONT=Courier New]Shell AdobePath, vbHide[/FONT]
[FONT=Courier New]End If[/FONT]
[FONT=Courier New]'Replace FilePath & Filename with the correct FilePath & Filename for the pdf file to be read.[/FONT]
[FONT=Courier New]strPDF = ReadAcrobatDocument("FilePath & Filename")[/FONT]
[FONT=Courier New]ActiveDocument.Range.InsertAfter strPDF[/FONT]
[FONT=Courier New]If bTask = False Then Tasks.Item("Adobe Acrobat Professional").Close[/FONT]
[FONT=Courier New]End Sub[/FONT]
 
:confused: bump
 
Still searching. Was also looking at converting the pdf to txt, but still unsure how to reference the structure or specs.

I created a sample database with the file information below if anyone can point me in the right direction.


PDF File Naming
PDF Name (Left) : PDF Object Name (Right)

Today’s Date : DateTimeField1
Name: TextField1
Number: TextField4
Issue: TextField7
What is the issue: TextField9
Restart: CheckBox1[1]
Unplug: CheckBox1[0]
Help Desk: CheckBox1[3]
Other: CheckBox1[4]

Access Table Naming
Column Name (Right) / PDF Object Name (Left)

TodaysDate : DateTimeField1
UserName: TextField1
PhoneNum: TextField4
Issue: TextField7
IssueDecription: TextField9
CheckFollowing: CheckBox1[1]
CheckFollowing: CheckBox1[0]
CheckFollowing: CheckBox1[3]
CheckFollowing: CheckBox1[4]
 

Attachments

I have a free app that from command line can digest a pdf and spit it out as text - I use it to read PDF-invoices and store the data in an access db.

I attach your filled form and the corresponding text file. The "converter" is called pdftotext.exe and is part of a freee XPDF suite - google it.
 

Attachments

Thank you for the files.

When running through the import txt steps, and since the text is all over, with only allowing for vertical line breaks, how would i get this to work properly.....? Right now the data would be placed in several records, rather than one.

But let me elaborate a bit, lets say the database will be used by 10 computer illiterate people, and they will have to do the pdf to txt convert on their own....without having to manually align the txt file per each file, is there a way to set up automatic horizontal breaks that will allow for an easy import?
 

Attachments

  • verticle table.PNG
    verticle table.PNG
    15.2 KB · Views: 351
  • table data.PNG
    table data.PNG
    47.4 KB · Views: 396
You need to read the text line by line, and using VBA string functions (google it) check what is in each line. The pdftotext.exe and the pdf file to be treated you need to specify in a string and run it in a shell (google VBA shell). Then you need to wait a little so that the shell can process the file and spit out the text file. All this can be entirely hidden from the user and ran from vba (with the exception of supplying the path of the pdf file to be treated).

Have fun.
 
So I've been working on the code for two different pdf forms.
One form was created in adobe pro. I can basically get the form to do what I want, which is fill the access field on the form and display a text box with that particular data.

The second form was built in adobe livecycle. I use the same exact code, with the same structure, and it will not work.
I keep getting a run-time error '424' object required error.

Has anyone experienced this issue with livecycle? While the thought of just building the new form in adobe pro might work, I am using this example for another project that already has livecycle forms in circulation. So redistributing a new form is out of the question.

Suggestions?


Code:
Private Sub Command2_Click()

Dim FileNm, gApp, avDoc, pdDoc, jso
Dim Text1 As String
FileNm = "c:\temp\test.pdf" 'File location
Set gApp = CreateObject("AcroExch.app")
Set avDoc = CreateObject("AcroExch.AVDoc")
If avDoc.Open(FileNm, "") Then
    Set pdDoc = avDoc.GetPDDoc()
    Set jso = pdDoc.GetJSObject
Text1 = jso.getField("TextField1").Value
Me.Text2 = jso.getField("TextField1").Value
MsgBox Text1
    pdDoc.Save PDSaveIncremental, FileNm 'Save changes to the PDF document
    pdDoc.Close
End If
 
'Close the PDF; the True parameter prevents the Save As dialog from showing
avDoc.Close (True)
'Some cleaning
Set gApp = Nothing
Set avDoc = Nothing
Set pdDoc = Nothing
Set jso = Nothing
End Sub
Private Sub Command3_Click()
 
Dim FileNm, gApp, avDoc, pdDoc, jso
Dim Text1 As String
FileNm = "c:\temp\sampleForm1.pdf" 'File location
Set gApp = CreateObject("AcroExch.app")
Set avDoc = CreateObject("AcroExch.AVDoc")
If avDoc.Open(FileNm, "") Then
    Set pdDoc = avDoc.GetPDDoc()
    Set jso = pdDoc.GetJSObject
Text1 = jso.getField("TextField1").Value
Me.Text4 = jso.getField("TextField1").Value
MsgBox Text1
    pdDoc.Save PDSaveIncremental, FileNm 'Save changes to the PDF document
    pdDoc.Close
End If
 
'Close the PDF; the True parameter prevents the Save As dialog from showing
avDoc.Close (True)
'Some cleaning
Set gApp = Nothing
Set avDoc = Nothing
Set pdDoc = Nothing
Set jso = Nothing
 
End Sub
 

Attachments

Can anyone provide any feedback with my adobe livecycle issue? Or maybe I am missing something easy.
 
It errors as soon as it tries to read the PDF. Its using the same structure as first PDF, but it errors on the pdf created with adobe live cycle

Code:
Text1 = jso.getField("TextField1").Value
 
Field names are arbitrary. Although the Adobe Pro uses it, the Live Cycle form probably uses different names and doesn't have a field called "TextField1".
 
Understandable on the names but I built my adobe pro form based on the live cycle names. So I'm 100% percent certain the name is correct.
 
youtube.com/watch?v=gMYdBfl61is

Extract data from PDF files on Mac including scanned PDF files
 

Users who are viewing this thread

Back
Top Bottom