Solved: Scraping a table from a website

John,
Just for clarity: the status at the moment is that you and he have a strategy that is progressing and will get what you need.
Is your client who needs this data a "customer" of whoever runs the website?
I have been in situations where all sorts of manipulation/transformation/"hand-raulic" mechanisms were employed to get at some data. But in reality, asking the powers that be for data in a specific (usable) format got immediate action and streamlined both "extraction" and communications. Just sayin'.
 
Morning Jack,

*deep breath*

Full disclosure - The "customer" is my wife's company, Performance Food Groups (PFG). If you buy food in North America, or pretty much the developed world for that matter, PFG was most likely involved with the distribution process.

She works in the Human Resources department - more specifically, the information systems branch of HR, or HRIS. If an employee or regional HR representative has an issue with one of PFG's websites, she and her team are the ones you call.

PFG hired a consulting company, UKG, to develop a site that handles their complaints/trouble calls - what they refer to as Case Management. UKG built it, PFG bought it, and now it is theirs. Any modifications, such as an API or a better export process, would require a separate contract and more $$$.

Skip ahead a few months, and the boss of my wife's boss wanted metrics: How many cases have/have not been resolved? Where are the majority of the cases originating? What ARE the issues? How much time is resolution taking? Etc.

"No problem!" they said, "We have this nifty export feature - we'll just do an export, get it into Excel and Bob's your uncle!" Well, Uncle Bob's export sucks. I made a post about it few months ago and I wont even get into how bad it is right now. Her boss set about employing her Excel mastery to produce something of value.

Several weeks later, she threw in the towel and asked my wife to engage. Missy (wife) had already been hacking away at it and had asked me to take a look. I did, and after about two weeks of cussing and discussing, I was able to give her about a 90% solution that required human interaction for the other 10%. I suppose I could have eventually worked it out, but I was tired of dealing with it, and my db is so bad I won't share it here because of the amount of chastising I would deservedly receive.

Missy was able to get her people the reports they needed within a week of doing an export, and that was good enough... until it wasn't. Her upper management was not in love with the reports (no charts, line graphs, etc.) and insisted on bringing in the Power BI team to generate reports worthy of their status. She told them that would be OK once they figured out how to get clean data. When asked to explain, she had me join a Teams meeting with the elite so I could explain the issues and the processes I use to clean the data. I again tried to impress upon them the need for an API and/or access to the server - after their eyes glazed over, I was excused and they adjourned.

They reconvened, told Missy that an API/direct access was a no-go, and that they wanted me to sit down with their Power BI team and explain the process so they could emulate it. She explained that even if I did that, the remaining clean-up would still take a dedicated team hours to accomplish, depending on the number of dirty records. They asked her if I could do something about the remaining 10% and said they would be willing to compensate me for my time.

*whew*

Which brings us back to the beginning of this thread...
 
Quite a saga. I've also seen many managers/bosses who have zero knowledge of the value of the information they are responsible for - nor any understanding of data management. I once worked in an area where one of the intermediate bosses came to me with, "I hear you know something about Access. Can you give me a little help?"
"Sure," I said. "What's the project/subject matter?" He said, "I'm going to build a database to monitor quality assurance on several of our websites." "Why?" I asked. "There is a whole team dedicated to QA of those sites that reports to you."
"Really," he said, "I didn't know exactly what they did." I also told him that QA reports were available and who the responsible manager was. His career in Access never really got started, at least not at the office. I think he did some personal stuff at home. I retired 16 years ago and he retired shortly after.

As an aside, I got a joke from a friend the other day that may fit into this post--when it comes to remembering and getting older.

"When I say, "The other day," I could be referring to any time between yesterday and 15 years ago."
 
"No problem!" they said, "We have this nifty export feature - we'll just do an export, get it into Excel and Bob's your uncle!" Well, Uncle Bob's export sucks. I made a post about it few months ago and I wont even get into how bad it is right now. Her boss set about employing her Excel mastery to produce something of value.

They reconvened, told Missy that an API/direct access was a no-go, and that they wanted me to sit down with their Power BI team and explain the process so they could emulate it. She explained that even if I did that, the remaining clean-up would still take a dedicated team hours to accomplish, depending on the number of dirty records.

First, I won't be able to provide any programming assistance. Sorry about that. :(
But, based on the limited information in the two paragraphs above, it would seem that a two-step approach may be a "solution," if you aren't already doing it. This also assumes that I understand what you are attempting to accomplish.

First step: scrape the data as displayed. Then import it into an Excel file (or simply a text file) on your local PC without any modification.
Second step: review the data in the Excel spreadsheet (or text file) to examine how best to identify and filter out the "dirty records" through automation. This may require developing a program to accomplish that task. A potential problem: can "dirty records" be identified automatically, without human inspection?
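To illustrate that second step, here is a minimal rule-based sketch of automatic dirty-record detection, written in Python rather than anything PFG actually uses. The field names (`case_id`, `summary`) and the rules themselves are hypothetical; the real checks would depend entirely on what the export contains.

```python
import re

def is_dirty(record):
    """Flag records that fail basic sanity checks.

    These rules are illustrative only: a real validator would be
    built after inspecting the actual exported data.
    """
    # A case must carry a non-empty ID in the expected pattern.
    if not re.fullmatch(r"CASE-\d{6}", record.get("case_id", "")):
        return True
    # Free-text fields with embedded line breaks often shift every
    # following column in a flat export, so treat them as suspect.
    if any("\n" in str(v) for v in record.values()):
        return True
    return False

records = [
    {"case_id": "CASE-000123", "summary": "Password reset"},
    {"case_id": "", "summary": "Broken row"},
    {"case_id": "CASE-000124", "summary": "Line\nbreak moved columns"},
]
clean = [r for r in records if not is_dirty(r)]
dirty = [r for r in records if is_dirty(r)]
```

Rules like these answer the "can it be done without human inspection?" question only partially: anything the rules can't classify still needs a person.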
 
Good morning folks. I am enthusiastically marking this thread "Solved".

First and foremost, a HUGE shout-out to @Edgar_. I remember when he joined AWF back in July; it seemed he came out of nowhere but immediately made an impact, and I knew that AWF had gained another "Heavyweight".

When I started this thread, he responded via PM with a single sentence: "Hey there, John, I saw you need a hand, I could take a look..." He wasn't the only one, but since he was first, and he had helped me in the past with this same issue, I responded and we set to work the next morning.

What he did was simply amazing. I will hit the high-spots and even share the db with an example of the JSON returned from the site if anyone is interested.

As I said earlier, the site actually uses a REST API to interface with the server that holds the raw data. He was able to see this in the HTML, and once he gained access, he made the following observations:

1. The site allows you to "pull" 100 records in JSON format at a time.
2. You only have an hour, once your token has been activated, to get your business done.

Once the JSON is retrieved, he used a JSON converter (VBA-JSON v2.3.1, (c) Tim Hall - https://github.com/VBA-tools/VBA-JSON) to parse the data into a table. As of this morning, with 22,183 records, it took 1 hour and 18 minutes to fetch and process this data. Fortunately, we anticipated this early on, and to get around the one-hour limit, we decided to break it up into two sections: Fetch and Process.
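The paging loop behind the fetch can be sketched like this. I'm using Python purely for illustration (the real solution is VBA in Access); `fetch_page(offset, limit)` stands in for the authenticated HTTP call, and the parameter names and the "short page means done" convention are assumptions, not the site's documented behavior.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every record by paging in fixed-size chunks of 100."""
    records, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        records.extend(page)
        if len(page) < page_size:  # short (or empty) page: no more data
            break
        offset += page_size
    return records

# Demo against a fake 250-record source (three requests: 100+100+50).
data = list(range(250))
fetched = fetch_all(lambda offset, limit: data[offset:offset + limit])
```

The same loop shape works regardless of language; the only site-specific parts are how the request is authorized and how "no more records" is signaled.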

The fetch routine goes and gets the JSON and stores it in a table with only two fields, an AutoNumber (PK) and a Long Text field - for the record, the average length of each JSON field is about 350k characters. The whole routine takes about 30 minutes. The second process then uses the JSON converter and populates the appropriate table.
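A rough illustration of that two-phase split, again in Python rather than the actual VBA. The `items`/`id`/`status` field names are made up, and a plain list stands in for the two-column staging table; the point is only that the token-limited session does nothing but store raw text, and parsing happens later with no clock running.

```python
import json

# Phase 1 (Fetch): store each page's raw JSON text untouched, so the
# hour-limited session spends its time only on network work.
staging = []  # stands in for the (AutoNumber PK, Long Text) table
raw_page = json.dumps({"items": [{"id": 1, "status": "Open"},
                                 {"id": 2, "status": "Closed"}]})
staging.append(raw_page)

# Phase 2 (Process): parse the stored text at leisure - no token needed.
rows = []
for text in staging:
    for item in json.loads(text)["items"]:
        rows.append((item["id"], item["status"]))
```

In the Access version, Phase 2 is where the VBA-JSON converter walks each stored blob and appends records to the real table.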

We know that eventually the fetch process will exceed the one-hour limit, and he has offered to build in the ability to assign a new token (how do folks know how to do this?!?). I may take him up on that, but there are other, more efficient ways to deal with the data that comes over - e.g., there is no need to fetch archived records (17k at present) each and every time.

Missy (wife) had a Teams meeting with her bosses who had brought in a couple of IT folks to help us get clean data. It didn't last long since Edgar had already cracked that nut. After the meeting, her bosses' boss responded with a word: "FANTASTIC!"

Fantastic is an understatement. While we were working on this, I did my best to assist. Edgar was patient and painstakingly explained the JSON process to me while I muddled through it. Quite frankly, he would have had it done a lot sooner had I not "helped". Even now he hits me up on WhatsApp with excited texts about process improvements.

@Edgar_ you are a Rock Star and as promised, what they "pay" me will be passed along and you have my "mark" in perpetuity.

Thank you...
 
I have been able to see table information from a site that requires a password. I can inspect the table and see the columns and the data...but no matter what I try, I cannot extract the data.

Several methods I have found using Dr. Google have failed and about 4 hours of ChatGPT have gotten me a bunch of code that does not work. It even gave me code to loop through the elements to find ANY table and it returns a "No Tables Found" message - by design of course.

I KNOW there is a table but something is keeping me from getting there...

The good news is that the company has agreed to compensate me for my time - which I will pass along to whoever wants to get me over this hump. PM me if you are interested.

One thing to try when all else fails: the simple "Get Data" from a web page in Excel.
If you are lucky and this is one of the very rare times that it actually works, you can then use VBA to automate the refreshing of it.

I did this once when I wanted to be alerted about Craigslist ads. I did "Get Data" > "From Web" in Excel manually, then set VBA to automate it.
If the refreshed data ever exceeded X number of rows, I'd know there was a new Craigslist ad for the search terms.
 
One thing to try when all else fails: the simple "Get Data" from a web page in Excel.
If you are lucky and this is one of the very rare times that it actually works, you can then use VBA to automate the refreshing of it.
Tried that as well - the issue was that it only pulled the first 25 records, for the same reason(s) mentioned in post #18. This project was a lot of fun and I have learned a lot. Edgar has even offered to show me how to find these things I mentioned once you get "under the hood".
 
Good stuff John. You should be in line for a raise or promotion!
Glad Edgar was able to solve the issue.
I'm sure it would be a great addition to Sample Databases or similar.
I'd like to see it and I'm just a dabbler.
Congrats Edgar(y) and John(y)
 
UPDATE:

So here we are over 3 months later and the "professionals" have yet to duplicate what has already been done.

Once the proof of concept was showcased, the marching orders to the company's IT department were to duplicate this with Power BI. I was asked to sit in on the conference since I was the SME (if these folks only knew...). After a bit of back and forth, I suggested they simply use the Access app that works perfectly, and then use whatever Power platform they want to generate the reports.

That got shot down with an explanation of how they wanted a one-stop shop, "blah, blah, blah." What was not said is, "But, it's Access!" - which is the real reason. Fine, great; the hard work has been done, so a simple reverse-engineering job and you should be able to do it.

My wife gets tapped at least twice a month to provide reports with the stop-gap, inferior tool until they can make the "right" tool work. This morning, there was an email from her IT department:

Hi Melissa

While we are trying to generate the access_token using the Post method, we are getting this error. Can you please help with this?


[attachment: image003.png]


Thanks

Rajan

Keep in mind, my wife is an HR specialist with NO formal education in Computer Science and NO IT training. The experts have been hacking away for months trying to do what Edgar was able to achieve in 2 weeks.

I am at a loss for words.

The rest of that email thread has prior emails where they asked her to provide the Application id, Application secret, and Client UUID. She asked me what these were, and I did not really know, but I assumed they had something to do with requesting data from the REST API. Since the method developed with Access required a successful login and then a bookmarklet to capture the token, this info was not needed.
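For what it's worth, those three values are the standard ingredients of an OAuth2-style client-credentials token request, which is likely what their "Post method" attempt was. Here is a hedged Python sketch of what such a request looks like; the URL, parameter names, and placeholder values are all assumptions on my part, not PFG's or UKG's actual setup, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request  # urlopen(req) would send it

# Hypothetical endpoint and credentials - the real values come from
# the vendor's identity provider, which I don't have access to.
token_url = "https://example.com/oauth/token"
payload = {
    "grant_type": "client_credentials",
    "client_id": "<Application id>",
    "client_secret": "<Application secret>",
    "scope": "<Client UUID or tenant scope>",
}
req = Request(
    token_url,
    data=urlencode(payload).encode(),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)
# On success, the JSON response typically contains an "access_token"
# that is then sent as "Authorization: Bearer <token>" on data calls.
```

The bookmarklet approach sidestepped all of this by reusing the token the site itself obtains after a normal login.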

As fate would have it, Daniel Pineault posted a video yesterday that demonstrated what this info is and how to incorporate it, using Gmail. He made this video and article in an attempt to get ahead of the Outlook train wreck that is coming. Funny how the universe will provide you with what you need if you are receptive to it...

Not sure if I will throw them a life-line or just enjoy the show...
 
Since the method developed with Access required a successful login and then using a Bookmarklet to capture the Token, this info was not needed.
I think what we're experiencing with PFG is a good example of the Golden Hammer analogy: they want to do everything with a tool that might be limited for the task, or might introduce additional hassle. I've used Power Query - another Power Whatever tool - for HTTP, and while I do get some results, they're hard to customize and automate. So I'm not sure they can do everything they need with Power BI, but when a tool falls short, mediator/bridge patterns can often solve the problem.

Take the bookmarklet step as an example. Since I did not know all the details of PFG's IT department setup and solution scope, I did not add the authentication logic directly into the Access method, because it would have taken further testing and investigation when we could provide a quick and easy solution in exchange for an extra step. Can Access do it all within itself? Yes, totally. Maybe Daniel demonstrates some of that in his video. I haven't watched it, but I did read his post, and I saw he was attempting some authentication using Google's API. In this case, we don't have a public API, but we do have an online form, so we can monitor what it does, see what it requests and what it receives, implement it with Access, and then parse it.

Seeing their struggles with incorporating the authentication logic using Power BI, perhaps they should explore external options. That begs the question: is Access the optimal external tool for this task? Of course not, but they already have it there. Choosing the right tool often involves weighing learning a new one against leveraging what we already have and know. That might be their predicament. That, or maybe their stubbornness and IT pride. So... I guess the decision of whether to help depends on the business side of it, but they should acknowledge there are times when we can't do everything in one place, given the golden constraints of time, cost, and quality.
 
