Question: What's faster than Access w/ODBC?

anotherProgrammer

New member
Local time
Yesterday, 21:00
Joined
Aug 16, 2010
Messages
7
I need to manipulate large DB2 tables (1 to 10 million records). Example: join three tables, get about 50,000 records based on one or two criteria from each table, and put the result in a local table. Right now I am using ODBC through Access, but it is ridiculously slow. My attempts at using pass-through queries have been thwarted by the 65k character limit on a query.

I have also tried using SAS, but that doesn't seem much faster. Is there anything else I can try? I have been waiting literally 4 hours for my most recent query to run.

How do other people deal with large data sets?
 

boblarson

Smeghead
Local time
Yesterday, 20:00
Joined
Jan 12, 2001
Messages
32,059
Normally I would work with the database owners to create a stored procedure or two to pull a smaller set of data based on parameters passed in. I don't know DB2 well enough to say whether that exists there, but the key with large datasets is to let the DATABASE SERVER do most of the work and return only what you need.

With your current setup it is likely that a lot of data that isn't required is being drawn into Access, which then limits the data locally. Access can form some fairly good queries that the server can run to bring back limited data sets, but you have to leave out things that Access can only do on its own side - like VBA functions, domain aggregates, grouping (I believe), etc.
 

Banana

split with a cherry atop.
Local time
Yesterday, 20:00
Joined
Sep 1, 2005
Messages
6,318
The real limitation is hardware, not software; the difference in performance between one technology (e.g. ODBC) and another (e.g. OLE DB) is actually small and insignificant. But regardless of the technology, the speed across a network, and especially across the internet, is going to be much slower than writing to a local hard drive.

The problem is typically solved by working smarter and more efficiently.

For example, do you actually need 50,000 records, or can you do with less? Or can you do the processing entirely on the server side? The latter is very easy to miss, because Access tries to be smart and hand off as much processing as it can to the server, but if you force it to do something that cannot be translated to SQL, it has no choice but to pull down potentially the whole table (ten million records!) and apply the criteria locally. If the query is rewritten so it can be done entirely on the server, it will speed up significantly. Note that this applies to Access queries against linked tables - pass-through queries are a different story, since they do no local processing at all - but I'm not sure whether you were using Access queries with linked tables or a pass-through query.

Does that help? Maybe we can have a look at how you have the query written right now?
 

anotherProgrammer

New member
Local time
Yesterday, 21:00
Joined
Aug 16, 2010
Messages
7
Yep, good suggestions, but I am very limited: stringent corporate policy = no ability to control or access anything server side, and I don't even have access to speak to the IT folks who own the database. Also pretty crappy desktop hardware and Access 2002. And yep, I really do need all 50,000 records. Even after writing it completely as a pass-through query it is painfully slow. I was just hoping there was some super secret speedy way to do it, with free software you can install without administrator rights :)
 

boblarson

Smeghead
Local time
Yesterday, 20:00
Joined
Jan 12, 2001
Messages
32,059
Well, it seems like, unless you post the SQL for the query here so Banana or others can take a look and see whether it can be better optimized, you are destined for some long wait times.
 

Banana

split with a cherry atop.
Local time
Yesterday, 20:00
Joined
Sep 1, 2005
Messages
6,318
No, not really. You would need more ethernet cables and a load balancer to make that happen, unfortunately.

It totally sucks that you can't do much, and since you've already used a pass-through query, the only improvement left is to rewrite the SQL so it executes faster on DB/2. It's not uncommon for one SQL statement to run faster than another even though they return the same results (a good example is choosing a frustrated join instead of an existence check to find non-matches for MySQL & Access, whereas SQL Server actually prefers & optimizes existence checks better).

Another idea - if your 50,000 records are largely the same 50,000 records each time, could you rewrite the query to pull only the differences instead of reloading everything again? That of course assumes you have a good way of detecting changes, such as a rowversion or timestamp column in the table.
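
As a rough sketch of what I mean (LAST_CHANGED and the table/field names here are just hypothetical placeholders - substitute whatever change-tracking column your tables actually have):

Code:
-- pull only rows changed since the last download
SELECT t1.KeyField, t1.myField
FROM myTable1 t1
WHERE t1.LAST_CHANGED > TIMESTAMP('2010-08-15 00:00:00')
  AND t1.myField = 'B';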
 

anotherProgrammer

New member
Local time
Yesterday, 21:00
Joined
Aug 16, 2010
Messages
7
Not sure what you mean by frustrated join, but I don't actually have a JOIN statement; the join is just in the WHERE clause, like

Code:
SELECT
  myField
FROM
  myTable1,
  myTable2
WHERE myTable1.Field = myTable2.Field
  AND myTable1.myField = 'B'

Can't post actual SQL, I'm sure it would violate at least 57 corporate policies.

Another good idea, but I think this was built back in 1732, so there is no way to detect changes.
 

Banana

split with a cherry atop.
Local time
Yesterday, 20:00
Joined
Sep 1, 2005
Messages
6,318
My goodness gracious. Sell that system to the Smithsonian Institution! They'll flip at the opportunity to own a database system that predates Edgar Codd! ;)

I'm not that familiar with DB/2, but it seems you're using the older-style theta join rather than an ANSI join. I know that for Oracle the theta join was the only way to do it until quite a recent version (9i?).
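
For what it's worth, your query above would look something like this in ANSI join form (same result set; whether it actually runs faster is up to the DB/2 optimizer):

Code:
SELECT myTable1.myField
FROM myTable1
INNER JOIN myTable2
  ON myTable1.Field = myTable2.Field
WHERE myTable1.myField = 'B';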

A frustrated join is basically a way to get non-matches... where a one-side record exists but has no match in the many-side table:

Code:
SELECT a.*
FROM a
LEFT JOIN b ON a.ID = b.ID
WHERE b.ID IS NULL;

Existence check would be:
Code:
SELECT a.*
FROM a
WHERE EXISTS (
  SELECT NULL
  FROM b
  WHERE b.ID = a.ID
);

I'm sorry you can't share the specific SQL, but if your criteria are more complicated, it may be possible to generalize the problem and see what we can come up with. I'm also assuming there's an index on the columns used in the WHERE clause (and, in the case of an ANSI join, the ON clause(s) as well).

Hope that helps...
 

anotherProgrammer

New member
Local time
Yesterday, 21:00
Joined
Aug 16, 2010
Messages
7
OK, nope, I don't have to check for existence; these are all inner joins, and I have tried writing them that way too to see if it would be faster, but it's not (well, maybe in geologic time). The criteria are all really simple, like field 1 IN ('A','B','C') and field 2 IN (1,2,3), nothing the slightest bit fancy. Also, no indices :)
 

Banana

split with a cherry atop.
Local time
Yesterday, 20:00
Joined
Sep 1, 2005
Messages
6,318
Well, without indices, that's probably what's killing you, since DB/2 is forced to scan the whole table.

You're actually the 2nd person to mention this scenario where IT won't index a field even though it's slow. WTF.

You have my fullest sympathy.


Out of interest, since you're using IN(), does it make any difference if you omit IN(), do a simple equality against a single value, and execute the query 3 times?
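
In other words, something roughly like this (names are just placeholders from your earlier example), run once per value and the results appended/unioned locally:

Code:
-- one pass; repeat with 'B', then 'C', and combine the results locally
SELECT myTable1.myField
FROM myTable1, myTable2
WHERE myTable1.Field = myTable2.Field
  AND myTable1.myField = 'A';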
 

anotherProgrammer

New member
Local time
Yesterday, 21:00
Joined
Aug 16, 2010
Messages
7
Thanks for the sympathy. I may try breaking it down as you suggest - the thought had crossed my mind but it seemed like too much trouble (though perhaps not more trouble than waiting 4 hours?)
 

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Yesterday, 22:00
Joined
Feb 28, 2001
Messages
27,140
If you have no indexes on the server side, you are already whacked. There is your time-killer. You are doing what is called a relation scan (a full table scan) in some circles. A non-keyed scan will do the kind of match-ups you are describing in something like n*(n-1) time. You need to cut down those numbers.

I have a crazy, butt-ugly idea, and the worst that could happen is that you would tell me you don't have enough resources for this, either.

Take the significant portions of the DB2 tables and query them back to your workstation one table at a time, via make-table queries, I guess. Then assign indexes to the fields you are going to be joining.

You want to pull down enough data from the DB2 tables so that at some point you will be able to build a query that says "take record with key X from table A and merge that with record with key Y from table B..." etc.

Even if you can't build an index for DB2, Ac2002 CAN build an index. If you can easily determine the primary keys of the records, then you can do the narrowing of the field on your machine locally. THEN go back and pull in only what you need from DB2. If you can do that, it might be workable.

I say this MIGHT be workable because, comparing that n*(n-1) time frame against indexed lookups, Access 2002 with indexes might be faster than DB2 on a bigger machine without indexes. No, I'm not kidding.

I would of course make a Macro to "stage" the pull-down-from-DB2 queries, the "merge this mess to get the match-ups I want" queries, and the "pull down only the relevant records from DB2" queries in the proper sequence. This isn't something you would want to do by hand.
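
Roughly what the "pull down and index locally" step might look like in Access SQL (table and field names here are only placeholders for whatever your linked tables are actually called):

Code:
-- pull a trimmed copy of the linked DB2 table into a local table
SELECT KeyField, Criterion1
INTO tblLocalA
FROM LinkedTableA
WHERE Criterion1 IN ('A','B','C');

-- then index the join field locally so Access can match efficiently
CREATE INDEX idxKeyField ON tblLocalA (KeyField);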
 

gemma-the-husky

Super Moderator
Staff member
Local time
Today, 04:00
Joined
Sep 12, 2006
Messages
15,634
I think maybe The_Doc_Man has a way forward. What data are you trying to extract that needs two tables joined on non-indexed fields? How many records do you expect to end up with?

Maybe your IT has a point, and you DON'T really need this over the whole population.

Try it the way he suggested: do a make-table query to select the desired subset of each table in Access, and then use Access to join the new tables.
 

anotherProgrammer

New member
Local time
Yesterday, 21:00
Joined
Aug 16, 2010
Messages
7
I appreciate all the suggestions that have been made so far.

I have actually tried pulling the whole tables, because I agree that the lack of indexing is what kills it. However, whenever I try to download the whole table, I get a message "system resources exceeded". Even if I apply indexes as I am linking the table, and then perform the query on the linked tables (instead of as a pass through), I get the same message. It is just too much data for my little desktop. And yes, I really do need all 50,000 records - the deliverable in this case was a positional text file of all the records.

On a previous suggestion, performing the query in sections (i.e., first pull criterion A, then do another one for criterion B, etc., then union) did seem to be faster than running the full query at once, so that will probably be my technique for the time being.

I have also tried querying one table at a time. The problem is, it doesn't work: I query one table to get the keys to the second table, then I try to run the second query joining on the local table holding the keys from the first query (which is now indexed), but that query no longer runs as a pass-through (because of the local table) and I get the "system resources exceeded" message. If, instead of joining on the local table, I build a string with all of the keys in it so the second query can still run as a pass-through, I hit the character limit for a query (65k characters).

Just trying to explain that I really have tried all of the typical techniques for this type of situation - I am just really stumped.
 

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Yesterday, 22:00
Joined
Feb 28, 2001
Messages
27,140
"System resources exceeded" tells me that something is not being closed correctly, I'm guessing file handles - which implies the connection strings to your ODBC tables because they are files as well as other types of things at the same time.

You perhaps need to pull down the tables raw, then apply the indexes after the fact. Do not apply the indexes on the fly, because you will "thrash" your memory with that many records.

My thought here is that if you keep on running into roadblocks, there is another way to get around them. How badly does your boss want this information (and how quickly)? Get your boss behind you and have HIM slam your IT guys for not cooperating. Having lived and worked in a political environment for over 20 years, I have learned that sometimes the way to go forward is to get a higher-ranking person on board with the problem. Get that person to work the politics while you work the techie stuff.
 

Banana

split with a cherry atop.
Local time
Yesterday, 20:00
Joined
Sep 1, 2005
Messages
6,318
FWIW - it was mentioned earlier that there are 10 million records in all. I don't know how wide the tables are, but if the download exceeds 2 GB (the Access file size limit), then a local download may not even be an option.
 

anotherProgrammer

New member
Local time
Yesterday, 21:00
Joined
Aug 16, 2010
Messages
7
Again, thanks for the suggestions and encouragement.

The "system resources exceeded" error happens regardless of how I download it (raw or not) - I think it is just too much data, period.

Basically, management does not have a problem with the amount of time it takes to run these requests, so I'm not under any pressure to make it faster. And it's really not an issue of the IT people not cooperating - there is simply no way to even contact them. It's just that I personally like to find the most efficient way of doing things.

Since I seem to have exhausted my options RE Access, and I haven't worked with SAS much before, I am thinking that I need to invest some time in developing those skills - perhaps I can make more progress in that direction.
 

Banana

split with a cherry atop.
Local time
Yesterday, 20:00
Joined
Sep 1, 2005
Messages
6,318
If I may, can I ask why there's no way to contact IT? That seems an odd arrangement. Surely someone has to at least talk with them once in a blue moon? What happens if the four horsemen of the apocalypse descend and lay the infrastructure to waste? They just don't show up?

As for using a different technology - if everything you've said to date is true, and DB/2 has no indexes at all, then it will be basically the same story regardless of what technology you use. DB/2 has to be configured to index those tables to work efficiently. There's no way around that.

Just as another perspective - I did have a project where we used an Oracle server that rotated through a large dataset, truncating & reloading every once in a while. One department needed to keep a history of past datasets, so they purchased a SQL Server to download the data from Oracle before it got truncated & reloaded, so they could perform analysis. I suppose you could look at using SQL Server Express (if you get 2008 R2, it allows you to go as big as 10 GB), or MySQL / PostgreSQL for unlimited size and no processing cap.
 
