Arrays vs Recordsets

I am extracting data from text files. Ultimately the output will be provided to and operated from Excel so I am doing the whole thing in code using ADO connections and recordsets.

I initially did the text line selection using information from hard coded arrays of strings. When I changed to retrieving this information from a database I wrote a function to return an array which plugged straight into my original parsing structure.

Continuing in the same fashion, for the output recordset structure I used a function to return a two dimensional array of field properties which was then used to construct the recordset. The same array holds the parsing information used to define the start position and width of the data for each field.

All works well but I realised I could achieve the same by just using the recordsets rather than arrays and changing the code to step through them.

One thing I do like about the two dimensional array holding the parsing information is the ability to use a single line making explicit reference to a particular array row using the same counter as the destination field index rather than moving the Field Information recordset independently.

Code:
For f = 0 To fu  ' fu = Ubound of the field information array
    FieldName = rsDataOut.Fields(f).Name
    rsDataOut.AddNew FieldName, Mid(InputString, FieldArray(f, 3), FieldArray(f, 4))
Next
Rather than:
Code:
For f = 0 To reccount - 1 ' reccount = rsFieldInfo.Recordcount
    FieldName = rsDataOut.Fields(f).Name
    rsDataOut.AddNew FieldName, Mid(InputString, rsFieldInfo.Fields(3), rsFieldInfo.Fields(4))
    rsFieldInfo.MoveNext
Next

I would be interested in any perspectives about advantages using an array or a recordset in this kind of situation.

Clearly the array must be first created using a MoveNext loop so it is double processing where the array is used only once. But what about where the array was then applied to parsing thousands of lines. Could accessing values from an array be faster than repeatedly moving through a recordset?
 
Very interesting question, Galaxiom.


While I would imagine that arrays are faster than recordsets simply by the virtue that they're lightweight, they're also somewhat inflexible.

Areas where recordsets win:

1) Adding or deleting rows is easy to do. With arrays, we have to either pre-allocate or re-dimension, either of which could be expensive and wasteful. The recordset takes care of allocating and de-allocating when we issue .AddNew and .Delete, and thus the memory footprint will always be sized to fit.

2) Sorting, filtering and finding are possible against a recordset. Not so easy against an array, unless you made it a part of its structure (e.g. making one column a "key" to the rest of the "row" as Galaxiom showed in the example). There's a small sketch of this after point 4 below.

3) Recordsets have native methods for transforming their contents into arrays, streams, strings or XML (note: ADO is a bit more flexible than DAO in that regard). So we can quickly build an array:

Code:
MyArray = MyRecordset.GetRows()

Arrays are not as easy to transform into other forms; it depends on whether the other form has a native method to accept an array and transform it into the object.

4) A recordset can do a For Each with a specifically typed object:

Code:
Dim fld As DAO.Field

For Each fld In rs.Fields
   ...
Next

This is not legal:

Code:
Dim s(100) As String
Dim t As String

For Each t in s
   ...
Next

The compiler will complain that the control variable for a For Each...Next loop over an array must be a Variant. Why a Variant? This means more implicit conversion and ergo unnecessary slowdown. For that reason, For...Next loops may be preferable over For Each...Next loops when working with arrays.
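To make point 2 above concrete, here's a rough sketch against an ADO recordset - the field names are made up and this is aircode:

Code:
' Assumes rs is a client-side (or fabricated) ADODB.Recordset
' with hypothetical fields named FieldName and FieldWidth.
rs.Filter = "FieldWidth > 10"          ' show only the matching rows
Debug.Print rs.RecordCount             ' count of the filtered set
rs.Filter = adFilterNone               ' back to the full set

rs.MoveFirst
rs.Find "FieldName = 'Amount'"         ' jump to the first match
If Not rs.EOF Then Debug.Print rs.Fields("FieldWidth").Value

Try the equivalent against a plain array and you quickly end up writing your own loops.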


Now, at this point, I want to note that when I work with a good number of the same kind of thing, and when I want some functionality (e.g. points 1 and 2) but don't want, don't need or can't use a recordset, I turn to VBA.Collection. Like recordsets, Collections do not require either pre-allocating or re-dimensioning and thus can hold any number of items I want, and they come with keys so it's possible to do a simple "find" to get to an item. The only thing about a Collection is that it stores everything as a Variant - there is no checking that a new item being added is actually of the same type as the other items in the collection. That could be a good thing or a bad thing, but in my book it's more bad than good. In .NET, there is a concept of "generics" which allows a data type to be specified for a collection at design time and thus have the strong-typing checks occur at compile time. VBA doesn't have that, but even so, I sometimes find Collections serve my needs nicely as a middle ground.
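A tiny illustration of using keys with a Collection (the keys and values here are made up):

Code:
Dim col As VBA.Collection
Set col = New VBA.Collection

col.Add Item:=25, Key:="Amount"        ' keys must be strings
col.Add Item:=40, Key:="Description"

Debug.Print col("Amount")              ' "find" by key: prints 25
col.Remove "Description"               ' delete by key - no re-dimensioning
Debug.Print col.Count                  ' 1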

Last point. One thing I find troublesome about DAO recordsets is that there is no guarantee that they won't require disk I/O (usually a bottleneck in the grand scheme of things). While they are meant to be mainly in memory, we have no way to prevent them from being written to disk. With ADO, it is possible to create a recordset entirely in memory and thus do everything without worrying about disk I/O (though even that is not guaranteed, because we don't know when the OS will swap the content into a swap file as part of virtual memory management, but generally that's OK). Thus in some cases I turn to creating an ADO recordset where I need a cheap way to filter, sort or whatnot.

In the end, everything has its use, and my inclination is to use the cheapest object possible that will get the job done. If arrays can do it, then it's arrays. If not, look at collections, then recordsets, in that order.

HTH.
 
Insightful as always Banana. It seems I wandered onto the right track using the arrays for this particular task.

MyArray = MyRecordset.GetRows()
That one line was worth asking the question on its own.:) I had done it with a loop.
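For anyone else picking this up: the array GetRows hands back comes out zero-based and dimensioned (field, row) - at least it did here - so for example:

Code:
Dim MyArray As Variant
MyArray = rsFieldInfo.GetRows()

' First subscript = field, second subscript = row
Debug.Print MyArray(3, 0)              ' field 3 of the first record
Debug.Print UBound(MyArray, 2) + 1     ' number of rows returned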

With ADO, it is possible to create a recordset entirely in memory and thus do everything without worrying about disk I/O
Is this a behaviour that needs to be controlled or just the nature of ADO?
 
Glad to help. I found that little tidbit long ago when I was looking at ways to export the recordsets.

I would say controlled, only because ADO by design leaves much of the implementation to the providers, and it may be the providers themselves that change the behavior (one could force disk I/O to write temp files, for instance).

However, in the case of a memory recordset, it's ADO that carries the whole weight, and I believe it stays in memory (as far as it can - it has no control over the OS's virtual memory management, which may write it to the swap file as needed).

To create an in-memory recordset is simple:

Code:
Dim a As ADODB.Recordset
Dim i As Long

Set a = New ADODB.Recordset

With a
  .Fields.Add ...
  .Fields.Add ...
  .Fields.Add ...
  .Open
  For i = 0 To 2
     .AddNew
     .Fields("...") = ...
     .Fields("...") = ...
     .Fields("...") = ...
     .Update
  Next

  .MoveFirst
  Do Until .EOF
    Debug.Print _
       .Fields(0) & " | " & _
       .Fields(1) & " | " & _
       .Fields(2)
    .MoveNext
  Loop
End With

a.Close

Set a = Nothing

(aircode)

HTH.
 
I just wondered whether a problem might occur with very large arrays.

Because a recordset is effectively disk based (ie pointing at a disk-based object) there are no size issues. I think that an array has to be resident in memory, and therefore is subject to a maximum size, isn't it?
 
As Banana mentions, the OS will write out to swap files if memory usage became a problem.
But the amount of memory in use depends on several things.
Generally speaking, recordsets are memory resident objects too. That's the whole idea. ;-)
Due to the amount of object wrapping (and API interaction) that goes on with them, they're heavier objects! (i.e. more memory for the same amount of data - less bang for your buck).
(FWIW I still prefer them by a long way - despite that).

The extent of this being true depends on the type of recordset of course.
In the case of the example we're talking about here - created recordsets - then there's no two ways about it, they're heavier. All the data is loaded in memory - everything you shove in there is resident in memory.
As for fetching data from disk, as you mention - well, if we're talking Dynaset / Keyset then there is scope for a leaner model than a particularly heavy array (i.e. wide dimensions all filled), since, by definition, the array must be holding all that data in memory, whereas the recordset will be reading the data pages from disk as required - holding only the PK index by default.
However, with a created recordset, or any Static equivalent, the recordset has it all loaded into memory too - and that extra weight is real.

As I say though - the advantages of recordsets are worth every iota of IO and every KB of memory IMO.

Cheers.
 
FWIW, I don't know this for fact, but I would be very surprised if recordsets (whether DAO or ADO) weren't actually a particular implementation of a linked list. For anyone unfamiliar with linked lists, a node is typically a two- or three-element structure containing the following:

[Optional] previousNode
pointerToRow
nextNode

All elements are pointers to other places in memory. So we don't actually know where a certain row resides in memory unless we read the pointer from the pointerToRow part, unlike an array, which typically stores the data contiguously so that locating a row is simply a matter of AbsolutePosition * SizeOfRow.

You may have heard that ForwardOnly recordsets are faster than other types - well, I'd bet that's because they don't contain the previousNode, which means one less pointer to maintain per row (I'm sure there's other optimization going on, but that's one reason).

Obviously, it takes more time to build a linked list than an array because we're also writing more stuff (e.g. the pointers to the rows and the pointers that actually link the nodes), and navigating a linked list is always slower than an array. BUT! Guess which will be faster when the operation is to re-sort the data in a different order? The linked list wins out because we only update the pointers as we navigate the nodes; the array would have to be rewritten to a new block of contiguous memory.

Because arrays cannot guarantee that the region of memory just beyond the upper bound is free for allocating, one cannot add more rows without re-dimensioning, which basically means copying the whole content into a new but bigger block. That is a very expensive operation. With the linked list, you only need to navigate to the node that has a null pointer for its nextNode element, allocate memory for a new row somewhere (whether it's contiguous or not is of no consequence), update the pointers for both the "last" node and the "new" node, and you're done.

Deleting rows presents a similar problem. One can't just delete a row in an array by de-allocating the memory for that region - that would make the array unstable (you can't be sure that the region won't get allocated to something else, and if an iteration touches the "unallocated" region we may have a problem). A safer way is to zero out the data without de-allocating it, but that presents its own problems - an array with holes like swiss cheese can be painful to work with. With a linked list, you go to that node, delete its elements, update the nextNode of the previous node and the previousNode of the next node, and that row can then be safely deallocated with much less effort.
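To make the node idea a bit more concrete, here's a deliberately crude sketch of a singly linked list in VBA - the "pointers" are just array indices, which is not how a real recordset is implemented, but it shows why re-ordering and deleting are cheap:

Code:
Sub LinkedListSketch()
    ' A singly linked list held in parallel arrays; -1 marks the end.
    Dim nodeValue(0 To 9) As String
    Dim nextNode(0 To 9) As Long
    Dim head As Long, cur As Long

    ' Build three nodes in non-contiguous slots: storage order
    ' need not match traversal order.
    nodeValue(4) = "first":  nextNode(4) = 7
    nodeValue(7) = "second": nextNode(7) = 2
    nodeValue(2) = "third":  nextNode(2) = -1
    head = 4

    ' Walk the list by following the pointers.
    cur = head
    Do While cur <> -1
        Debug.Print nodeValue(cur)
        cur = nextNode(cur)
    Loop

    ' "Deleting" the second node is just re-pointing - no data moves.
    nextNode(4) = 2
End Sub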

Obviously, a recordset object is much more than just a linked list (assuming my conjecture is correct that it's in fact a linked list as I don't know this for a fact) because we have several more methods and properties to manipulate the rows as whole. There's always going to be some kind of trade-off and looking only at the footprint tells only half of the story - one has to consider what functionality is being required to perform the task.

In VBA, we only have arrays and Collections, which is kind of unfortunate, because when one looks over at .NET one can choose from a greater range of collections such as Array, List, ArrayList, SortedList, Dictionary, Stack, Queue and several more, which also provide native methods for various operations. They can be either non-generic or generic. However, that doesn't mean they can't be replicated in VBA, so I suppose if there were a case where we needed something with a smaller footprint and only a certain subset of operations, we could build a class or reference a suitable library to make use of a different implementation of a collection.

HTH.



PS: Regarding recordsets being memory-resident objects - I agree with Leigh. I should clarify that DAO recordsets will be in memory just as well as ADO recordsets, the difference being that I find DAO to be a bit more eager to flush to disk (even if not saved to an actual table). A long time ago, I did some experimenting on how the file would grow as new DAO recordsets were allocated, and I _think_ I recall the file would grow once a DAO recordset got large enough.
 
Can you clarify then:

If a recordset is stored in memory, then can a recordset bigger than the available memory still be handled, ie - does it swap out/page automatically - or do you get an "out of memory" error?

And why would there be an option setting to limit the number of records shown in a recordset, if it was not reading data into buffers, rather than loading the entire recordset?

And why do you have to do a .MoveLast to count the items?

------------
By the same token - I don't think an array behaves in the same way - because every element of an array is vectored, and can be directly accessed - so does an array get swapped out in the same way - or is there an absolute maximum to the potential size of an array?

--------------
Finally, Banana's conjecture that a recordset is based on linked lists (surely correct - in fact the whole of Access, including [especially] indexes, is surely based on pointer chains) serves to illustrate what a shame it is that VBA doesn't expose a pointer type - which would be far more useful than a mere collection, to my way of thinking.
 
If a recordset is stored in memory, then can a recordset bigger than the available memory still be handled, ie - does it swap out/page automatically - or do you get an "out of memory" error?

I don't think it's easy to answer because it depends on several factors. I do think it's reasonable to assert that when you request a block of memory larger than what is currently available, the OS will respond by swapping out other parts of memory that have been left unused in order to satisfy the allocation request. OTOH, I already pointed out that the implementation itself may be eager to flush the content independently of the OS. We would need to test again, but I believe that if we created a DAO recordset large enough, parts of it would get flushed to the .mdb file, even though that may not be the "final answer".

And why would there be an option setting to limit the number of records shown in a recordset, if it was not reading data into buffers, rather than loading the entire recordset?

I'm not sure which one you are referring to. Are we talking about the maximum number of locks allowed? I imagine that has nothing to do with how many records we can load in memory but rather how many locks we can acquire, presumably as a measure to protect against massive slowdown due to large numbers of locks having to be managed.

And why do you have to do a .MoveLast to count the items?

Well, if we only had a simple linked list, we wouldn't know how many nodes there actually were until we had added all the nodes. Similarly, the chicken farmer doesn't know how many eggs his hens laid this fine morning until he goes out to the barn and counts them. But if he just counts them where they lie, he has no guarantee that some eggs won't get broken or that more eggs won't be laid after he counted. The only way to be sure is to actually collect the eggs, bring them over to the bench, count them, and consider any eggs laid after collecting to be the next day's batch.

Recall that it has been said that some recordsets implement lazy fetching, so we may only have a partial list of pointers. Only when we actually navigate to the last record will we have forced the recordset to go out and collect all the pointers that match the criteria. OTOH, snapshot recordsets are easy to count because they don't do any lazy fetching - they just download everything in one go and thus we have an accurate count.
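For what it's worth, the usual DAO pattern looks something like this (the table name is made up - aircode):

Code:
Sub CountRecordsSketch()
    Dim db As DAO.Database
    Dim rs As DAO.Recordset

    Set db = CurrentDb
    Set rs = db.OpenRecordset("SELECT * FROM tblSomething", dbOpenDynaset)

    If Not rs.EOF Then
        rs.MoveLast                 ' force all the keys to be fetched
        rs.MoveFirst
    End If
    Debug.Print rs.RecordCount      ' now accurate

    rs.Close
    Set rs = Nothing
    Set db = Nothing
End Sub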

By the same token - I don't think an array behaves in the same way - because every element of an array is vectored, and can be directly accessed - so does an array get swapped out in the same way - or is there an absolute maximum to the potential size of an array?

I'm sure there is an absolute maximum. I imagine that would be the size of addressable memory; in the case of 32-bit machines, that is 4 GB. Obviously we can't build an array exactly 4 GB big and put it in memory, because there is always a portion of memory dedicated to the OS, so such an array would always end up partly in the swap file, with everything being written to the hard drive as we process it. It would definitely be very slow, but I don't think we would run into 'Out of Memory' as was the case in the days before virtual memory.

But the only way to know for sure is to test this theory out, as this is just conjecture on my part and it needs to be tested. I also imagine it is documented somewhere.

Finally, Banana's conjecture that a recordset is based on linked lists (surely correct - in fact the whole of Access, including [especially] indexes, is surely based on pointer chains) serves to illustrate what a shame it is that VBA doesn't expose a pointer type - which would be far more useful than a mere collection, to my way of thinking.

Well, few points.

In VBA 7 (Office 2010) we now have a new data type - LongPtr, which is a true pointer-sized type: 32-bit on a 32-bit machine, 64-bit on a 64-bit machine - and we will be using it in API declarations wherever a type that scales with the bitness is needed.

Furthermore, we do have a means of obtaining a pointer - StrPtr, ObjPtr and VarPtr all return pointers to variables. We also have the AddressOf operator to return a function pointer. The only thing we can't do natively is set a pointer to point at something, though that can be done via APIs.
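A quick illustration of what those return - the variable names here are mine and it's aircode:

Code:
Sub PointerSketch()
    Dim n As Long
    Dim s As String
    Dim c As Collection

    s = "hello"
    Set c = New Collection

    Debug.Print VarPtr(n)       ' address of the Long variable
    Debug.Print StrPtr(s)       ' address of the string's character data
    Debug.Print ObjPtr(c)       ' address of the object instance

    #If VBA7 Then
        Dim p As LongPtr        ' pointer-sized: 32-bit or 64-bit as needed
        p = VarPtr(n)
        Debug.Print p
    #End If
End Sub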

Lastly, this is a deliberate design decision. The whole point of Visual Basic was to free the programmer from the burden of manual memory management. Thus, it's easier in VB than in C++ to write a stable program that won't crash every two minutes, or every time you click a button, because a pointer was not handled correctly, though it is undeniable that C++ allows for much more power by letting us work a bit closer to the metal. I think Microsoft has acknowledged this trade-off with the concept of unsafe blocks. C# has similar design goals - we don't work with pointers but rather with objects (which under the hood are actually pointers), but if we need pointers we can create an unsafe block and thus get access to direct memory management natively. I don't think that is possible in VB.NET, but I don't really work in VB.NET, though.

HTH.
 
I think Banana has confused the discussion by failing to use the correct terminology. What he's talking about is an ADO disconnected recordset, and the value of it is that it is not connected back to a table or some other data source (a text file, an XML file, a spreadsheet, etc.) -- it can be synthetic, i.e., assembled on the fly from various pieces of information not available in a SQL resultset.

That's the key to the disconnected recordset: it behaves like a recordset and can be used like any recordset but does not have to be created from a SQL statement.

By defining the disconnected recordset in terms of disk storage, the discussion has become confused. The real point of the disconnected recordset is that it is disconnected, and that it doesn't have to be created from a SQL statement, so can thus have any arbitrary data in it that you like.
 
Hmm. That's not how I've understood what disconnected recordsets are and I didn't even talk about it.

Looking at this KB article discussing disconnected recordsets, they have this snippet:

Code:
' Disconnect the recordset.
Set Rs.ActiveConnection = Nothing

The article describes disconnecting the recordset as a means to be able to update the recordset without having to commit the edits back to the source.

The above example I provided did not even discuss this - I simply created a recordset out of nothing. This could later be connected to a source, but that would be unusual and would require that the content of the recordset match the source for the update to make sense, if that is even possible. Furthermore, I mentioned earlier that I believe the behavior may change once it's connected to a provider - we don't know for sure whether a given provider will force a flush or not. Hence my reference to it as an "in-memory recordset" rather than a "disconnected recordset". Maybe there's a better term for it, but I tend to think that "disconnected recordset" is perhaps too specific for a discussion of how recordsets are managed in terms of memory allocation.

As for the other statement, regarding that it does not have to come from a SQL statement: I should point out that's a feature of OLEDB - OLEDB was designed to work with any kind of data source, whereas ODBC assumes a SQL RDBMS.
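For completeness, a minimal sketch of the pattern the KB article describes might look like this (the connection string, table name and SQL are placeholders - aircode):

Code:
Sub DisconnectedRecordsetSketch()
    Dim cn As ADODB.Connection
    Dim rs As ADODB.Recordset

    Set cn = New ADODB.Connection
    cn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\SomeFolder\SomeDb.mdb"

    Set rs = New ADODB.Recordset
    rs.CursorLocation = adUseClient        ' client-side cursor is required
    rs.Open "SELECT * FROM tblSomething", cn, adOpenStatic, adLockBatchOptimistic

    Set rs.ActiveConnection = Nothing      ' disconnect: rs now lives on its own
    cn.Close
    Set cn = Nothing

    ' ... work with rs entirely in memory; edits stay local until the
    ' recordset is reconnected and UpdateBatch is called.

    rs.Close
    Set rs = Nothing
End Sub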
 
FWIW I've always used the distinction that a disconnected recordset is one which has no active connection to a datasource.
A subset of this is the created (or generated) recordset - which never had such a connection.

AFAIK there's nothing official about that second designation, but there does need to be some kind of distinction, and so that's what I use.

Generally speaking, for connected recordsets, the data will always be fully loaded into the recordset as soon as we're talking about client side recordset cursors (no lazy fetching, no keyset behaviour of data page fetching possible).
The disconnected recordset being a subset of those. (All these subsets... I'm getting Pure Maths flashbacks :-s)

So, while we're talking about "weight", the comparison probably needs to be against clientside. (Of which DAO lacks a direct analogy, so we are indeed pretty much comparing ADO recordsets against arrays here).

Must dash. (Apparently it's Friday night).
 
I love asking questions about database philosophy here. One always gets more than bargained for. I take my hat off to Banana and LeighPurvis, whose contributions and interactions invariably provide invaluable insights.

And why do you have to do a .movelast to count the items.

Perhaps I am misguided but my understanding is that a DAO recordset needs the MoveLast because it is only loaded in sections as required while an ADO recordset is always fully loaded hence does not require MoveLast to count the records.

However I note that a forward-only ADO recordset cannot provide a RecordCount, which kind of suggests that an ADO RecordCount is intrinsically moving to first and last behind the scenes anyway.
 
Well, it's a bit more complicated than that. ;) (But they always are! :D)

With DAO, we can choose between Dynaset and Snapshot (there's Dynaset (Inconsistent Snapshot) but that's rarely used). With Dynaset, we basically eagerly download the keys for the navigation. The actual rows are downloaded separately. With snapshot, the rows are eagerly downloaded.

In both cases, though, it's back to the problem of counting - we don't know how many until we actually fetch them all, so we have to wait until { Dynaset: all keys are downloaded | Snapshot: all rows are downloaded }.

With ADO, if my memory serves, client-side static recordsets are similar to DAO's snapshot recordsets (though unlike DAO's snapshot, ADO's static can be updatable). ADO's server-side keyset would be similar to DAO's dynaset. ADO also supports dynamic, which I _think_ is similar to keyset but has additional checking to maintain the versioning of the data.

Now, with DAO, if you open a recordset in a certain way, you can obtain an accurate recordcount without doing any kind of navigation. This was originally a fun puzzle question for which the answer has been already revealed.

There are probably more to it, but I think that should be the gist of it. :)
 
Rushmore Optimisation? I still find that understanding more about Access just equips me to realise how little I really know. Does this feeling ever go away?

I remember feeling like that many years ago when I was learning about electrical and electronic stuff but it seems rudimentary by comparison.
 
my conjecture about memory was this

if you have a PC with only 256MB memory (say) - then I expect that you could still handle a very large recordset with more than 256Mb data, because the recordset would load rows into memory as it needed them.

However, I expect that you could not handle an array of this size - because it wouldn't fit in memory.

Or is the compiler smart enough to page/swap out bits of program variables (ie the array), and enable you to manipulate arrays bigger than the available memory?
 
Just for the tiniest reference back to my original question for the sake of overall continuity.:)

I was talking about using an array rather than a recordset to hold a handful to a couple of dozen records.

Clearly the recordset has indisputable advantages with large sets, especially dynamic sets. But my instinct is that the (did someone say) "vector based addressing" (I do like that term) and "lightweightness" of the array might be better for a small fixed set of records that is being repeatedly applied.

This is the description of a set of string matches to control the parsing of lines in a large text file. I'm working with tens of thousands of lines at a time with a handful of strings tested on each line.

I will certainly clock the recordset versus the array when I finish the project and post the results.
 
Rushmore Optimisation? I still find that understanding more about Access just equips me to realise how little I really know. Does this feeling ever go away?

Now, to be fair, if you mention the phrase "Rushmore optimization" to most Access developers, you may get either a blank stare or a quizzical look: "Yeah, I remember hearing about that somewhere." It's just not widely discussed, probably because it does not always apply - it primarily helps when dealing with multiple indexes, and in the Access world most queries that get bound to forms are against just one table. In reports we might have several tables joined, but in a somewhat hierarchical fashion, typically with just one column participating in the join. So I suspect Rushmore was just something that didn't always avail itself as a necessary solution to the common problems developers face. This is not to say that Rushmore is useless or worthless - it's invaluable, and I've been under the impression it was basically what made Jet so much better when they incorporated it. It was originally developed for FoxPro, but Microsoft then bought it and gutted it for its technology.

The fact is, if you want to learn more about Rushmore, you are more likely to find literature on the subject in articles targeted at FoxPro developers than in articles targeted at Access developers.

I remember feeling like that many years ago when I was learning about electrical and electronic stuff but it seems rudimentary by comparison.

Believe me, there's always more to learn. I feel quite the same way when the subject turns to Sharepoint.

my conjecture about memory was this

if you have a PC with only 256MB memory (say) - then I expect that you could still handle a very large recordset with more than 256Mb data, because the recordset would load rows into memory as it needed them.

However, I expect that you could not handle an array of this size - because it wouldn't fit in memory.

Or is the compiler smart enough to page/swap out bits of program variables (ie the array), and enable you to manipulate arrays bigger than the available memory?

Well, the way I understand it, arrays by definition must be contiguous. But I don't know for a fact that there won't be magic going on. Too bad my XP VM got toasted - that would have been a good way to test it out. Maybe when I have free time to create a new XP VM I can try allocating it too little memory and then see what happens when I make an array too big. I said earlier that I imagine the OS would just swap it out to the VM file, so it'd be essentially as if you were writing to a file that's 512 MB big rather than storing it in memory. But I don't know that for a fact, so that will have to be tested.

Clearly the recordset has indisputable advantages with large sets, especially dynamic sets. But my instinct is that the (did someone say) "vector based addressing" (I do like that term) and "lightweightness" of the array might be better for a small fixed set of records that is being repeatedly applied.

Just to be sure, I'd wager that a recordset would beat out an array even for a fixed set in certain operations. I don't think it can be done with DAO recordsets, but with ADO recordsets it's very easy to re-sort the list; sorting an array is not a simple task and thus could end up quite slow. I certainly would agree that a lot of a recordset's advantages become apparent when we have a set that could be dynamic and not so much when it's fixed, but we have to consider what exactly we are going to do with it to decide what should work the best.
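For example, with a client-side (or fabricated) ADO recordset, re-ordering a fixed set is a one-liner - the field name here is made up:

Code:
rsFieldInfo.Sort = "StartPos ASC"      ' re-order in place, no data copied
' ... step through in the new order ...
rsFieldInfo.Sort = ""                  ' back to the original order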

I will certainly clock the recordset versus the array when I finish the project and post the results.

That'd be awesome. Give us some concrete numbers and a better idea of how things actually work out (if only for a particular implementation, which isn't always easily generalized, but a specific example is better than a bushel of conjectures and brainfarts, I would think ;) ).
 
Banana, OMG, this is a beautiful thing!
' Disconnect the recordset.
Set Rs.ActiveConnection = Nothing
This one line of code has pretty significant implications to me, like overnight it'll change how I do data access for class modules.
Thanks for the insight,
Mark
 
>> This is the description of a set of string matches to control the parsing of lines in a large text file. I'm working with tens of thousands of lines at a time with a handful of strings tested on each line.

Have you considered using recordsets to do everything, including the loading of text file data?

It won't necessarily be the fastest - but ADO (using the Jet provider ;-) loads delimited text file data quite nicely.
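For example (the folder path and file name are entirely made up - aircode):

Code:
Sub LoadTextViaJetSketch()
    Dim rs As ADODB.Recordset

    Set rs = New ADODB.Recordset
    rs.Open "SELECT * FROM [Import.txt]", _
            "Provider=Microsoft.Jet.OLEDB.4.0;" & _
            "Data Source=C:\SomeFolder\;" & _
            "Extended Properties=""Text;HDR=No;FMT=Delimited""", _
            adOpenStatic, adLockReadOnly

    Do Until rs.EOF
        Debug.Print rs.Fields(0).Value
        rs.MoveNext
    Loop

    rs.Close
    Set rs = Nothing
End Sub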

Cheers.
 
