Data transformation in or out of Access - best practice questions

dcumming

Hi All,

I'm thinking of redesigning my database to try to improve performance and introduce a slew of new functionality that I currently can't get my head round, because the table structures are a mess and too many data sets are doing things they weren't originally intended for.

There are a few general practice questions I'd love to hear people's thoughts on.
  1. Currently, I get a number of different supplier spreadsheets in varying formats with different data fields/formats. I have been doing some of the data transformation in Excel and some in Access - is there a general best practice for where and when to perform data transformation?
  2. Currently, I have these supplier spreadsheets as linked tables; I get updates daily, so I do a little data transformation before appending them all into one huge table. Will performance be considerably better if I import these supplier spreadsheets as tables each day? Is there anything I need to consider here - for example, if I have a Relationships map, how will it be affected, if at all?
  3. Lastly, I was always taught to separate all 'columns' into separate tables where possible, and pull it together in queries. A friend of mine recently suggested that's 'not what the kids are doing now, they go for large tables'. I'd be interested to hear thoughts on this.
Thanks in advance
David
 
Best practice - from my point of view:
1) You create a database with a database schema and table structures and relationships that correspond to the functionality you require and your well-thought-out workflow. External supplier tables do not play a significant role at first.
2) Once this database structure is in place, the second step is to import the existing (and, later, incoming) data and divide it into the tables in your database, including the creation of keys for linking.
3) If new supplier tables arrive later, you should use a mechanism that imports only the new data into your tables (see the sketch below).

You can then easily obtain the kids' single large table by joining your tables in a query.
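As a minimal sketch of the "import only new data" mechanism from point 3 - every table and field name here (tblPrices, tblSupplierA_Link, PartNo) is a hypothetical placeholder, not a real object - an unmatched-join append run from VBA could look like this:

Code:
' Sketch only: append rows from the linked supplier sheet that are not
' already in the unified table, matched on a supplier part number.
Public Sub AppendNewSupplierRows()
    Dim db As DAO.Database
    Set db = CurrentDb

    db.Execute _
        "INSERT INTO tblPrices (PartNo, Description, Price, Source) " & _
        "SELECT s.PartNo, s.Description, s.Price, 'SupplierA' " & _
        "FROM tblSupplierA_Link AS s LEFT JOIN tblPrices AS p " & _
        "ON (s.PartNo = p.PartNo AND p.Source = 'SupplierA') " & _
        "WHERE p.PartNo Is Null", dbFailOnError

    Debug.Print db.RecordsAffected & " new rows appended."
End Sub

The LEFT JOIN ... WHERE p.PartNo Is Null pattern keeps only rows with no match in the destination, so re-running the import is harmless.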
 
Relational database design is a principled method of "separating columns into separate tables". The way the question is phrased suggests that some investment in learning about Normalization would be a good way to prepare for a redesign.

Kids these days certainly do "go for large tables". However, as my mother used to say, "Just because all of your friends are jumping off bridges and drowning doesn't mean you should join them."

In fact, there are some principled reasons for using NoSQL database designs, e.g. huge e-commerce sites. But the last reason I'd ever give for going that route is "kids today are doing it". They're kids; they may or may not have any idea what they are doing and why.
 
I've got to second the notion of studying Database Normalization as a helpful thing. Do a search of THIS forum for Normalization (because we ARE a database forum) but for the general web, search for Database Normalization. The reason is that the word Normalization has specific and non-trivial meanings in math, chemistry, medicine, diplomatic relations, certain types of engineering, ... it's a long list. If you do a web search, you want to look more at EDU sites first because the COM sites usually have something to sell you. But that is not to say that the COM sites don't have good explanations.

As to those supplier data sheets, you will eventually go crazy UNLESS you somehow unify everything into a single supplier table. The question of the "best" method depends on what you intend to do next with the sheets once you have them in hand. Creating too many linked tables can become a problem down the road, particularly when those tables ALL represent the same thing.

It is always ALWAYS ALWAYS right to take time up-front to lay out a detailed plan of how you intend to approach a problem. A plan that isn't only in your head, but is also on paper or in a computer document, is exactly what you need. You need it so that, one of those bleary nights with too much caffeine and too many stale doughnuts to keep you going, you can go back and LOOK UP what you intended to do instead of wracking your tired brain. (If that sounds like I've been there... yep! I have.)

As to preferred Access table structure... Access works best with tall, narrow tables rather than short, wide tables. By "narrow vs wide" I refer to the number of fields in a table. As to "tall vs. short" - Access works best (and is most consistent with normalization) when you make things to be laid out in lists using a "divide and conquer" approach. When you study normalization you would recognize that parent-child tables are your friends, and they tend to be taller and narrower.
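To make "tall and narrow" concrete, here is a sketch in Access DDL run from VBA - every name in it is a hypothetical placeholder. One row per product in the parent table, and one row per product-per-supplier in the child table, rather than one wide row with a column for each supplier:

Code:
' Sketch only - hypothetical names. A tall, narrow parent-child layout.
Public Sub CreateParentChildTables()
    With CurrentDb
        .Execute "CREATE TABLE tblProduct (" & _
                 "ProductID AUTOINCREMENT PRIMARY KEY, " & _
                 "Barcode TEXT(20), Title TEXT(100))"
        ' One ROW per supplier per product, not one COLUMN per supplier:
        .Execute "CREATE TABLE tblProductSupplier (" & _
                 "ProductSupplierID AUTOINCREMENT PRIMARY KEY, " & _
                 "ProductID LONG REFERENCES tblProduct (ProductID), " & _
                 "SupplierID LONG, SupplierPrice CURRENCY)"
    End With
End Sub

Taking on a tenth supplier then means adding rows, not redesigning a table.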
 
Hi @GPGeorge

Thanks for the input.
I was more trying to understand why new designers might go for this approach. I was always taught separate tables and whilst I will likely go that way, I wondered if there was new thinking. After all, sometimes we look back and say - why didn't we always do it like this?

David
 
And thank you to @The_Doc_Man too.

Planning what I want to get out before looking at the data itself is definitely the way to go, isn't it?

Really, I hadn't thought of it like that previously; I was more "get the data in and get on with it"... but of course that has brought me problems down the line. I've ended up using one table for something entirely different from its original design, and it's caused me no end of grief!

David
 
get the data in and get on with it

This is the typical mindset of a somewhat higher pressure work environment - and it is almost always wrong for anything new.

You always work backwards... by asking "What do I want to get out of it?" Then work backwards a little more to decide how you gather the desired data (omitting what you DON'T need at the moment) so you CAN get out your end report. But that gathering usually backs up a little more, to define how you store the freshly acquired data for its intended further use. Another back-step leads to the input question: how does the data get there in the first place? Did you think that was the end of back-stepping? Nope - one more part. Will the data arrive in the format I need? And if not, what must I do to get it into the desired format? And if you think THAT is all... nope. No such luck.

For any complex program, design is iterative. You see a problem in your plans, so you fix it... only to realize your fix just broke the design for the production of left-handed veeblefetzers. So you fix THAT problem and suddenly realize that, as a result, your eaotin shrdlu just broke, and so on. It is never easy, because in a compact business things are inter-related, so diddling with one thing usually diddles with another as a side-effect. It is the same in programming.

Which is why you need to either invest in wigs to hide the hair-pulling, or invest in work-approved head coverings to prevent hair-pulling.
 
They're kids; they may or may not have any idea what they are doing and why.
Children like shiny new things. Professional Developers like things that work even if they may be a little old fashioned. If I have a need to do something new and entirely different, then I would certainly look into options other than Access. Change for the sake of change is for companies like Microsoft where they are always changing things and not necessarily for the better. Different is not better.

Usually you don't have any control over price list formats or whatever it is you are importing. Linking each one separately doesn't sound like the right method. I would think that a consolidated table with all prices in one place would make more sense. Given that, you would need to rebuild the table daily if you get changes daily. Just pick a time and always make the change then. When you are doing bulk table updates, it is best to do them outside of normal business hours.

So for a consolidated table, you need to keep a source field and an "as of" date. Then, when you have to replace a list, you would delete that "source" and run an append query that formats the replacement. This will of course prevent you from linking to these tables, but in the case of prices, they are generally copied into an order at the time the order is placed, so the fact that the current price differs from the one on an old order is irrelevant.
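A minimal sketch of that delete-then-append cycle - the table and field names (tblPrices, tblStaging_x, Source, AsOfDate) are assumptions, and validation is omitted:

Code:
' Sketch only: replace one supplier's price list in the consolidated table.
Public Sub ReplacePriceList(ByVal SourceName As String)
    Dim db As DAO.Database
    Set db = CurrentDb

    ' Remove the old list for this source...
    db.Execute "DELETE FROM tblPrices WHERE Source = '" & SourceName & "'", _
               dbFailOnError

    ' ...then append the freshly formatted replacement with an "as of" date.
    db.Execute "INSERT INTO tblPrices (PartNo, Price, Source, AsOfDate) " & _
               "SELECT PartNo, Price, '" & SourceName & "', Date() " & _
               "FROM tblStaging_" & SourceName, dbFailOnError
End Sub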

And since this price table is in constant flux and not related to any other table, I would put it in a stand-alone BE so the bloat doesn't bother your main BE, and make sure you run a compact on this db fairly frequently.
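Since a BE can't compact itself while anyone has it open, one way to do this is from the FE with DAO's CompactDatabase - a sketch, with assumed paths, to be run only when the BE is closed by all users:

Code:
' Sketch only: compact a stand-alone BE from the FE. CompactDatabase
' writes a NEW file, so compact to a temp name and then swap the files.
Public Sub CompactPriceBE()
    Const BE_PATH As String = "\\server\data\Prices_BE.accdb"   ' assumed
    Const TMP_PATH As String = "\\server\data\Prices_BE.tmp"

    DBEngine.CompactDatabase BE_PATH, TMP_PATH
    Kill BE_PATH                  ' remove the bloated original
    Name TMP_PATH As BE_PATH      ' rename the compacted copy into place
End Sub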
 
Thanks @Pat Hartman

Some thoughts there on splitting data into a separate db, thank you - it's been on my mind a bit too. It goes back to the requirements again - I've always had separate tables for each supplier (due to the different-format spreadsheets), but that doesn't mean I have to keep it that way.

I've been running Compact & Repair on it fairly regularly; at its biggest it's up to 2 GB, and compacting can bring it down to 300 MB - I think this is down to how I have structured the tables in an inefficient way. We'll get round that this time with normalization.

Enjoying all your ideas, thank you!

It's been a while since I did any formal Access training, and switching tack a little..... but is "one database for the data, one for the queries/forms" a good idea?

David
 
If you have a properly modelled data design before you start, you shouldn't run into such problems, unless the real world subsequently changes the structure of the data.

You can then make changes that de-normalize the data for a specific purpose, if absolutely necessary.

In passing, that is the reason I really hate attachment fields.
 
It's been a while since I did any formal Access training, and switching tack a little..... but is "one database for the data, one for the queries/forms" a good idea?

It is a very important idea. The ONLY time I would NOT split a database is if it is (a) not going to be very big AND (b) not going to be shared among multiple users AND (c) not going to be used for very long AND (d) maybe too much for Excel to do cleanly.

The last time I ran into a situation that met all four criteria was when I was analyzing and organizing numerous repetitive experiment results for my dissertation research in the 1974-1975 time frame - which is to say, before personal databases were available. Oh, MAN, could I have whipped out my computations fast with Access! But it was still about 20 years away. All I had was a notebook full of chemical experiment observations and a hand-held calculator with some advanced math functions beyond the basics of most simple hand-held calculators. And a lot of time running and re-running the numbers to verify that my hand operations had been done properly.

Here's the thought behind splitting a database. If you are going to share the DB app with others, then you need to split the data from the other stuff because you can't share data quite so easily in a monolithic database. BUT due to file locking and the dreadful consequences of unexpected lock "collisions" (two people wanting the same exact thing at the same time) you don't want multiple folks opening the same massive database at the same time. This is how databases get corrupted.

SO.... first you split the data into front end and back end (FE/BE) where the data tables are all in the BE. The queries, forms, reports, macros, and modules (the programming part) are all in the FE. You make a COPY of a "master" front end and give everyone their own copy. That means that what each person opens (the FE) is PRIVATE to them. Which in turn means that there is no one else trying to open that file at the same time - thus lock collisions don't occur for the FE. And since you can give them a copy to put on their machines, the file locking that DOES occur is local to each workstation because the FE is locally hosted.

Then with the BE file to be shared, usually you set up locking style to either NO LOCKS (for SELECT queries) or Optimistic Locking (for action queries) and let Access handle the arbitration. Which is easy to do because I believe Optimistic Locking is the default setting. Doesn't totally prevent lock collisions, but does drastically minimize them.

Then, given that most of your changes will be in the FE file it becomes easy to distribute a new FE to users without bunging up the BE file - because it isn't changing when you make code updates. The code is in the FE. Pat Hartman has published auto-updater batch scripts as short as 4 lines long that would guarantee that everyone gets the latest FE version every time they launch. There are other strategies, but they have in common that your code distribution and management become FAR easier for split DB than for monolithic DB.

AND... there is a database splitter tool in the ribbon among the database tools, so mechanically it isn't that hard to split a DB.
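And if the BE file ever moves, the FE's table links can be re-pointed in code instead of through the Linked Table Manager. A sketch (the path is whatever your BE's path is; ";DATABASE=" is the prefix of a Jet/ACE linked table's Connect string):

Code:
' Sketch only: re-point every Access-to-Access linked table at a new BE.
Public Sub RelinkTables(ByVal BEPath As String)
    Dim tdf As DAO.TableDef
    For Each tdf In CurrentDb.TableDefs
        ' Linked Jet/ACE tables have Connect like ";DATABASE=<path>"
        If Left$(tdf.Connect, 10) = ";DATABASE=" Then
            tdf.Connect = ";DATABASE=" & BEPath
            tdf.RefreshLink
        End If
    Next tdf
End Sub

Called as, say: RelinkTables "\\server\data\MyData_BE.accdb"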
 
Thanks Doc - it will likely be just me; however, my son is doing some admin work for me in terms of stock, so the next step would be to look at him using the db.
 
Thank you for clarifying.

As I noted, there are places where the No-SQL approach does make sense, but too many naive newcomers who have never experienced anything else don't realize that they are looking only at a narrow slice of the total picture. In other words, my response to the question of why new application designers might go that approach is a) that's what they're taught, and b) they don't know any better because they lack sufficient experience to understand the complete picture.

I guess I'd say that it's not an either-or choice. Pick the approach that best fits the requirements at hand, but don't be fooled into thinking the newest is going to be the best.

I guess the answer to "why didn't we always do it like this?" comes from finding that the newer alternative actually makes more sense. But that's not always the case. We still wear coats, gloves and boots in the winter after tens of thousands, maybe hundreds of thousands, of years of civilization. Styles change, but the basic functionality remains. Until we figure out a way to stay warm in cold weather without overcoats, they're going to be pretty important to survival.
 
So,

With my PM head on, I spent some time writing down what I believed would be the benefits of doing this, and whilst I won't go into those, I wanted to think about whether it really made sense or whether I should just spend time unpicking the current database.

It does stack up in my mind; as much as anything else, I really enjoy tinkering with Access databases... and you can't put a price on that :)

I then went on to write down what I needed the database to do. Ultimately that's to 'sell more', but also to save me time, enable me to list on other websites (not just eBay), and give me some intelligent decision-making on buying stock.

I've written down all the data sets I need - exports for eBay and the other sites I currently use, 'what has sold well with the wholesalers', what's back in stock that sold well, etc.

All this really started last week when I attended an eBay sellers' event and learned more about how their site really works - I learned a lot and instantly knew I needed to update how I export data upload files, how frequently, etc.

The reason I mention this is that, from this event, I've started to combine the requirement 'sell more' with the solution 'the data that's required to list well'.

So the next thing, I suppose, is to start designing the tables - would you agree?

I've an obvious unique key for all the products - the barcode. I sell music. I may start a separate thread on unique keys, as I may need to have two or even three (some records/CDs have more than one item with the same barcode - an original/reissue of a record, available in two different colours for each release, in different countries... all with the same barcode).
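One common way round that - sketched here with hypothetical names - is an AutoNumber surrogate primary key, with the barcode kept as an ordinary non-unique indexed field, so several variants of a release can share one barcode:

Code:
' Sketch only - hypothetical names. Surrogate key; barcode indexed but
' deliberately NOT unique, since variants can share one barcode.
Public Sub CreateReleaseTable()
    With CurrentDb
        .Execute "CREATE TABLE tblRelease (" & _
                 "ReleaseID AUTOINCREMENT PRIMARY KEY, " & _
                 "Barcode TEXT(20), Variant TEXT(50), Country TEXT(2))"
        .Execute "CREATE INDEX idxBarcode ON tblRelease (Barcode)"
    End With
End Sub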

So - does anyone have any software recommendations for database table mapping? I'm sure I could search Google as well as the next person, but it would be interesting to hear any thoughts.

Also - as I was typing this, I thought I might keep adding to it as the project goes along. It might make for an interesting thread, and it will be interesting to see if the delivery meets the benefits I've predicted... interesting with my PM head on, anyway :0)

David
 
If different suppliers send you different layouts, ideally you could get them to standardise; if not, you will probably end up with multiple massage processes. Still, you only need to do that development once per supplier.

I used to do some verification within the spreadsheet to make sure it was properly formed, and reject it if not. But I prefer not to have to worry that users might edit the spreadsheet and change the data, so the sooner you can get the data imported into Access/SQL tables, the better.

Edit: when I say in/with the spreadsheet, what I really do is import the data into a temporary table, then make sure the column headers are all there correctly. I also do some checking in the columns to make sure there are no obvious data errors. I sometimes need to add columns, e.g. to change some data into a better format - strangely formatted dates, for instance. Sometimes I need to add a lookup value so that Access can correctly match spreadsheet rows to particular database rows. I still find there are a couple of repeated issues where the easiest (or only) solution is to fix the issue manually each time.
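A sketch of that import-then-verify step - the path, scratch table name, and expected headers below are all hypothetical:

Code:
' Sketch only: pull the sheet into a scratch table, then confirm the
' expected columns arrived before appending anywhere permanent.
Public Sub ImportAndCheckSheet(ByVal XlsxPath As String)
    Dim tdf As DAO.TableDef, fld As DAO.Field
    Dim expected As Variant, item As Variant, found As Boolean

    On Error Resume Next
    DoCmd.DeleteObject acTable, "tblImportTemp"   ' drop last run's table
    On Error GoTo 0

    ' True = the first spreadsheet row holds the column headers
    DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel12Xml, _
        "tblImportTemp", XlsxPath, True

    Set tdf = CurrentDb.TableDefs("tblImportTemp")
    expected = Array("PartNo", "Description", "Price")   ' assumed headers
    For Each item In expected
        found = False
        For Each fld In tdf.Fields
            If fld.Name = item Then found = True
        Next fld
        If Not found Then
            MsgBox "Column missing from " & XlsxPath & ": " & item
            Exit Sub
        End If
    Next item
    ' ...column-level data checks and the real append queries go here.
End Sub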
 
They definitely won't change their outputs, but it's not the end of the world, as I've got most of the macros to transform them.
 
Well, the hard bit is working with other people's data. If you can't get your suppliers to change to a standard system, you will have to convert each supplier's data into a usable format. What happens if the Acme Corporation changes their layout - adds a few more columns, for instance? Then you will have to realise the format has changed and amend your system to take that into account. It's not so bad: you only have to do the transform once, until or unless the inbound file changes.

The next step is to bring that data, and the data from the other suppliers' spreadsheets, into your database. Then you can forget about the spreadsheets. You can repeat the process with successive supplier files, after verifying that the file structures are still consistent.

This is part of the reason why some market sectors try to produce standardised XML or JSON file structures so that all documents comply with an agreed structure, and you don't have to keep reinventing the wheel.
 
