Need advice on DB design.

sumdumgai

Registered User.
Local time
Today, 10:13
Joined
Jul 19, 2007
Messages
453
Hello again. Looking for the great advice that I always get on this forum. I'm working on a new application that tracks sales.


Let's say the application has these characteristics:

  • There are 800,000 salespersons each with unique ID (I).
  • The number of product groups being sold could be 6 to 12 (P).
  • Each product group could have up to 20 subgroups (S).
  • Each subgroup will have 12 attributes which is where sales quantities are given (A).
  • This sales data will be imported once a month.
  • A date for the monthly sales data will either be included or it can be generated.
Typical queries might be something like:

  • Show me most recent sales for Salesperson (I1) for Attribute (A1) for Subgroup (S1) for Product (P1).
  • Show me total sales for Attribute (A1) for Subgroup (S1) for Product (P1) for last 12 months.
  • Show me best selling Attribute (A1) for Subgroup (S1) for Product (P1) for last 12 months.
  • Show me most recent sales of all Attributes for all Subgroups for all Products.
  • Show me total sales of all Attributes for all Subgroups for all Products for last 12 months.
My questions are:

  • Since the number of potential fields for just this data could be 12 x 20 x 12, or 2,880, I can't put all products in one table (255-field Access limit). I suspect that Access's 2 GB limit will also be a problem. Therefore, linking the tables from different databases (D1 through D12) is an option. What would queries generally look like if I'm summing multiple product groups?
  • Is this too much to ask of Access?
  • If not, besides the obvious indexing that would be required, is there any other performance consideration?
Thanks as always for any comments or suggestions.
 

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Today, 09:13
Joined
Feb 28, 2001
Messages
27,159
Given the number of sales persons, you probably are looking at SQL Server as a back end. Access could certainly still handle a front end.

However,

Since the number of potential fields for just this data could be 12 x 20 x 12, or 2,880, I can't put all products in one table (255-field Access limit).

No. Absolutely not. That is flat-file (EXCEL) thinking. You have a natural hierarchy of groups and subgroups. Each level of the hierarchy would be a table for that level only, with a child table for the next level down. Keep the levels separate. Then the number of fields does not enter into play.

Your typical queries are all easy. Look up two topics: the TOP predicate and SQL aggregates. Between them they cover "most recent sales" (TOP 1 when ordered by descending date) and total sales for a given A, S, P (a SUM over a query that GROUPs BY A, S and P).
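
As a very rough sketch of those two shapes from VBA (tblSales, SaleDate, Quantity, RepID and ProductID are names I am inventing for illustration, not anything you have yet):

Code:
' "Most recent sale" for one rep/product combination: TOP 1 ordered by descending date
Dim db As DAO.Database
Dim rs As DAO.Recordset
Set db = CurrentDb

Set rs = db.OpenRecordset( _
    "SELECT TOP 1 SaleDate, Quantity " & _
    "FROM tblSales " & _
    "WHERE RepID = 12345 AND ProductID = 67 " & _
    "ORDER BY SaleDate DESC;")

' "Total sales" over the last 12 months: SUM plus GROUP BY
Set rs = db.OpenRecordset( _
    "SELECT ProductID, Sum(Quantity) AS TotalQty " & _
    "FROM tblSales " & _
    "WHERE SaleDate >= DateAdd('m', -12, Date()) " & _
    "GROUP BY ProductID;")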

Your issue is going to be a design problem first. You have a lot of data. But if you store too much with every individual sales record, you will blow out darned near anything you try to use for the data storage. Think "NORMALIZATION" right now while you are in the design phase or you will put a .44 magnum slug in your foot.

I'm not quite sure how the word "attribute" fits in. What is the product or service being sold? You show "product group" and mention a subgroup and an attribute - but what is actually being sold and how do these things fit in?

Regarding your question on indexes and performance... with this many sales persons and the detailed volume implied by 800,000 sales reps, if you DON'T index stuff, you won't HAVE any performance whether we are talking native Access, SQL Server, or ORACLE ENTERPRISE SERVER on a really high-performance back-end machine.
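
For example, assuming a tblSales table along the lines sketched above, the indexes themselves are just a bit of DDL run once:

Code:
' Hypothetical index DDL against an assumed tblSales table
Dim db As DAO.Database
Set db = CurrentDb
db.Execute "CREATE INDEX idxRep ON tblSales (RepID);", dbFailOnError
db.Execute "CREATE INDEX idxProduct ON tblSales (ProductID);", dbFailOnError
db.Execute "CREATE INDEX idxSaleDate ON tblSales (SaleDate);", dbFailOnError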
 

sumdumgai

Registered User.
Local time
Today, 10:13
Joined
Jul 19, 2007
Messages
453
Thanks loads, Doc.

Think of the app this way.

Product groups are vehicle brand names (e.g., Ford, GM, Kia), subgroups are vehicle types (e.g., sedan, SUV, truck), and attributes are colors (up to 12 of them).

Budget plays a big role, so we're trying to fit this into Access. Fast query response time is less important than getting the reports, even if they take a while to run.
 
Last edited:

plog

Banishment Pending
Local time
Today, 09:13
Joined
May 11, 2011
Messages
11,643
Normally I wouldn't post just to confirm good advice, but I want to underscore how important normalization is here.

With the amount of data you are talking about, properly structuring your tables makes or breaks this thing in Access. Set it up like the spreadsheet layout you implied in your initial post and Access just won't be able to accommodate your data. However, if you properly normalize it, Access should be able to handle it.
 

sumdumgai

Registered User.
Local time
Today, 10:13
Joined
Jul 19, 2007
Messages
453
I understand normalization to mean setting up tables so that duplication of data is reduced as much as possible. Also, that searched fields are indexed. I know it's hard to describe in words here, but given my examples of Product Groups, Subgroups and Attributes, can someone please explain a little more how this data would be normalized?
 

Uncle Gizmo

Nifty Access Guy
Staff member
Local time
Today, 15:13
Joined
Jul 9, 2003
Messages
16,278
I can't put all products in one table (255 field Access limit). I suspect that Access' 2GB limit will also be a problem.

I'm with Richard and Jason on this as well. You see, the problem is that from an Excel user's point of view MS Access looks a lot like Excel, but with a stupid limit on the number of columns! So you ask yourself, how am I going to get round this column limit? And that will lead you to hell! The CORRECT solution is counter-intuitive: you don't put your data in columns, you put it in rows. Now, you might think that's a lot of bother, and it might well be, and you might think, "I'm sure I can add my records to numerous tables and link them all together", and yes, you can do that! But I can 100 per cent guarantee you will regret it down the line.

I have blogged about it, in text and video, on my website here. Have a look and hopefully you will see that you will be much better off kicking off on the right foot, as it were. Heed the advice here!

http://www.niftyaccess.com/excel-in-access/
 

plog

Banishment Pending
Local time
Today, 09:13
Joined
May 11, 2011
Messages
11,643
Suppose you have sales data by month for 3 sales people. Using a spreadsheet, you use 13 columns (SalesPerson + 12 months) and 3 rows (one per sales person). In Access you use 3 columns (SalesPerson, SalesMonth, SalesAmount) and 36 rows (3 sales people * 12 months).

That's the difference we are talking about. The Access method (aka normalization) is more efficient for a database.
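
A minimal sketch of that three-column table, with placeholder names:

Code:
' Create the normalized monthly-sales table described above (names are illustrative only)
Dim db As DAO.Database
Set db = CurrentDb
db.Execute "CREATE TABLE tblMonthlySales (" & _
           "SalesPerson LONG, " & _
           "SalesMonth DATETIME, " & _
           "SalesAmount CURRENCY);", dbFailOnError

' One row per salesperson per month, e.g. salesperson 1, January, 500
db.Execute "INSERT INTO tblMonthlySales VALUES (1, #01/01/2018#, 500);", dbFailOnError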
 

sumdumgai

Registered User.
Local time
Today, 10:13
Joined
Jul 19, 2007
Messages
453
More info:

Data will arrive as a delimited text file: 800,000 records with 450 fields per record. The first few fields are sales person identifiers; the remaining fields come in sets of 12, with each field in the set representing a different attribute and containing the quantity sold. The header identifies the group and subgroup. So the fields may look like this:

sales id, 1A-1, 1A-2, 1A-3, ..., 1A-12, 1B-1, 1B-2, 1B-3, ..., 1B-12, ..., 12A-1, 12A-2, 12A-3, ..., 12S-1, ..., 12S-12

where the leading number is the group number (there are 6 to 12 groups), the letter is the subgroup (A-Z), and the number after the hyphen is the attribute number (1-12).
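
For reference, a token like "1A-3" could be picked apart along these lines (a rough sketch; the routine name is made up):

Code:
' Split a header such as "1A-3" into group number, subgroup letter and attribute number.
' Assumes the format described above: digits, one letter, a hyphen, then digits.
Public Sub ParseHeader(ByVal strHeader As String, _
                       ByRef lngGroup As Long, _
                       ByRef strSubgroup As String, _
                       ByRef lngAttribute As Long)
    Dim lngHyphen As Long
    Dim strPrefix As String

    lngHyphen = InStr(strHeader, "-")
    strPrefix = Left$(strHeader, lngHyphen - 1)           ' e.g. "1A"
    strSubgroup = Right$(strPrefix, 1)                    ' e.g. "A"
    lngGroup = CLng(Left$(strPrefix, Len(strPrefix) - 1)) ' e.g. 1
    lngAttribute = CLng(Mid$(strHeader, lngHyphen + 1))   ' e.g. 3
End Sub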

I was thinking that I would first split this .txt file horizontally by group number, with the sales id included as the first field of each split file. Then I would import these files into separate databases and link the tables.


Note that every sales person will have sales. Also, less than 5% of the possible sales buckets, on average, are non-zero. In the most recent monthly data, the group with the most subgroups has 19 of them, with 12 attributes each. With 780,000 sales persons, that's about 180 million sales buckets for that group alone. So, with 5% being sold, about 8.9 million buckets will be non-zero.


Comments please?

Thanks.
 
Last edited:

Uncle Gizmo

Nifty Access Guy
Staff member
Local time
Today, 15:13
Joined
Jul 9, 2003
Messages
16,278
I think your problem is going to be importing directly from a comma-separated values file into MS Access. Due to the number of columns, I don't think you're going to find this simple! You might be better off importing the data into Excel and then dividing it up in there. Alternatively, I think you could use Python; it has some excellent facilities for handling data like that. There was a similar thread here a while back, and I seem to recall someone had some software for doing it. I might have a look for it if I've got a minute.

Now why did I say that? Well, in 2017 I got it into my head that I wanted to import comma-separated values directly into MS Access. I've looked at my notes and copied them below with some minor edits:

From my 2017 Notes on importing CSV with VBA...
The problem was that when I tried to update the table to the new requirements, a small table with fewer than 50 fields worked fine; however, when I got up to tables of 255 fields I got error 3047.

Something to do with having too much data somewhere or other; I assumed it was in the table.

I experimented with tables containing less data, and sure enough I could modify more fields, but I still could not get above 90 fields.

I realised that the information I had gathered from the data could be used in a different way: instead of modifying the fields in the table that was already created, I could create a table from scratch and use the information to make the fields the right size for the data.

Once I created this table I wouldn't have to go through it and modify the fields, because they would already be the right size, so I would avoid the error.

A bit long-winded, but it seemed like a workable solution to my problem.

I did this, however, and I still got an error... So now I had a table with the correct number of fields, and the correct type of field for the data that would go into them, but I was still getting an error.

The annoying thing is that if I set the data transfer app so that it only transfers 45 columns from the comma-separated values into the first 45 columns in the table, it works OK. However, if I try to transfer 255 columns (really, anything above 50 columns), then it fails.

Then I thought: well, why not transfer over blocks of data? Transfer the first 45 columns, then adjust the code so that it transfers the next 45 columns, and so on until all of the data is transferred from the comma-separated values file into the new table.

The first 45 columns transfer over properly; however, nothing else will. I don't get any error message or anything, I just get an empty table.

During this process I also created some code that would import a comma-separated values file with more than 255 columns; it does this by creating several new tables, as many as required to hold the data from the file.

However, this process is still a compromise, because I cannot copy 255 columns at a time, only 45, so I end up with lots of tables with 45 columns each.

My next move...

I could rewrite the code and see if I come up with a solution; however, after two days of head scratching I want a solution that I know is going to work. So my next phase is to join the new 45-column tables together to create one new table. I haven't quite worked out how I'm going to automate that yet, but I'm going to give it a try.

I haven't got a clue why I started doing this; it just seemed like a good idea at the time!!
 

Uncle Gizmo

Nifty Access Guy
Staff member
Local time
Today, 15:13
Joined
Jul 9, 2003
Messages
16,278
Please see my updated comment above.

Well, I'm not sure we're both on the same page. I saw your comment >>> I would first split this .txt file horizontally by group number <<< and assumed that you will be importing the 450-odd columns?
 

sumdumgai

Registered User.
Local time
Today, 10:13
Joined
Jul 19, 2007
Messages
453
No, I'd split the 450 fields in the .txt file horizontally by group number into separate .txt files. To analyze the data, I split the .txt file vertically into multiple files of 10,000 records each, and then imported one of those files into Excel. I found that one group has 2 subgroups, and another has 19, which is the most. So, in the case of the group with 19 subgroups, with each subgroup having 12 attributes, there would be 228 fields (plus a sales person ID field) that I would import into Access.
 

Uncle Gizmo

Nifty Access Guy
Staff member
Local time
Today, 15:13
Joined
Jul 9, 2003
Messages
16,278
I went back in time and found the thread where I was messing around with using VBA to import comma-separated values with over 255 fields.

Joining large tables together

I had a lot of problems with it as you can see in the thread. Thought it might be useful to you in your endeavour!
 

sumdumgai

Registered User.
Local time
Today, 10:13
Joined
Jul 19, 2007
Messages
453
I tried to import a .txt file with 229 fields and 780,000 records. Access said it ran out of resources, which I did not expect. There may be a problem with that .txt file. I'll take a look at your solution.
 

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Today, 09:13
Joined
Feb 28, 2001
Messages
27,159
OK, "ran out of resources" is a different question. Not the first time we have seen this.

"Resources" is a reference to a Windows data structure that is used to handle stuff of different types including graphic images, icons, programs, and files. You often run into the problem of resource depletion when something you are doing forces you to handle too much at once or too many different things at once. Let me leave this idea, though, because I think normalization comes first.

You have a sales person. I infer that your spreadsheet is rows of monthly sales broken down such that each cell represents a different product group, subgroup, and attribute. That is "flat-file" layout and is what spreadsheets do.

The Access equivalent is that you would have a table of sales reps. Each rep has a RepID.

You would have a table of products where group, subgroup, and attribute are fields. You would have a ProductID that uniquely identifies a single combination of group, subgroup, and attribute. From your discussion, you might have maybe 1200-1500 combinations of these.

The master sales table would have as fields the RepID, ProductID, and whatever you use as a date, plus a quantity of sales - BUT only if not zero. You will STORE NOTHING at all if nobody bought the Baby-puke green Cadillac Escalade.

To import the dataset, you need to learn how to open an Excel Application Object (look up that topic for reading material). You would open your spreadsheet and write a VBA loop to scan that big, ugly puppy one row at a time, traversing each cell to find a non-zero sale. You have the Rep ID from the first couple of cells in the row, and you know which cell you are reading by its position within the row. So you can figure out which Product ID to use when entering a non-zero sale. Inserting a sales record would probably be done via a recordset that stores RepID, ProductID, date, and quantity - four fields per non-zero entry.
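
A rough sketch of that loop; the workbook path, table name, and especially the column-to-ProductID mapping are placeholders, not the real layout:

Code:
' Scan the spreadsheet one row at a time and append only the non-zero cells
' to a normalized sales table. All names here are illustrative.
Public Sub ImportMonthlySales()
    Dim xl As Object                ' Excel.Application, late bound
    Dim wb As Object, ws As Object
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Dim lngRow As Long, lngCol As Long
    Dim lngLastRow As Long, lngLastCol As Long
    Dim lngQty As Long

    Set xl = CreateObject("Excel.Application")
    Set wb = xl.Workbooks.Open("C:\Data\MonthlySales.xlsx")   ' assumed path
    Set ws = wb.Worksheets(1)
    lngLastRow = ws.UsedRange.Rows.Count
    lngLastCol = ws.UsedRange.Columns.Count

    Set db = CurrentDb
    Set rs = db.OpenRecordset("tblSales", dbOpenDynaset)

    For lngRow = 2 To lngLastRow            ' row 1 assumed to be headers
        For lngCol = 2 To lngLastCol        ' column 1 assumed to be the rep ID
            lngQty = Nz(ws.Cells(lngRow, lngCol).Value, 0)
            If lngQty <> 0 Then
                rs.AddNew
                rs!RepID = ws.Cells(lngRow, 1).Value
                rs!ProductID = lngCol - 1   ' stand-in for a real column-to-ProductID lookup
                rs!SaleDate = Date          ' or the month supplied with the file
                rs!Quantity = lngQty
                rs.Update
            End If
        Next lngCol
    Next lngRow

    rs.Close
    wb.Close False
    xl.Quit
End Sub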

OK, 780,000 rows later, you have finished this big ugly puppy. Close the object. You now have a list of ONLY those products that were sold.

To build your reports, you would now write a JOIN query to tie together each sales person ID to the sales person details. AND that same query can JOIN a second time to the Product table to tie together each product with its group, subgroup, and attribute.

This query can then be the basis for all of the things you asked about in your question about queries, with judicious use of the SQL Aggregates and TOP predicates.
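
As a sketch, reusing the made-up names from earlier (tblSales, tblReps, tblProducts), the base query might look something like this, whether saved as a named query or opened from VBA:

Code:
' Base query joining sales to reps and to products; the per-report filters,
' TOP and aggregates get layered on top of this.
Dim db As DAO.Database
Dim rs As DAO.Recordset
Set db = CurrentDb

Set rs = db.OpenRecordset( _
    "SELECT r.RepName, p.GroupName, p.SubgroupName, p.AttributeName, " & _
    "s.SaleDate, s.Quantity " & _
    "FROM (tblSales AS s " & _
    "INNER JOIN tblReps AS r ON s.RepID = r.RepID) " & _
    "INNER JOIN tblProducts AS p ON s.ProductID = p.ProductID;")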

BE WARNED: You said you didn't mind if it would be slow. This WILL be slow. But with that mass of data, you don't have a choice. You are looking at storing roughly 225 million fields. 225 million of ANYTHING will take a while to handle.

There is also the issue of "blowing out" resources. Look up the topic "MaxLocksPerFile" on this site; it has some data on resource problems.
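
For what it's worth, the lock limit can also be raised for the current session from VBA; the number below is only an example (the registry default is 9,500):

Code:
' Raise the Jet record-lock limit for this session only
DAO.DBEngine.SetOption dbMaxLocksPerFile, 500000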
 
Last edited:

Mark_

Longboard on the internet
Local time
Today, 07:13
Joined
Sep 12, 2017
Messages
2,111
Let's say the application has these characteristics:

  • There are 800,000 salespersons each with unique ID (I).
  • The number of product groups being sold could be 6 to 12 (P).
  • Each product group could have up to 20 subgroups (S).
  • Each subgroup will have 12 attributes which is where sales quantities are given (A).
  • This sales data will be imported once a month.
  • A date for the monthly sales data will either be included or it can be generated.

This implies the following:
Each record in the "Sales" table would have:
SalesID - Autonumber
Salesperson - Reference to the Salesperson record, number
ProductCode - Reference to ProductTable record.
DtTransaction - DateTime that holds WHEN the transaction took place.

Your product table would hold your group/subgroup/attribute information. Once more, it looks like it will be an autonumber primary key, a descriptive "Name", then your group/subgroup/colour/what-have-you information.
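
A rough DDL sketch of those two tables (field names as above; the sizes and the exact product fields are guesses):

Code:
' Sketch of the Sales and Product tables described above
Dim db As DAO.Database
Set db = CurrentDb

db.Execute "CREATE TABLE tblProduct (" & _
           "ProductCode AUTOINCREMENT PRIMARY KEY, " & _
           "ProductName TEXT(50), " & _
           "GroupNo LONG, SubGroup TEXT(1), Attribute LONG);", dbFailOnError

db.Execute "CREATE TABLE tblSales (" & _
           "SalesID AUTOINCREMENT PRIMARY KEY, " & _
           "Salesperson LONG, " & _
           "ProductCode LONG, " & _
           "DtTransaction DATETIME);", dbFailOnError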

Does that help with starting to identify what should be in your table?
 
Last edited:

sumdumgai

Registered User.
Local time
Today, 10:13
Joined
Jul 19, 2007
Messages
453
Thanks for ideas and help with setting up the tables.


I've done some analysis of the incoming source .txt file. The 780,000 records hold about 350 million sales quantity buckets (Excel cells). Of these, about 3% are non-zero. That means, by importing only non-zero sales quantities, I'd create about 10.5 million Access rows, each with at least these four fields:
SalesID - Autonumber
Salesperson - Reference to the Salesperson record, number
ProductCode - Reference to ProductTable record.
DtTransaction


Also, this would happen every month, adding to the Sales table. My queries need to show total sales by salesperson(s) for product group(s), for the current month or for multiple months.


Would I be better off with a separate database for each product, a separate database for each month, or something else?


Thanks again to all who have helped.
 

Dreamweaver

Well-known member
Local time
Today, 15:13
Joined
Nov 28, 2005
Messages
2,466
I don't think this has been brought up: after reading a number of the comments I notice the 2 GB limit is quoted, but there is also the table-level limit of 1 GB, and I don't think that has changed with Office 365.
Over the years I have dealt with importing txt files of up to 5 million records, but in that case I added the data to the following tables:


tblRecordingArtists
tblArtistsAndRecords (this table has 2 long integer fields)
tblRecordings


There were a number of additional sub-tables, but they are not important here.
The recording artists ended up with 1.7 million entries, recordings with 800k, and the artists-and-records table with 5.5 million, which is 650 MB.


These tables have been spread across 8 back-ends, with the tables in each maintaining the normalization.


I know you said SQL Server couldn't be used, but by the sound of it I don't think you will have a choice, as the data is just too massive. Even importing and running a function as I described above could take hours; I created a VB program to do my importing and exporting to my web MySQL.


Good luck with your project
 

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Today, 09:13
Joined
Feb 28, 2001
Messages
27,159
Would I be better off with a separate database for each product, a separate database for each month, or something else?

Definitely not a separate database. Having extra files makes it possible to keep more data in Access back-end files, but every new BE file adds maintenance headaches. Further, you cannot spread a single table over multiple BE files.

each with at least these four fields:

Does it ever happen that someone sells TWO units in the same month? Or does that actually get reflected in the big ugly spreadsheet? If so, you have a fifth field, Quantity Sold, a number which will never be negative. If you take the stance of not storing the slots with zero quantity, then you simply make the dataset sparse.

The 780,000 records hold about 350 million sales quantity buckets (Excel cells). Of these, about 3% are non-zero.

Pardon the pun, but this is where Access excels. This is a perfect example of what we call "sparse" - i.e. lots of possibilities but few actualities. SQL Aggregates don't care that something is absent because a COUNT won't hiccup and neither will a SUM.

Your proposed 10.5 million records would be relatively short. Including my "count", you have:

SaleID: Autonumber PK (4 bytes)
RepID: Long FK (4 bytes)
ProdID: Long FK (4 bytes)
TheDate: Date (8 bytes)
TheCount: Long (4 bytes)

That's 24 bytes per record, about 250 MB for the table. I am not sure about the exact sizes of indexes, but since you want to query those things, each of those fields except the count would require an index. The PK needs one for structure, and date, rep, and product each need one. You could hold up to 3 months of data in that table before archiving would be required, and that would let you do something like quarterly summaries.

However, I'm not so sure that SQL Server or MySQL has the same limits on table size if you were to use one of those as a back-end. And if they don't, then your sparse data squishes down enough that you might get even more months in, eventually enough to do full-year trends. Given that you have 250 MB per month, that's 3 GB per year, so an Access BE isn't going to do what you want.

Splitting those yearly quantities across monthly BE files MIGHT be possible, but you would have to learn how to open and close BE files selectively, open up the tables with fully-qualified prefixes for each separately opened DB, and do your queries in a way that loads your summary data into a staging table before producing a long-term report. And you DO run into a limit on the number of simultaneously open DBs, which would tend to limit how much you could summarize at once. Which is why I mentioned a more dynamic approach of opening and closing stuff only when you need it.
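
For illustration only (the file names and paths are invented), one way to reach a table in a specific monthly BE file without a permanent link is Jet's IN clause, summarizing into a local staging table:

Code:
' Pull one month's totals from a separate back-end file into a local staging table
Dim db As DAO.Database
Set db = CurrentDb

db.Execute _
    "INSERT INTO tblStaging (ProductID, TotalQty) " & _
    "SELECT ProductID, Sum(Quantity) " & _
    "FROM tblSales IN 'C:\Data\Sales_2018_06.accdb' " & _
    "GROUP BY ProductID;", dbFailOnError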

My best advice is that the multiple-BE method would be tedious and very difficult to maintain, whereas some sort of active BE like SQL Server, MySQL, or some other high-capacity ODBC-capable product might be better given your volume of data. Which of those IS better? There, I must defer to my colleagues. The few instances of SQL Server that we had at the office were on a different project, so what I know about it is what I picked up comparing notes with the gang at my last job before I retired. But it worked well for what they were doing.

I looked up the SQL Server sizing specifications. Here is the reference:

https://docs.microsoft.com/en-us/sq...fications-for-sql-server?view=sql-server-2017

I can only tell you from personal experience that ORACLE Enterprise Server might have the capacity you want, because ORACLE allows you to split things across multiple files that it transparently manages for you (once you set them up). You can even manage that across multiple disks if you have some sort of storage array that acts as network-attached storage. Due to the protocols involved, Access "native" BE files don't work quite as well with network storage arrays, because that still depends on a server supporting the SMB protocol. Array managers tend not to like SMB from non-server systems reaching into their arrays. They want only their own servers doing that reaching. They get funny like that.
 
Last edited:
