Multiple Linear Regression on a table's data

KKilfoil

Registered User.
Local time
Today, 04:28
Joined
Jul 19, 2001
Messages
336
I need to be able to perform a regression analysis on data in an access 2002 db for three independant and one dependent variables (i.e, the value X depends on three variables A, B, and C, and possibly the inter-relationships between A, B and C.

Access seems to have no good native tools to do this.

Can anyone recommend a good way to do this, hopefully that does not require the purchase of an add-in. This will be for commercial use.
 
I would have thought that Excel would be the tool to use not Access.
 
The data is hourly readings over 10 years, so I've got 87000+ records at present, and the number will grow over time as more readings are captured.

Excel has issues when you get past 64K rows/records.

In addition, the rest of the application I am developing exists as an Access db.
 
Do you know the formulas for the kind of regression you want? Because Access has everything you need, if you would only look at it the right way. A simple Summation query with a couple of formula summations will give you EVERY SUM YOU NEED to do a linear regression.

For linear regressions, you need a count of observations (COUNT) plus sums of your X & Y (independent and dependent) variables as follows: Sum(X), Sum(Y), Sum(X*Y), Sum(X^2). To generate the standard error of the estimate (a measure of data quality) or the Pierson Product-Moment Correlation Coefficient (another popular statistical measurement), you need to take only one more sum: Sum(Y^2). That's it. That's all you need.

EXCEPTION: If you want quadratic regression or something that involves a change of coordinate system, you either need a couple of extra sums or you need to do a transformation. For instance, if one of the things you are measuring is logarithmic rather than linear, you need to take the sum of the logarithm, not the direct sum. Then, once you have your numbers computed, you can perform a reversed transformation when appropriate. Same concept for trigonometry or other types of totally bizarre systems, but I doubt you are going there so I won't belabor the point.

The formulas for regressions are published and easily available in textbooks or references all over the place. Once you have the summation query, you are so close you can taste it. Because AT MOST ONE MORE QUERY (based on the single-record summation query) is required to compute the values you need. And OF COURSE you can nest queries, so the final query can be based on the summation query, which is based on your raw observation table (OR a query that offers some type of selectivity from your massive table.)
 
The_Doc_Man:

I should provide more detail:

I do know how to do a one variable regression in access on a set of x,y data to get the best fit for c1 and b in the following (using a bit of code):

y = c1 * x + b

However, my data is actually of the form x1, x2, x3, y
(i.e. three independant and one dependant variables).

What I am trying to find is the best fit for c1, c2, c3, c12, c13, c23 and b for the following:

y = c1*x1 + c2*x2 + c3*x3 + c12*x1*x2 + c13*x1*x3 + c23*x2*x3 + b

where I am pretty sure c1, c2, and c3 and b have significance, and I am not yet sure about c12, c13, and c23.

I am looking for either a bit of vba code that will do this, or a source algorithm that I can turn into vba.
 
I don't have any references handy, but there is an old algorithm called SIMPLEX OPTIMIZATION that might do the trick for you, if you don't mind waiting a while for it to run. Look it up in textbooks no more recent than fifteen years ago. Maybe older. Or try the internet. (Be aware that any internet search on SIMPLEX will find SIMPLEX Corp. as a maker of labor punch-clocks. So specify & filter your search carefully.)

The key to an exact solution is to partially differentiate the equation with respect to each independent variable, generate several differential equations, re-integrate them (writing the integrals as sums) and then solve for each of your C-sub-nm values as a problem of multiple equations in multiple unknowns.

To be thorough, you need one more variable in your mix...

c123*x1*x2*x3 - the three-way interaction case. You only had listed the uni-factor and bi-factor cases. You should also, for mathematical completeness, test for the tri-factor case. Otherwise you'll never know if it contributes.

Now, I'm going to bet you don't want to go that far, so I'll make a quick-and-dirty suggestion for you as a simpler pre-screening process that might give you a usable, if not perfect, answer. Or will at least let you know how far you have to go.

You have alluded to having a lot of data. This helps a LOT. Organize your data in such a way that you can isolate various factors. Like, find all cases where x2 and x3 are the same across multiple samples, then run the linear regression for the set of samples having variation of x1 only. Then do the same for cases where x1 and x3 are the same, then ... and so on. Maybe you can break it down into several sub-isolations and compute the coefficients for each.

Now, go back using the isolated coefficients. Temporarily ASSUME that the 2-way and 3-way interaction coefficients are zero. Compute the residuals (variances) between your observed results and the results you would compute using only the uni-factor coefficients.

If your residual/variance is small enough, your assumption was valid and your problem is solved. If not, then you know you have to do it the precise but terribly ugly way.
 
The_doc_man:

Sorry for the belated reply, but I was crunching on my problem....

Your pointer to the simplex optimization was very useful. As it turned out, my data didn't follow linear behavior, and was only accurate when the actual data selected to 'regress' was 'near' the predicted values for x1, x2 and x3.

In the end, I wrote a code module in access (DAO) to get the parameters using the Improved Simplex Optimization algorithm, with data chosen to be 'near' the predicted values for x1,x2,x3 (as per your suggestion)

The best curvefit model turned out to be:
y = a1 + b1*x1 + b2*x1^2 + c1*x2 + c2*x2^2 + d1*x3 + e11*x1*x2 + e13*x1*x3
(or eight parameters)
 
Glad I could help. Egad, 8 parameters! But at least you found out what you needed to know. And Simplex really doesn't "care" HOW many parameters you have as long as you have enough data to support the operation.

Simplex is such a pain in the toches to set up, but when it works, it is a real hum-dinger.
 

Users who are viewing this thread

Back
Top Bottom