SQL Hangs

A star schema might look like below, however I do not have the subject matter expertise to be sure about the placement of data in the tables (eg AdmissionID and UNSWStudentID (I took some liberty with some names)).

The dimension tables would generally contain every combination of the attributes logically possible, or at least that occur within your data set. Where an item can be attributed to a student in the 1131 course (a fact) - such as RAW ATAR score - then it should be placed there. I don't think it is applicable to the UAC table where you originally had it (for eg)

Note that the OtherEnrolledCourse table does not follow this pattern as there is an unknown number of other courses the student may have enrolled in. A query would need to run to obtain the course codes for OtherEnrolledCourses for say the first 10 sorted by course code or other filter.

UNSW Maths course would, on the basis you only are interested in Course 1131 holds only one record, which you could migrate into the UNSW_Maths table as four columns. It just gives a structure for analysis focussing on any course where the data has been extracted using the same rules.
Ignore/adjust the data type assignments
I hope this gives you a start to the design you need.

1738054127672.jpeg
 
Step 1: Identify and remove duplicates: in what sense are they duplicates?
See attached example, same line item repeated for the Student in the original Excel file.
 

Attachments

  • Duplication Example .png
    Duplication Example .png
    44.6 KB · Views: 10
A star schema might look like below, however I do not have the subject matter expertise to be sure about the placement of data in the tables (eg AdmissionID and UNSWStudentID (I took some liberty with some names)).

The dimension tables would generally contain every combination of the attributes logically possible, or at least that occur within your data set. Where an item can be attributed to a student in the 1131 course (a fact) - such as RAW ATAR score - then it should be placed there. I don't think it is applicable to the UAC table where you originally had it (for eg)

Note that the OtherEnrolledCourse table does not follow this pattern as there is an unknown number of other courses the student may have enrolled in. A query would need to run to obtain the course codes for OtherEnrolledCourses for say the first 10 sorted by course code or other filter.

UNSW Maths course would, on the basis you only are interested in Course 1131 holds only one record, which you could migrate into the UNSW_Maths table as four columns. It just gives a structure for analysis focussing on any course where the data has been extracted using the same rules.
Ignore/adjust the data type assignments
I hope this gives you a start to the design you need.

View attachment 118255
Thank you! I will use this to create the new database.
 
See attached example, same line item repeated for the Student in the original Excel file.
From the picture there are 6 duplicates of each of the 2 records shown, however the data set is much wider than shown. Are the columns shown the only ones of interest in determining a duplicate?

From within Excel:
To identify and retain a single record from a set of duplicates in a dataset, you can use data analysis tools like Excel's "Remove Duplicates" function, where you first select the columns to check for duplicates, then use the "Remove Duplicates" option to identify and remove all but one record from each duplicate set, effectively keeping only a single record per duplicate group.

Key steps:
  • Select the data range:
    Highlight the columns in your dataset that you want to check for duplicates.
    • In Excel: Go to the "Data" tab, then in Data Tools click "Remove Duplicates" icon.
  • Choose the columns to compare:
    Select the specific columns that should be used to identify duplicates.

  • Confirm the operation:
    Click "OK" to remove all duplicate records except for one from each duplicate group.
Important considerations:
  • Identifying the "correct" duplicate to keep:
    If you need to keep a specific record within a duplicate set based on additional criteria (like the latest date or highest value in another column), you might need to use advanced filtering or sorting options before removing duplicates.

  • Data cleaning before analysis:
    Always ensure your data is clean and consistent before identifying duplicates to avoid unintended results.
You might find your dataset is much smaller and amenable to manipulation in Access.
On import to Access assign an autonumber PK to each record.
 
From your data, it looks like you have 6 columns per record that cover "Semester". It looks like you can keep the enrolled term and delete the rest. Before doing this, make a list of the "Enrolled term" codes and have a separate list that tells you what each means. Cutting down the redundant data will help you focus on what you really need to work with, speed up processing, and reduce resource usage as you'll be dealing with much smaller files.
 

Users who are viewing this thread

Back
Top Bottom