Importing data from CSV

jaryszek · Jul 15, 2019

Hi Guys,

i have one big 16 fields table which is imported from csv without any primary key.

How to standardize data?

How the ideal process should look like (if i do not have PK in source files).

I should have multiple tables and load them to access ? And check if records are deleted or added or have to update them?

Best Wishes,
Jacek

CJ_London · Jul 15, 2019

How to standardize data?

I presume you mean how to normalise data?

you need to analysis the data to determine what can be grouped together - for example if this is invoice data, possibly customer, invoice header, product, product price history, invoice line.

then create the appropriate tables with autonumber PK and relationships

then in turn, import relevant data to each table in order of 'parentage' - e.g. per above.

to do this the easiest way is either to import the csv to a temporary table or link to in.

1. import customers
2. when importing invoice headers you query should link to the customers table on (say) name to lookup the PK to populated the FK
3. import products
4. when importing product prices your query should link to the products table on (say) name to lookup the PK to populated the FK - and for a date, use the earliest date in the file for that price/product
5. when importing invoice line lookup invoice number PK for FK from invoice header, productFK from product table etc

jaryszek · Jul 15, 2019

thank you CJ_London.

Hmm what do you think by lookup for specific row?

1. import customers

you mean run insert into queries from link table to access temp Table? Where i can check which records where deleted or added?

Jacek

CJ_London · Jul 15, 2019

you mean run insert into queries from link table to access temp Table?

no - to your new tables, not a temp table

You need to be clear what deleted means - my usual preference is to not delete the record but to have a 'deleted on' date field. All depends what your data represents - a complete download from some system or other so if a record exists in your table but not in your file then it is 'deleted'. On the other hand if it is an 'update' file, the fact a record is not in the file just means there is nothing to update.

usual basis for each table is to run a number of queries

records in both tables where data has changed - run an inner join update query
or if you need to maintain history an inner join insert query
record in file, not a table - run a distinct left join insert query
record in table, not in file - run a distinct right join update query (or delete query)

where records in both tables where data is the same - no change required but you may want to update a field with that fact

Always worth having some date fields in the tables so you know when the record was created, changed, deleted etc

Hmm what do you think by lookup for specific row?

not sure what you mean by this - your post implies the csv is already in a table format so don't see the point of reading the file row by row (if this is what you mean). It might be faster overall, it might not. But running a number of 'long' queries over the dataset will almost certainly be faster than running those same queries on a row by row basis. You would need to experiment.

jaryszek · Jul 15, 2019

ok thank you very much for nice explanation.

I will not close the topic because i have to rethink whole idea.

Best,
Jacek

Importing data from CSV

jaryszek

Registered User.

CJ_London

Super Moderator

jaryszek

Registered User.

CJ_London

Super Moderator

jaryszek

Registered User.

Similar threads

Users who are viewing this thread