Struggling with Data Cleaning in Python—Best Practices for Large Datasets?

Neeljy (new member, joined Feb 14, 2025)
Hey everyone,

I’ve been working with pandas to clean datasets, and while I can handle small ones easily, I run into trouble when dealing with large datasets (millions of rows). Operations that work fine on smaller sets, like .dropna(), .drop_duplicates(), and .apply(), seem to slow down significantly or even cause memory issues.

Some specific issues I’m facing (rough sketches of my current approach follow the list):

  • Handling missing values efficiently – should I always use .fillna(), or are there better ways to impute values at scale?
  • Removing duplicates in a large dataset – .drop_duplicates() sometimes takes forever; are there any optimizations?
  • Inconsistent formatting – text columns have mixed casing, extra spaces, and different date formats. Is there a more scalable way to standardize them without looping?
  • Outlier detection at scale – using .describe() and IQR works for small data, but what about large datasets?
  • Performance concerns – are Dask, Modin, or Polars worth using for large-scale data cleaning?
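
To make the first three bullets concrete, here's roughly what I'm doing today after moving off .apply() onto vectorized calls. The column names are invented, and the format="mixed" argument needs pandas 2.0 or newer:

    import pandas as pd

    # Toy frame standing in for my real data; the column names are made up
    df = pd.DataFrame({
        "name":   [" Alice ", "BOB", "alice", None],
        "signup": ["2025-01-03", "03/01/2025", "2025-02-14", "2025-02-14"],
        "amount": [10.0, None, 10.0, 25.5],
    })

    # Missing values: targeted per-column fills instead of one blanket .fillna()
    df["amount"] = df["amount"].fillna(df["amount"].median())

    # Text standardization without looping: the vectorized .str accessor
    df["name"] = df["name"].str.strip().str.lower()

    # Mixed date formats in one pass; anything unparseable becomes NaT
    df["signup"] = pd.to_datetime(df["signup"], format="mixed", errors="coerce")

    # Duplicates: normalize first, then dedupe on the key columns only
    df = df.drop_duplicates(subset=["name", "signup"])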
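
For the outlier bullet, the IQR rule itself vectorizes fine; this is the shape of what I have (same invented columns as above):

    # IQR-based outlier mask: two quantiles per column, then one boolean filter
    q1, q3 = df["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]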
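
And for the last bullet, this is the small Dask experiment I've tried so far. It's not a benchmark, just the shape of the API; the file name is a placeholder, and Polars has a similar lazy mode:

    import dask.dataframe as dd

    # Same pandas-style API, but the file is read and processed in partitions,
    # so the whole dataset never has to fit in memory at once
    ddf = dd.read_csv("big_file.csv")
    ddf = ddf.drop_duplicates(subset=["name", "signup"])
    ddf["name"] = ddf["name"].str.strip().str.lower()
    result = ddf.compute()  # materializes only at the very end
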
I’d love to hear real-world experiences from others handling big data.

What are your best practices for cleaning and preprocessing large datasets? Any tips on optimizing performance without running out of memory?
 
Welcome to Access World! We're so happy to have you join us as a member of our community. As the most active Microsoft Access discussion forum on the internet, with posts dating back more than 20 years, we have a wealth of knowledge and experience to share with you.

We're a friendly and helpful community, so don't hesitate to ask any questions you have or share your own experiences with Access. We're here to support you and help you get the most out of this powerful database program.

To get started, we recommend reading the post linked below. It contains important information for all new users of the forum:

https://www.access-programmers.co.uk/forums/threads/new-member-read-me-first.223250/

We hope you have a great time participating in the discussion and learning from other Access enthusiasts. We look forward to having you around!
 
Hello and welcome to the forums.

We like to keep actual technical questions separate from the introduction forum. I am going to move your post to one of the technical forums.

Since this is your first post with us, there is no way you could have known our preferences. This is done without penalty or any hint of chastisement.
 
Not to mention vile contumely and inglorious invective.
 
I've worked with Python a little bit, my son a lot, and I've definitely worked with SQL. My advice is that Python is going to be a lot slower than many other languages, and you might consider cleaning the data (or cleaning it to some degree) using a SQL database rather than Python (which, for all you know, may be looping behind the scenes anyway).

I've never heard of Dask, Modin, or Polars, so I can't comment on those, except to say that a new scripting language comes out just about every time you turn around, and being new doesn't make them good. New scripting languages are almost like new crypto coins: everyone wants to make one.

I'd try good old SQL for your data cleansing. You can install SQL Server Express for free, and with the included Data Tools you can use SSIS packages to transform and channel data in different directions, or use a SQL query to cleanse a large dataset at a time with all sorts of conditional statements and replacements.
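
To make the idea concrete, here's a rough illustration using SQLite from Python, purely so it's self-contained. The table and column names are invented; the string functions look much the same in SQL Server, though the duplicate removal there would typically use ROW_NUMBER() instead of SQLite's rowid:

    import sqlite3

    conn = sqlite3.connect("cleaning_demo.db")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS raw_data (name TEXT, signup TEXT, amount REAL)")

    # Set-based cleansing: each statement touches the whole table at once,
    # with no row-by-row looping in the client
    cur.execute("UPDATE raw_data SET name = LOWER(TRIM(name))")
    cur.execute("UPDATE raw_data SET amount = 0.0 WHERE amount IS NULL")
    cur.execute("""
        DELETE FROM raw_data
        WHERE rowid NOT IN (SELECT MIN(rowid) FROM raw_data GROUP BY name, signup)
    """)
    conn.commit()
    conn.close()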

The newest stuff is not necessarily the best, nor even likely to be.
 
