Why I Cleaned A Fake Spreadsheet
Photo by Jukan Tateisi on Unsplash
(And What I Learned About My Process)
Before I launch my DataTabby data cleansing service, I decided it would be a great idea to go in and test a couple of things. One of them being: how long does it actually take me to clean a messy spreadsheet and document the work properly?
I had a fake dataset generated using ChatGPT. Random rows, random issues, random industry. What any frazzled business owner would pass along to someone they hired to clean their messy data.
It does raise the question: “Why generate just a simple spreadsheet?” The answer: the people I plan to help are small and medium businesses.
Most of the time, when I see posts on Medium, Substack and LinkedIn, they all talk about the big sexy stuff. The corporations, the companies with money that have ERP systems, large data sets with databases, cloud-based data solutions already in place. Companies that have multiple data analysts and database administrators.
But so many people forget that small businesses are out there, doing the best they can with very little. They don’t have the money to pay a data analyst or a database administrator full time to set up their data, or the knowledge to do it themselves. Many don’t even think of themselves as having “data”. They have names, email addresses, favorite recipes, an inventory. All of it running from Excel or Google Sheets, sometimes hastily written while trying to do 3 other things at the same time. So what I’m going to do is meet them where they are.
I grabbed my Excel sheet, pulled up the app Clockify, and I timed myself. This is what I learned in that process.
Photo by Mika Baumeister on Unsplash
First, because I had no idea what the data was going to contain, I looked over the .CSV file. A wide sweep allowed me to catch any glaring issues and write them down to review later.
I noticed right away that most of the columns had inconsistent data types. All four columns with names signifying currency were off: some values had a dollar sign, others had “USD” text in front. Some of the Service cells also contained words in different case, some capitalized, some not. While it sounds chaotic, it actually didn’t take that much to fix.
After that first sweep, I made a rule for myself before I touched anything:
I’m allowed to standardize how the data is represented. I’m not allowed to “correct” the business facts.
That distinction matters more than people think. Removing “USD ” from a number so Excel can treat it like a number is safe. Changing a tax calculation because it “looks wrong” is not safe unless the client explicitly confirms how their business works. So I kept my hands off anything that looked like it needed a business decision, and instead I flagged it for review.
Then I did my first pass, left-to-right, starting with the easiest parts first.
1) Formatting and data types (the “make it usable” step)
My goal here wasn’t to make anything pretty, it was to make the spreadsheet behave like data.
Currency fields: I removed the dollar signs, the “USD” text, and any commas so values became true numbers (not text pretending to be numbers).
Dates: I standardized the date formats so they were consistent and sortable.
Text fields: I cleaned obvious spacing issues and inconsistent casing (like Service names written three different ways).
This step is boring… but it’s where most spreadsheets fall apart. If Excel can’t reliably sort, filter, or calculate, it isn’t data yet.
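To make the idea concrete, here is a minimal Python sketch of the same “make it usable” step. The formats it handles (leading “$” or “USD”, commas, a few common date layouts) match what I found in my file; the function names and the list of date formats are my own choices, not a fixed recipe.

```python
import re
from datetime import datetime

def clean_currency(value):
    """Strip '$', 'USD', commas, and whitespace so the value parses as a real number."""
    cleaned = re.sub(r"(?i)usd|[$,]", "", str(value)).strip()
    return float(cleaned)

def clean_date(value, input_formats=("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y")):
    """Try a few common formats and normalize to ISO (YYYY-MM-DD) so dates sort correctly."""
    for fmt in input_formats:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")
```

In Excel I did the equivalent with Find & Replace and cell formatting, but the logic is the same: text pretending to be a number becomes a number, and every date ends up in one sortable format.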
Photo by Brett Jordan on Unsplash
Standardization
Next I started tightening the categorical fields — the ones that should only have a small set of valid values.
The big one here was Service. It was clearly the same handful of services written in slightly different ways (case differences, extra spaces, inconsistent wording). This is exactly what happens in real businesses: different people enter data differently, and no one agrees on a standard until it breaks reporting.
So instead of trying to remember every variation, I built a reference sheet.
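The core of that reference-sheet idea fits in a few lines of Python. The service names below are made up for illustration; the real approved list came from the data itself. Anything that doesn’t match gets flagged rather than guessed at.

```python
# Hypothetical approved list; the real one comes from the client's reference sheet.
APPROVED_SERVICES = ["Consulting", "Installation", "Maintenance", "Support"]

def normalize_key(value):
    """Trim, collapse internal whitespace, and lowercase so variants compare equal."""
    return " ".join(value.split()).casefold()

# Index approved values by their normalized form for quick lookup.
LOOKUP = {normalize_key(s): s for s in APPROVED_SERVICES}

def standardize_service(raw):
    """Map a messy entry to its approved spelling; None means flag it for review."""
    return LOOKUP.get(normalize_key(raw))
```

Note that a genuine typo (say, “Consluting”) falls through to None on purpose. That’s a business fact I’m not allowed to “correct” without confirmation.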
Data Validation and a Reference sheet
I created a separate tab (I called it something like “DataSheet”) that listed the approved values for key fields:
Service (standard list)
Status (Open, Paid, Past Due, Void)
State (two-letter abbreviations)
Then I applied data validation dropdowns to those columns. That way, if someone edits the file later, they can’t accidentally create a brand-new category just by typing “Paid ” with a space at the end.
This is the part small businesses don’t realize they can do. You don’t need an ERP system to prevent bad data entry — Excel can handle a lot of this with basic controls.
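The same check a dropdown enforces can also be run as an audit over existing rows. Here’s a sketch, assuming hypothetical column names and a made-up subset of approved values, that reports every cell falling outside its approved list, including the sneaky “Paid ” with a trailing space:

```python
# Hypothetical approved lists; the real ones live on the reference sheet.
APPROVED = {
    "Status": {"Open", "Paid", "Past Due", "Void"},
    "State": {"CA", "NY", "TX", "WA"},  # illustrative subset of two-letter codes
}

def find_invalid(rows, approved=APPROVED):
    """Return (row_index, column, value) for every cell outside its approved list."""
    problems = []
    for i, row in enumerate(rows):
        for column, allowed in approved.items():
            value = row.get(column, "")
            if value not in allowed:
                problems.append((i, column, value))
    return problems
```

Excel’s data validation stops these entries before they happen; a pass like this catches the ones that already snuck in.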
As I made changes, I documented them. Not super detailed, just clear notes on:
what I standardized
what I didn’t touch
and what needed the “client decision” stamp.
Because here’s the thing: data cleaning isn’t just a technical task. It’s trust work. If you can’t explain what changed, you can’t defend it, and the client can’t confidently use it.
Photo by Gabriella Clare Marino on Unsplash
What I Learned
Using Clockify, I timed myself from the moment I got the CSV to the end of my cleaning pass. The initial clean took me 2 hours, a little faster than I expected. I handle data cleansing tasks on the regular, so it’s nice to know that a dataset with that many rows, one I wasn’t used to, didn’t take me that long.
Data validation setup prevents future mess. I’ve taken this into account for other reports I’ve had to create, and it not only makes things easier, it makes things faster when there’s only one correct way to enter things. Everyone can know the standard.
There was a section that did not quite add up, literally: Subtotal, Tax, and Total. If this were a report in which I controlled the data, I would go in and adjust it. However, this is one of those circumstances you don’t think about until it happens. Do I fix the calculations or leave them be? I decided that going forward, it is in the best interest of myself and my client for me to not make any calculation changes, even if the data is right there. Instead, I mark the issue alongside any others like it, and send off an email.
Could I make the change? Of course. The data is right there: Unit Price, Subtotal, TaxRate, Tax, and Total are all columns with data. However, trust is important when you’re a consultant working on someone’s data, and I’d rather confirm they want the fix than do the work only to find out that Chris in accounting has his own crazy way of doing the numbers and now everything is off.
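Flagging instead of fixing can still be systematic. Here’s a sketch of the check I ran, assuming the column names from my file; it uses Decimal to avoid float rounding noise, reports every row where Subtotal + Tax disagrees with Total, and never touches the data.

```python
from decimal import Decimal

def flag_total_mismatches(rows, tolerance=Decimal("0.01")):
    """Flag rows where Subtotal + Tax != Total; report them, never overwrite them."""
    flagged = []
    for i, row in enumerate(rows):
        subtotal = Decimal(str(row["Subtotal"]))
        tax = Decimal(str(row["Tax"]))
        total = Decimal(str(row["Total"]))
        expected = subtotal + tax
        if abs(expected - total) > tolerance:
            flagged.append((i, expected, total))
    return flagged
```

The output of a pass like this is exactly what goes into the email to the client: here are the rows, here is what the math says, you tell me which number is right.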
Why This Matters
Doing this time and time again is good practice. It lets me understand how long it takes to review someone’s data, weighing the size of the dataset against how much work it needs at a glance. So when clients ask me “how long does it take?” I can tell them accurately.
Running practice data cleaning also builds an understanding of the types of data problems companies usually encounter. There are quite possibly common issues most small businesses share, shaped by how much data they process, what type of business they run, and even their geographical location.
Knowing this information allows for development of solutions: whether it’s something I build myself for data collection and organization, such as a Notion template, or a specialized existing tool that can help them get more organized, such as ClickUp.
Another reason for doing practice data cleaning runs is learning how to best document your findings. It’s one thing to just clean a dataset and hand it back; it’s another thing entirely to document what happened, present it in a way that’s understandable, and give it back to the client. This is key, because the client got into the situation through a lack of knowledge in the first place. More knowledge about why it happened, how to fix it should it happen again, and how to maintain consistency is always a good thing. Cleaning a dataset and handing it back quickly makes you good; setting them up for success afterwards makes you better.
It’s Your Turn
One of the best things about modern AI tools is that you can generate practice datasets for any industry. Want to learn data cleaning for restaurant data? Ask ChatGPT, Claude or Gemini to generate a messy inventory spreadsheet. You can ask for a specific problem to try to clean, or just have your favorite LLM make up problems for you to find.
Here’s the prompt I used:
”Hey ChatGPT make me a spreadsheet of data to practice cleaning. Make up a small company in any industry, and base the data in the spreadsheet around that industry. I want the spreadsheet to have less than 2000 lines, make the errors in it the same types of errors you’d normally see made by just a regular person who has to use the data every day. Also create a second worksheet of all of the problems you made in the spreadsheet, so I have a point of reference to review.”
Within seconds I had 1,000 rows of realistic-looking data with all the problems I’d see in real client work: duplicates, inconsistent city names, missing values, calculation errors. The best part about this is that you can customize the prompt. If you specialize in a specific industry, say that. Add problems you’d like to fix. Control the complexity. It’s up to you. Then time yourself. Make mistakes and learn what works. Now I have a process I can confidently offer to clients, and you can too.
This process goes beyond offering a service to clients. It’s also a way to keep your skills sharp, and to understand the true processes that users may have in place. It’s not just about showing off technical skills, it’s about being trustworthy, communicating, and setting up clients for success.
PS: ☕ If you liked this and want to support the journey, you can buy me a coffee — every little boost helps keep the pawprints moving forward.