How To Intentionally Dirty Up Data | Download Real-World Data

When it comes to data analysis, the availability of data on the web is quite vast, as such there is no lack of it for projects and learning purposes. However, the issue arises when it is too clean, too uniform, and does not reflect the reality of a job for example, where the database has the dirtiest data you’ve ever seen!

Real-world data is messy!

Working with clean datasets can be unrealistic for practice purposes because it does not teach the most time-consuming exercise, which is data cleaning.

This can be achieved through various methods, from mapping tables, Power Query ETL pipelines, or even directly from SQL. It all depends on your needs and the time you have at hand, but let’s not forget about Excel’s quick Find&Replace tool!

Lastly, in this article, I will guide you through the steps to add common types of errors and imperfections to a clean dataset.

Let’s turn a clean dataset into a dirty one

Clean Dataset	Download Here
Dirty Dataset	Download Here

For this exercise, we will be using a fictitious bike sales dataset. This will serve as the base for our purposes of adding imperfections.

Firstly, it is worth us understanding the data we have at hand, by its columns:

Sale_ID – unique identifier of an event triggered upon a sale
Date – date identifier of the sale event
Customer_ID – unique identifier of the precise customer that triggered the SALE_ID event
Bike_Model – exact model of the bike sold
Price & Quantity – self-explanatory
Store_Location – precise geolocation of the store in said event
Salesperson_ID – unique identifier of the salesperson that sold the product
Payment_Method – means of payment
Customer_Age & Customer_Gender – demographics

For randomizing our error-making, we will be using something called “a helper column”. We will insert a formula to help us randomize the error-making (so it is close to a real-life scenario)

Adding Common Data Entry Errors -> Typographical Errors and Inconsistent Naming

By using =RAND() we will generate a number between 0 and 1. Combining it with an IF statement such as below, we ensure approximately 15% of the records will be available for mutations.

=IF(RAND()<0.15,"Change Value","Not Applicable")

Then simply filter for “Change Value”, you might notice that the values change, that is because RAND() executes on any action (as the calculation option is set to Automatic), we can ignore it.

Make sure to select the column (so changes only affect what is visually present), then use CTRL+F to open the Find and Replace tool. Get creative! Replace values with small text errors, for example:

FIND	REPLACE
Mountain Bike	Mountan Biek
Hybrid Bike	Hibrid Bike
Road Bike	Rodao Bike
Cruiser	Crusair
Folding Bike	Floding Bik

For inconsistent naming, filter the Helper Column for Change Value and Store_Location as below, following the changes:

FROM THIS	TO THIS
Chicago	Chikago
Houston	Huston
New York	N.Y. or NYC
Philadelphia	Philly
San Antonio	S.A.

These changes create challenges upon data aggregation, a quick example can be seen in the pivot below

Introducing Missing Values

Delete data from random cells in key columns like Bike_Model, Store_Location,
Customer_Gender.
You can either create a column using =IF(RAND()<0.1, ” “, E2) to randomly clear 10% of the column’s E data or use the Helper Column to filter and delete random chunks.
Issues like missing values will always destroy the usability of the data, involuntarily creating outliers, values that don’t make sense without context, etc.

Date Formatting Issues

Change data formats across different cells (i.e., some in “dd/mm/yyyy”, and others in “mm/dd/yyyy”).
Simply use the Helper Column, select column B, and press CTRL+1, then head over to Custom and try the following formats for each filtering round:
- mm/dd/yyyy
- mm/d/yyyy
- yyyy/mm/dd
- mm/yy/d

Number Formatting Issues

Change a few cells to different number formats (currency, percentage, text etc.)
This will cause issues on aggregations and formulas, especially where values are treated as text instead of numbers.

Duplicating and Merging Errors

Add duplicates by copying and pasting rows
Duplicates can and surely will skew analysis as they offer an unrealistic view of the dataset
Merge cells in places where data should be in a single column. You might also find when receiving data from SQL, headers will be merged.
Filtering and sorting become difficult when there are merged cells, as it completely disrupts the flow by which Excel works
Merged cells also disrupt how LOOKUP formulas work (for XLOOKUP, pulls all merged columns instead of desired field)

Adding Outliers and Extreme Values

Randomly Add Outliers
- Use =G2*100 to multiply random values in numeric columns, such as Price, Quantity, and Customer_Age with the help of the Helper Column.
- Alternatively, we can randomize which cells become outliers, using =IF(RAND()<0.1, G2*100, G2). This formula randomly increases around 10% of the values by a factor of 100.
- Outliers are values significantly higher or lower than the rest of the data. For example, a few extremely high sales amounts can increase the average, leading you to believe that overall sales are better than they actually are.

Negative Values in Unexpected Places
- Add negative values where they don’t belong, such as Price or Quantity.
- Manually enter negative numbers in a few rows, or use a formula like =IF(RAND()<0.08,-ABS(I2),I2). This formula randomly transforms 8% of values into negatives.
- Negative values can be a red flag during analysis as they can indicate refunds, returns, or cancellations, but when added intentionally they will skew reporting and total figures.

Adding Extra Spaces

Add trailing spaces to text data, which can be hard to detect visually. You can use a formula like =E2&” “. Try and get creative!
Extra spaces can interfere with sorting, filtering, and matching functions. See for example the below pivot, where it cannot aggregate due to additional spaces.

Conclusion

We’ve added different types of errors – outliers, text-numeric mismatches, negative values, missing values, etc. and how to produce those at scale using the simple IF() and RAND() functions.

As real-world data often contains these errors, recognizing these issues is critical in knowing how to spot and correct them in order to produce accurate analysis. Excel offers built-in functions like CLEAN(), TRIM(), IFERROR(), and Conditional Formatting to detect and fix common issues. But let’s not forget about the king of the tools, Power Query!

1 thought on “How To Intentionally Dirty Up Data”

Leave a Comment Cancel reply

Let's talk