Day 3: How Data Engineering Stops Bad Data Before It Becomes Believable
Most data systems do not fail because something crashes.
They fail because something incorrect is allowed to pass quietly.
The pipeline runs. The job succeeds. Tables populate. Dashboards refresh. Nothing looks broken. And yet, somewhere along the way, the data stops representing reality.
By the time someone feels that something is off, the data has already influenced decisions.
That is the failure mode data engineering exists to prevent.
Day 1 of this series explained why data engineering matters at all.
Day 2 showed how raw data is shaped into structured form.
Day 3 focuses on the moment where a system decides what it will accept as truth.
Why Validation Is the First Real Line of Defense
Without validation, a data pipeline is only transportation.
It moves information from one place to another. It does not evaluate it. It does not question it. It does not protect meaning.
Many pipelines look healthy on the surface. They run on schedule. They complete without errors. Monitoring shows green lights everywhere.
But those systems have never been taught how to reject data.
They accept everything.
Data engineering begins when a system is given rules and is expected to enforce them consistently.
What Validation Actually Does
Validation answers a simple question.
Does this record represent something that is allowed to exist in our system?
That question sounds obvious. It rarely is.
In real systems, data arrives from applications, services, third parties, and sensors. Each source has failure modes. Network issues. Partial writes. Logic bugs. User behavior that was never anticipated.
Validation is the boundary where chaos meets intent.
The Raw Data We Receive
We continue with the same example. A daily signup file arriving from an application.
Raw input file
raw_signups.csv
user_id,signup_time,source,age
201,2026-01-11 09:10:11,google,29
202,2026-01-11 09:40:00,facebook,
203,2026-01-11 09:55:42,google,17
204,invalid_time,twitter,25
205,2026-01-11 10:30:00,,42
This file is not clean. That is expected.
What matters is that it is preserved exactly as received.
Raw data is evidence. It should never be rewritten or silently fixed.
Many systems skip this step. They clean data in place. When something goes wrong, there is nothing left to inspect.
Good data engineering always keeps the original version.
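A minimal way to honor that principle is to land the file somewhere it will never be modified before any processing begins. The sketch below assumes local files; the landing directory layout and naming are assumptions for illustration, not part of the original pipeline.

import shutil
from datetime import date
from pathlib import Path

# Hypothetical landing area, partitioned by arrival date.
landing_dir = Path("landing") / str(date.today())
landing_dir.mkdir(parents=True, exist_ok=True)

# Copy the file exactly as received. No parsing, no cleaning, no renaming of columns.
shutil.copy("raw_signups.csv", landing_dir / "raw_signups.csv")

Everything downstream reads from this copy. If a question comes up weeks later, the original evidence is still there.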
Understanding the Problems Hidden Inside
At a glance, the file looks reasonable.
But look closely.
One record has no age.
One record has an age that violates business rules.
One record has an invalid timestamp.
One record is missing a source entirely.
Analytics tools would still process this. SQL engines would not complain. Dashboards would still render numbers.
The danger is not obvious failure. The danger is believable wrongness.
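Problems like these are easy to confirm with a quick inspection pass before any rules are formalized. The snippet below is a sketch for exploring the file, not part of the pipeline itself; the checks simply mirror the issues listed above.

import csv
from datetime import datetime

with open("raw_signups.csv", "r") as file:
    for row in csv.DictReader(file):
        problems = []
        if not row["age"]:
            problems.append("missing age")
        if not row["source"]:
            problems.append("missing source")
        try:
            datetime.strptime(row["signup_time"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            problems.append("bad timestamp")
        if problems:
            print(row["user_id"], ", ".join(problems))

Four of the five records print at least one problem. Nothing about the file announces that on its own.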
Deciding What Valid Means
Before writing code, rules must exist.
This step is often skipped. Teams rush to implementation. Validation logic becomes scattered and inconsistent.
Instead, rules should be explicit and simple.
For this system, the rules are:
- user_id must exist
- signup_time must be a valid timestamp
- source must not be empty
- age must be present
- age must be greater than or equal to 18
These rules encode how the business defines a valid signup.
They are not technical preferences. They are definitions of reality.
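Written down, the same rules can also live in code as a small checklist of named checks, which makes drift between documentation and implementation easier to spot. This is only a sketch; the names and structure are illustrative, and the next section enforces the same rules inline.

from datetime import datetime

def has_valid_timestamp(row):
    try:
        datetime.strptime(row["signup_time"], "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False

# Each rule is a business statement paired with a predicate over a raw row.
RULES = [
    ("user_id must exist", lambda row: bool(row["user_id"])),
    ("signup_time must be a valid timestamp", has_valid_timestamp),
    ("source must not be empty", lambda row: bool(row["source"])),
    ("age must be present", lambda row: bool(row["age"])),
    ("age must be >= 18", lambda row: row["age"].isdigit() and int(row["age"]) >= 18),
]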
Writing the Validation Logic
Now we enforce these rules in code.
import csv
from datetime import datetime

valid_rows = []
invalid_rows = []

with open("raw_signups.csv", "r") as file:
    reader = csv.DictReader(file)

    for row in reader:
        # Rule: user_id must exist.
        if not row["user_id"]:
            invalid_rows.append(row)
            continue

        # Rule: signup_time must be a valid timestamp.
        try:
            signup_time = datetime.strptime(
                row["signup_time"], "%Y-%m-%d %H:%M:%S"
            )
        except ValueError:
            invalid_rows.append(row)
            continue

        # Rule: source must not be empty.
        if not row["source"]:
            invalid_rows.append(row)
            continue

        # Rule: age must be present and numeric.
        if not row["age"] or not row["age"].isdigit():
            invalid_rows.append(row)
            continue

        # Rule: age must be greater than or equal to 18.
        if int(row["age"]) < 18:
            invalid_rows.append(row)
            continue

        valid_rows.append({
            "user_id": row["user_id"],
            "signup_date": signup_time.date(),
            "source": row["source"],
            "age": int(row["age"])
        })
This code is intentionally plain.
Validation code should be readable by anyone on the team. If validation logic becomes clever, it becomes dangerous.
Every rejection path is explicit. Every assumption is visible.
What Happens After Validation
Validation always produces two outputs.
One dataset that passed.
One dataset that did not.
Both are important.
Valid records
valid_signups.csv
user_id,signup_date,source,age
201,2026-01-11,google,29
Only one record survives.
That feels uncomfortable at first.
But comfort is not the goal. Accuracy is.
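Writing the valid rows out mirrors how the invalid rows are written below. This is a sketch that continues the validation script above; the file name matches the output shown.

import csv

with open("valid_signups.csv", "w", newline="") as file:
    writer = csv.DictWriter(
        file,
        fieldnames=["user_id", "signup_date", "source", "age"]
    )
    writer.writeheader()
    writer.writerows(valid_rows)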
Invalid records
with open("invalid_signups.csv", "w", newline="") as file:
writer = csv.DictWriter(
file,
fieldnames=["user_id", "signup_time", "source", "age"]
)
writer.writeheader()
writer.writerows(invalid_rows)invalid_signups.csv
user_id,signup_time,source,age
202,2026-01-11 09:40:00,facebook,
203,2026-01-11 09:55:42,google,17
204,invalid_time,twitter,25
205,2026-01-11 10:30:00,,42
This file is not a failure log.
It is a diagnostic artifact.
It allows engineers to answer questions later.
Why did numbers change?
Why was a record excluded?
What assumptions no longer hold?
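Those answers come faster when each rejected row carries the reason it was rejected, not just the fact that it was. A minimal variant of the rejection path is sketched below; the reject helper, the reason column, and the separate output file name are illustrative additions, not part of the original script.

import csv

invalid_rows_with_reason = []

def reject(row, reason):
    # Keep the original values exactly as received and attach the rule that failed.
    rejected = dict(row)
    rejected["reason"] = reason
    invalid_rows_with_reason.append(rejected)

# Inside the validation loop, each rejection path names its rule, for example:
#     reject(row, "invalid signup_time")
#     reject(row, "empty source")
#     reject(row, "age below 18")

with open("invalid_signups_with_reason.csv", "w", newline="") as file:
    writer = csv.DictWriter(
        file,
        fieldnames=["user_id", "signup_time", "source", "age", "reason"]
    )
    writer.writeheader()
    writer.writerows(invalid_rows_with_reason)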
Why Rejecting Data Feels Risky
Teams often resist strict validation.
They worry about losing information. They worry about explaining discrepancies. They worry about breaking reports.
Those concerns are understandable.
But accepting broken data shifts the risk downstream. It hides problems until they become harder to diagnose.
Validation surfaces issues early, when context still exists.
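One practical way to manage the fear of losing data is to measure how much is being lost. The guard below is a sketch continuing the script above; the 20 percent threshold is arbitrary and would be set by each team.

total = len(valid_rows) + len(invalid_rows)
rejection_rate = len(invalid_rows) / total if total else 0.0

# Fail loudly instead of silently publishing a report built on a shrunken dataset.
if rejection_rate > 0.20:
    raise RuntimeError(
        f"Rejection rate {rejection_rate:.0%} exceeds threshold; "
        "review invalid_signups.csv before publishing the report."
    )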
Aggregation After Validation
Only validated data is allowed into analytics.
from collections import defaultdict

daily_counts = defaultdict(int)

for row in valid_rows:
    key = (row["signup_date"], row["source"])
    daily_counts[key] += 1

Notice what is missing.
There are no checks here. No defensive coding. No fallback logic.
That is intentional.
Aggregation logic should assume correctness. Validation already did the hard work.
Producing the Final Report
with open("validated_daily_report.csv", "w", newline="") as file:
writer = csv.writer(file)
writer.writerow(["signup_date", "source", "signups"])
for (date, source), count in daily_counts.items():
writer.writerow([date, source, count])Output
signup_date,source,signups
2026-01-11,google,1
Without validation, the report would show five signups.
With validation, it shows one.
Both numbers are technically valid.
Only one is defensible.
What Changed Conceptually
Nothing about the infrastructure changed.
No new tools. No frameworks. No services.
What changed was intent.
The system learned how to reject input that violates reality.
That is the essence of data engineering.
How This Scales
In production systems, CSV files become streams.
Python scripts become distributed jobs.
Local files become warehouses.
The logic stays the same.
- raw data is immutable
- validation is explicit
- rejected data is visible
- analytics only sees defended records
Systems that follow this pattern survive growth. Systems that skip it accumulate invisible debt.
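A concrete way to keep the logic identical across environments is to isolate it as a pure function over a single record. The sketch below encodes the same rules; the function name is an illustrative choice. The same function can then run in a local loop, a streaming consumer, or a distributed map without rewriting the rules.

from datetime import datetime

def validate_signup(row):
    """Return (is_valid, cleaned_row_or_None) for one raw signup record."""
    if not row.get("user_id"):
        return False, None
    try:
        signup_time = datetime.strptime(row["signup_time"], "%Y-%m-%d %H:%M:%S")
    except (KeyError, ValueError):
        return False, None
    if not row.get("source"):
        return False, None
    age = row.get("age")
    if not age or not age.isdigit() or int(age) < 18:
        return False, None
    return True, {
        "user_id": row["user_id"],
        "signup_date": signup_time.date(),
        "source": row["source"],
        "age": int(age),
    }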
Where Validation Stops Working
Validation protects structure.
It does not protect meaning forever.
Fields change. Schemas evolve. New values appear. Old assumptions break silently.
A record can be valid and still wrong.
That is the next problem.
What Comes Next
Day 4 will focus on schema changes and how pipelines fail even when validation exists.
We will break this same pipeline by changing field definitions and observe what happens.
Because stopping bad data is only the first step.
Understanding change is the real challenge.
Stay Ahead of What Changes Quietly
If this article introduced a sense of unease, that response is useful. In data systems, discomfort often appears before failure becomes measurable.
Pipelines rarely break because they stop running. They break because inputs change while assumptions remain the same. Invalid records, shifted formats, and silent edge cases continue to flow without triggering errors.
Day 4 will focus on these changes. We will look at schema evolution, backward compatibility, and practical techniques for detecting structural drift before it affects analytics or downstream consumers.
Reliable systems are not built by reacting to outages.
They are built by making change visible and enforcing intent at the right boundaries.
