Data Validation In Data Engineering Explained

Day 3: How Data Engineering Stops Bad Data Before It Becomes Believable

Most data systems do not fail because something crashes.

They fail because something incorrect is allowed to pass quietly.

The pipeline runs. The job succeeds. Tables populate. Dashboards refresh. Nothing looks broken. And yet, somewhere along the way, the data stops representing reality.

By the time someone feels that something is off, the data has already influenced decisions.

That is the failure mode data engineering exists to prevent.

Day 1 of this series explained why data engineering matters at all.

Day 2 showed how raw data is shaped into structured form.

Day 3 focuses on the moment where a system decides what it will accept as truth.

Why Validation Is the First Real Line of Defense

Without validation, a data pipeline is only transportation.

It moves information from one place to another. It does not evaluate it. It does not question it. It does not protect meaning.

Many pipelines look healthy on the surface. They run on schedule. They complete without errors. Monitoring shows green lights everywhere.

But those systems have never been taught how to reject data.

They accept everything.

Data engineering begins when a system is given rules and is expected to enforce them consistently.

What Validation Actually Does

Validation answers a simple question.

Does this record represent something that is allowed to exist in our system?

That question sounds obvious. It rarely is.

In real systems, data arrives from applications, services, third parties, and sensors. Each source has failure modes. Network issues. Partial writes. Logic bugs. User behavior that was never anticipated.

Validation is the boundary where chaos meets intent.

The Raw Data We Receive

We continue with the same example. A daily signup file arriving from an application.

Raw input file

raw_signups.csv

user_id,signup_time,source,age
201,2026-01-11 09:10:11,google,29
202,2026-01-11 09:40:00,facebook,
203,2026-01-11 09:55:42,google,17
204,invalid_time,twitter,25
205,2026-01-11 10:30:00,,42

This file is not clean. That is expected.

What matters is that it is preserved exactly as received.

Raw data is evidence. It should never be rewritten or silently fixed.

Many systems skip this step. They clean data in place. When something goes wrong, there is nothing left to inspect.

Good data engineering always keeps the original version.
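
A minimal sketch of what keeping the original can look like, assuming a local raw/ archive directory; the paths and naming convention here are illustrative, not a required layout.

import shutil
from datetime import date
from pathlib import Path

# Archive under a dated folder, e.g. raw/2026-01-11/raw_signups.csv
archive_dir = Path("raw") / str(date.today())
archive_dir.mkdir(parents=True, exist_ok=True)

# Copy, never move or rewrite: the received file stays byte-for-byte intact.
shutil.copy2("raw_signups.csv", archive_dir / "raw_signups.csv")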

Understanding the Problems Hidden Inside

At a glance, the file looks reasonable.

But look closely.

One record has no age.
One record has an age that violates business rules.
One record has an invalid timestamp.
One record is missing a source entirely.

Analytics tools would still process this. SQL engines would not complain. Dashboards would still render numbers.

The danger is not obvious failure. The danger is believable wrongness.
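
To see how believable that wrongness is, consider a naive count over the raw file. A small sketch, assuming the raw_signups.csv shown above: it runs, it returns a number, and nothing flags that the number is wrong.

import csv

with open("raw_signups.csv") as file:
    naive_signup_count = sum(1 for _ in csv.DictReader(file))

print(naive_signup_count)  # prints 5, even though only one record is truly valid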

Deciding What Valid Means

Before writing code, rules must exist.

This step is often skipped. Teams rush to implementation. Validation logic becomes scattered and inconsistent.

Instead, rules should be explicit and simple.

For this system, the rules are:

  • user_id must exist
  • signup_time must be a valid timestamp
  • source must not be empty
  • age must be present
  • age must be greater than or equal to 18

These rules encode how the business defines a valid signup.

They are not technical preferences. They are definitions of reality.
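
One illustrative way to keep those rules explicit is to hold them in a single structure of named checks. This is a sketch, not the pipeline's required shape; the names are hypothetical, and the next section enforces the same rules directly.

from datetime import datetime

def parses_as_timestamp(value):
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False

# Each rule is a readable name plus a check over one raw row.
SIGNUP_RULES = [
    ("user_id must exist",        lambda row: bool(row["user_id"])),
    ("signup_time must be valid", lambda row: parses_as_timestamp(row["signup_time"])),
    ("source must not be empty",  lambda row: bool(row["source"])),
    ("age must be present",       lambda row: bool(row["age"])),
    ("age must be at least 18",   lambda row: row["age"].isdigit() and int(row["age"]) >= 18),
]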

Writing the Validation Logic

Now we enforce these rules in code.

import csv
from datetime import datetime

valid_rows = []
invalid_rows = []

with open("raw_signups.csv", "r") as file:
reader = csv.DictReader(file)

for row in reader:
    try:
        signup_time = datetime.strptime(
            row["signup_time"], "%Y-%m-%d %H:%M:%S"
        )
    except Exception:
        invalid_rows.append(row)
        continue

    if not row["source"]:
        invalid_rows.append(row)
        continue

    if not row["age"]:
        invalid_rows.append(row)
        continue

    if int(row["age"]) < 18:
        invalid_rows.append(row)
        continue

    valid_rows.append({
        "user_id": row["user_id"],
        "signup_date": signup_time.date(),
        "source": row["source"],
        "age": int(row["age"])
    })

This code is intentionally plain.

Validation code should be readable by anyone on the team. If validation logic becomes clever, it becomes dangerous.

Every rejection path is explicit. Every assumption is visible.

What Happens After Validation

Validation always produces two outputs.

One dataset that passed.
One dataset that did not.

Both are important.

Valid records

valid_signups.csv

user_id,signup_date,source,age
201,2026-01-11,google,29

Only one record survives.

That feels uncomfortable at first.

But comfort is not the goal. Accuracy is.
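
For completeness, a minimal sketch of how the surviving rows could be written to valid_signups.csv, assuming the valid_rows list and csv import from the validation step:

with open("valid_signups.csv", "w", newline="") as file:
    writer = csv.DictWriter(
        file,
        fieldnames=["user_id", "signup_date", "source", "age"]
    )
    writer.writeheader()
    writer.writerows(valid_rows)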

Invalid records

with open("invalid_signups.csv", "w", newline="") as file:
    writer = csv.DictWriter(
        file,
        fieldnames=["user_id", "signup_time", "source", "age"]
    )
    writer.writeheader()
    writer.writerows(invalid_rows)

invalid_signups.csv

user_id,signup_time,source,age
202,2026-01-11 09:40:00,facebook,
203,2026-01-11 09:55:42,google,17
204,invalid_time,twitter,25
205,2026-01-11 10:30:00,,42

This file is not a failure log.

It is a diagnostic artifact.

It allows engineers to answer questions later.
Why did numbers change?
Why was a record excluded?
What assumptions no longer hold?
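
One common refinement, not applied in the pipeline above, is to record why each row was rejected so those questions answer themselves. A sketch, assuming the invalid_rows list from the validation step; the reject helper and rejection_reason column are hypothetical.

def reject(row, reason):
    # Keep the raw row untouched; attach the reason to a copy.
    rejected = dict(row)
    rejected["rejection_reason"] = reason
    invalid_rows.append(rejected)

# Inside the validation loop, each rejection path names its rule, e.g.:
#     if not row["source"]:
#         reject(row, "source must not be empty")
#         continue

If adopted, rejection_reason would also need to be added to the fieldnames of the invalid-row writer above.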

Why Rejecting Data Feels Risky

Teams often resist strict validation.

They worry about losing information. They worry about explaining discrepancies. They worry about breaking reports.

Those concerns are understandable.

But accepting broken data shifts the risk downstream. It hides problems until they become harder to diagnose.

Validation surfaces issues early, when context still exists.

Aggregation After Validation

Only validated data is allowed into analytics.


from collections import defaultdict

daily_counts = defaultdict(int)

for row in valid_rows:
    key = (row["signup_date"], row["source"])
    daily_counts[key] += 1

Notice what is missing.

There are no checks here. No defensive coding. No fallback logic.

That is intentional.

Aggregation logic should assume correctness. Validation already did the hard work.

Producing the Final Report

with open("validated_daily_report.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["signup_date", "source", "signups"])

    for (date, source), count in daily_counts.items():
        writer.writerow([date, source, count])

Output

signup_date,source,signups
2026-01-11,google,1

Without validation, the report would show five signups.

With validation, it shows one.

Both numbers come out of a pipeline that ran successfully.

Only one is defensible.

What Changed Conceptually

Nothing about the infrastructure changed.

No new tools. No frameworks. No services.

What changed was intent.

The system learned how to reject input that violates reality.

That is the essence of data engineering.

How This Scales

In production systems, CSV files become streams.
Python scripts become distributed jobs.
Local files become warehouses.

The logic stays the same.

  • raw data is immutable
  • validation is explicit
  • rejected data is visible
  • analytics only sees defended records

Systems that follow this pattern survive growth. Systems that skip it accumulate invisible debt.
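
A sketch of what "the logic stays the same" can mean in practice: if the row-level check is a pure function, the same function can drive a local script today and the filter step of a distributed job later. The function name and the dataframe call in the comments are illustrative.

from datetime import datetime

def is_valid_signup(row):
    # Same rules as the local pipeline, expressed as a reusable predicate.
    if not row.get("user_id") or not row.get("source") or not row.get("age"):
        return False
    try:
        datetime.strptime(row["signup_time"], "%Y-%m-%d %H:%M:%S")
    except (KeyError, TypeError, ValueError):
        return False
    return row["age"].isdigit() and int(row["age"]) >= 18

# Locally:  valid_rows = [row for row in reader if is_valid_signup(row)]
# At scale: the same predicate becomes the filter of a distributed job,
#           for example rows.filter(is_valid_signup) in many engines.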

Where Validation Stops Working

Validation protects structure.

It does not protect meaning forever.

Fields change. Schemas evolve. New values appear. Old assumptions break silently.

A record can be valid and still wrong.
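
A hypothetical illustration: the row below passes every rule defined earlier, yet if the application starts sending campaign codes in the source field instead of channel names, the data is structurally valid and semantically wrong.

row = {
    "user_id": "206",
    "signup_time": "2026-01-11 11:05:00",
    "age": "33",
    "source": "CAMP-4471",  # parses fine, is non-empty, and is no longer a channel
}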

That is the next problem.

What Comes Next

Day 4 will focus on schema changes and how pipelines fail even when validation exists.

We will break this same pipeline by changing field definitions and observe what happens.

Because stopping bad data is only the first step.

Understanding change is the real challenge.

Stay Ahead of What Changes Quietly

If this article introduced a sense of unease, that response is useful. In data systems, discomfort often appears before failure becomes measurable.

Pipelines rarely break because they stop running. They break because inputs change while assumptions remain the same. Invalid records, shifted formats, and silent edge cases continue to flow without triggering errors.

Day 4 will focus on these changes. We will look at schema evolution, backward compatibility, and practical techniques for detecting structural drift before it affects analytics or downstream consumers.

Reliable systems are not built by reacting to outages.
They are built by making change visible and enforcing intent at the right boundaries.
