What Interviewers Actually Want to Hear About Load Balancing

I was helping a friend debug why their backend kept dying every time they got featured somewhere. Traffic would spike, one EC2 instance would choke on it, and then everything would just… stop responding. They had good code. The database was fine. The problem was simpler than they thought: all the requests were hitting one server, and that server just couldn’t keep up.

That’s what load balancing is about. And if you’re preparing for system design interviews, this is one of those topics that comes up almost every time. Not because interviewers love torturing candidates, but because it’s genuinely hard to design any system at scale without thinking about it.

So let me walk through everything: what load balancers actually do, how they decide where to send traffic, what happens when things go wrong, and what interviewers are actually listening for when they ask about this.

What a Load Balancer Actually Does

Basically, a load balancer sits in front of your servers and acts like a traffic cop. Every request that comes in goes to the load balancer first, and the load balancer decides which server should handle it. That’s it. The whole idea.

The client (someone’s browser, a mobile app, whatever) doesn’t know or care which server handles their request. They send it to one address (your load balancer), and the load balancer takes care of the rest.

This solves a few things at once. First, no single server gets buried: if you have five servers and 500 requests per second, ideally each server handles around 100. Second, if one server dies, the load balancer stops sending traffic to it and the others pick up the slack. This is what people mean when they talk about high availability: your system keeps running even when individual parts fail.

There’s also something called horizontal scaling, which is just the idea that instead of buying a bigger and bigger single server (vertical scaling), you add more smaller servers. Load balancers make horizontal scaling possible in practice. Without something managing traffic, you can’t just add servers. You’d have to somehow tell every client about every server, which is a mess.

Where the Load Balancer Lives (Layer 4 vs Layer 7)

This comes up in interviews. Don’t skip it.

Layer 4 load balancers work at the transport layer. They look at TCP/UDP info (source IP, destination IP, ports) and route based on that. They don’t read the actual HTTP request. They’re fast because they’re doing minimal work.

Layer 7 load balancers work at the application layer. They can actually read the HTTP request: the URL, the headers, cookies, even the request body. This means you can do much smarter routing. “If the request URL starts with /api/video, send it to the video processing servers. If it starts with /api/auth, send it to the auth servers.” That kind of thing.
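To make the path-based routing idea concrete, here’s a minimal sketch of the decision a Layer 7 load balancer makes. The pool names and the choose_pool function are made up for illustration. A real load balancer expresses this as configuration rules, not application code.

```python
# Hypothetical pool names; a real load balancer would map these to server groups.
ROUTES = [
    ("/api/video", "video-pool"),
    ("/api/auth", "auth-pool"),
]
DEFAULT_POOL = "web-pool"


def choose_pool(path: str) -> str:
    """Return the pool for the first matching path prefix, else the default."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL


print(choose_pool("/api/video/123"))  # video-pool
print(choose_pool("/about"))          # web-pool
```

A Layer 4 load balancer can’t do this, because it never sees the path at all.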

Most modern systems use Layer 7 load balancers. AWS ALB (Application Load Balancer) is Layer 7. AWS NLB (Network Load Balancer) is Layer 4. In interviews, if someone asks about routing traffic based on content, you want Layer 7. If raw throughput and latency matter more than routing logic, Layer 4 might be the answer.

How It Decides Where to Send Traffic

This is where interviewers like to dig in. There are several different algorithms and each has a situation where it makes sense.

Round Robin is the simplest one. Request 1 goes to server A. Request 2 goes to server B. Request 3 goes to server C. Then back to A. It’s dumb in a useful way — no state to maintain, easy to understand. Works fine when all your servers are identical and requests are roughly similar in cost.
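If it helps to see it, round robin is only a few lines. This is a sketch with placeholder server names, not how any particular load balancer implements it.

```python
from itertools import cycle


def round_robin(servers):
    """Return an iterator that walks the servers in strict rotation, forever."""
    return cycle(servers)


picker = round_robin(["A", "B", "C"])
print([next(picker) for _ in range(5)])  # ['A', 'B', 'C', 'A', 'B']
```

Notice there is nothing in there about how busy each server is. That is both the appeal and the weakness.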

The problem is requests aren’t always equal. One request might be a simple health check that takes 5ms. Another is a video transcoding job that takes 3 seconds. Round Robin doesn’t know the difference, so a server can end up with ten expensive requests while another is sitting idle.

Weighted Round Robin lets you say “server A should get 60% of traffic and server B should get 40%.” Useful when your servers have different capacities, for example when you upgraded some of them and the new ones are beefier.
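Here’s a deliberately naive sketch of the 60/40 split, using integer weights of 3 and 2. It just repeats each server in the rotation according to its weight. The function name and the weights are my own choices for illustration.

```python
from itertools import cycle


def weighted_round_robin(weights):
    """Return an endless iterator where each server appears in proportion to its weight."""
    expanded = [server for server, weight in weights.items() for _ in range(weight)]
    return cycle(expanded)


picker = weighted_round_robin({"A": 3, "B": 2})  # 3:2 is the 60/40 split
first_ten = [next(picker) for _ in range(10)]
print(first_ten.count("A"), first_ten.count("B"))  # 6 4
```

This version sends A three requests in a row before B gets any. Production implementations usually interleave the picks more evenly, but the proportions come out the same.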

Least Connections is smarter. Instead of rotating, it sends each new request to the server with the fewest active connections right now. This handles the unequal request problem better. If server A is dealing with a bunch of slow requests, server B gets the next one. In practice this works well, though I’ve seen it cause some weird clustering behavior when connection counts don’t tell the whole story.
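A rough sketch of the bookkeeping, assuming the load balancer can see when each request starts and finishes. The class and method names are mine, not from any specific product.

```python
class LeastConnections:
    def __init__(self, servers):
        # Active connection count per server.
        self.active = {server: 0 for server in servers}

    def acquire(self) -> str:
        """Pick the server with the fewest active connections and count the new one."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        """Call this when a request to the server finishes."""
        self.active[server] -= 1


lb = LeastConnections(["A", "B"])
first = lb.acquire()   # A (ties go to the first server listed)
second = lb.acquire()  # B
lb.release(second)     # B's request finishes quickly
third = lb.acquire()   # B again, because A is still busy
print(first, second, third)  # A B B
```

Unlike round robin, this one needs state, which is the price you pay for reacting to slow requests.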

IP Hash always sends requests from the same client IP to the same server. This is useful for session stickiness: if your app needs to remember state on the server side, you want the same user to hit the same server every time. The downside is that if a server goes down, all those users get reassigned and lose their session anyway.
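A minimal sketch of the idea, assuming a plain hash-then-modulo scheme. The function name is hypothetical and the IP is from a range reserved for documentation. I’m using hashlib rather than Python’s built-in hash() because the built-in one is randomized per process for strings, which would defeat the point.

```python
import hashlib


def pick_by_ip(client_ip: str, servers: list) -> str:
    """Map a client IP to a server deterministically via a stable hash."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["A", "B", "C"]
# The same IP lands on the same server every time, as long as the list is unchanged.
print(pick_by_ip("203.0.113.7", servers) == pick_by_ip("203.0.113.7", servers))  # True
```

With plain modulo like this, changing the number of servers can remap many clients, not just the ones on the server that changed. That is worth mentioning if the interviewer pushes on it.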

Least Response Time is similar to least connections but looks at actual response latency. Send the next request to whichever server is responding fastest. More accurate but requires the load balancer to keep track of response times.
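One way to track “responding fastest” is a moving average of latency per server. This sketch uses an exponentially weighted average. The smoothing factor of 0.2 and the class name are assumptions of mine, and real load balancers differ in how they measure this.

```python
class LeastResponseTime:
    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha  # weight given to the newest sample
        # Starts at 0.0, so servers with no samples yet look fastest.
        self.avg_ms = {server: 0.0 for server in servers}

    def record(self, server: str, latency_ms: float) -> None:
        """Fold a new latency sample into the server's moving average."""
        previous = self.avg_ms[server]
        self.avg_ms[server] = (1 - self.alpha) * previous + self.alpha * latency_ms

    def pick(self) -> str:
        """Return the server with the lowest average latency."""
        return min(self.avg_ms, key=self.avg_ms.get)


lb = LeastResponseTime(["A", "B"])
lb.record("A", 100.0)  # A answered slowly
lb.record("B", 20.0)   # B answered quickly
print(lb.pick())  # B
```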

Interview tip: Don’t just list these. Be ready to say which one you’d pick and why. “For a stateless API where requests are uniform, round robin is fine. For a backend that does expensive computations, I’d use least connections. If we have sticky sessions and can’t move sessions to a cache yet, IP hash.”

Health Checks — How the Load Balancer Knows a Server Is Dead

This is a common interview follow-up. How does the load balancer know to stop sending traffic to a failed server?

It runs health checks. Every few seconds (configurable — usually 10 to 30 seconds in production), the load balancer sends a request to each server at a specific health check endpoint like /health or /ping. The server responds with a 200 OK if it's healthy.

If a server fails to respond, or responds with an error, the load balancer marks it as unhealthy and stops sending traffic to it. It keeps checking, and when the server starts responding again, it gets added back to the pool.

The details matter here. There’s a threshold — usually a server has to fail two or three consecutive checks before it’s marked down. This avoids flipping a server off and on because of a single blip. Same on the way back — it usually needs two or three successful checks to be considered healthy again.
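The threshold logic is easier to reason about as a tiny state machine. Here’s a sketch, assuming three failures to mark a server down and two successes to bring it back. The class name and defaults are illustrative.

```python
class HealthTracker:
    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.healthy = True
        self.streak = 0  # consecutive results that disagree with the current state

    def observe(self, check_passed: bool) -> bool:
        """Record one health check result and return the current state."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with current state, so reset
        else:
            self.streak += 1
            limit = self.recover_threshold if check_passed else self.fail_threshold
            if self.streak >= limit:
                self.healthy = check_passed
                self.streak = 0
        return self.healthy


tracker = HealthTracker()
results = [False, True, False, False, False, True, True]
print([tracker.observe(r) for r in results])
# [True, True, True, True, False, False, True]
```

The single failure at the start never takes the server out of rotation. It takes three in a row to mark it down, and two passes in a row to bring it back.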

What should your health check endpoint actually check? A naive /health endpoint that always returns 200 is almost useless: it'll say the server is healthy even if the database connection is broken. A good health check verifies that the critical dependencies are working. Can it reach the database? Is the cache responding? Does it have disk space? I've seen teams burn hours because their app was "healthy" according to the load balancer but the database connection pool was exhausted. The server was technically running but couldn't actually do anything useful.
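Here’s roughly what a dependency-aware health check could look like, stripped of any web framework. The check functions are stubs I made up. In a real service they would ping the database, the cache, and so on, and the result would be returned from the /health handler.

```python
def health_status(checks):
    """Run each dependency check and return (http_status, details)."""
    details = {}
    for name, check in checks.items():
        try:
            details[name] = "ok" if check() else "failing"
        except Exception as exc:  # a check that raises counts as a failure
            details[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in details.values())
    return (200 if healthy else 503), details


def database_ok():
    return True  # stub: pretend the database answered


def cache_ok():
    raise ConnectionError("cache unreachable")  # stub: pretend the cache is down


print(health_status({"database": database_ok, "cache": cache_ok}))
# (503, {'database': 'ok', 'cache': 'error: cache unreachable'})
```

Returning 503 instead of 200 is what lets the load balancer notice the “running but useless” case.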

Sticky Sessions and Why They’re a Problem

Sticky sessions (also called session affinity) means the load balancer always routes a specific user to the same backend server. Usually done via a cookie or IP hash.

The reason people want this: if your server stores session data in memory, the user has to keep hitting the same server or they’ll lose their session. Simple, works for small scale.

The problem: it breaks the whole point of load balancing. If 10 heavy users are all pinned to server A because that’s the server they hit first, server A might be handling 80% of your traffic while the others sit mostly idle. Also, if server A goes down, those users lose their sessions anyway.

The better answer in an interview is almost always: don’t use sticky sessions. Store session state somewhere external: Redis, a database, whatever fits. Then it doesn’t matter which server handles the request because the session lives outside the servers. All your servers become stateless and interchangeable.
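A small sketch of why this works. The SessionStore here is an in-memory stand-in for something like Redis, and both class names are hypothetical. The point is only that two different servers can serve the same user without losing anything.

```python
class SessionStore:
    """In-memory stand-in for an external store such as Redis."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """A stateless server: all session data lives in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        session.setdefault("cart", []).append(item)
        self.store.put(session_id, session)
        return session["cart"]


store = SessionStore()
server_a = AppServer("A", store)
server_b = AppServer("B", store)
server_a.add_to_cart("session-1", "book")
print(server_b.add_to_cart("session-1", "pen"))  # ['book', 'pen']
```

The second request went to a different server and the cart was still there, so the load balancer is free to use any algorithm it likes.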

That said, sometimes sticky sessions are unavoidable. Legacy systems, third-party software with in-memory state, situations where you can’t change the application. Interviewers usually want to hear you explain why sticky sessions are a tradeoff, not just that you’d avoid them.

The Load Balancer Itself Can Go Down

Here’s a thing people forget to mention: if you have one load balancer and it goes down, everything goes down. The load balancer that’s supposed to prevent single points of failure is itself a single point of failure.

The solution is to run multiple load balancers. Usually two — a primary and a standby. They share a virtual IP address (VIP). Both load balancers monitor each other. If the primary fails, the standby detects this and claims the VIP for itself. DNS still points to the same IP, so clients don’t notice anything. This is called active-passive HA.
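The failover decision on the standby side can be sketched as a heartbeat timeout. This is a toy model with names and a 3-second timeout that I picked for illustration. Real setups typically rely on a protocol like VRRP (keepalived is a common implementation) rather than hand-rolled code.

```python
class Standby:
    def __init__(self, timeout_s=3.0):
        self.timeout_s = timeout_s
        self.last_heartbeat = 0.0  # time the primary was last heard from
        self.owns_vip = False

    def heartbeat(self, now: float) -> None:
        """Record that the primary was alive at time `now` (in seconds)."""
        self.last_heartbeat = now

    def tick(self, now: float) -> bool:
        """Claim the VIP if the primary has been silent longer than the timeout."""
        if not self.owns_vip and now - self.last_heartbeat > self.timeout_s:
            self.owns_vip = True
        return self.owns_vip


standby = Standby(timeout_s=3.0)
standby.heartbeat(10.0)
print(standby.tick(12.0))  # False: primary was heard 2 seconds ago
print(standby.tick(13.5))  # True: 3.5 seconds of silence, standby takes over
```

The sketch leaves out the hard part, which is making sure both nodes never believe they own the VIP at the same time.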

There’s also active-active, where both load balancers are handling traffic simultaneously. Traffic gets split between them, so you get actual capacity as well as redundancy, not just failover.

AWS handles this for you with ELB: the load balancer is managed, and AWS handles the redundancy. But understanding how it works underneath is what separates a good answer from a great one in an interview.

Different Types of Load Balancers

Hardware load balancers — physical devices. Things like F5 BIG-IP. Used to be the default in big enterprises. Expensive. Less flexible. Still exist in some regulated industries.

Software load balancers — programs that run on regular servers. HAProxy and Nginx are the big ones. HAProxy is specifically built for this and is very good at it. Nginx handles load balancing as one of its many features. Both are open source and widely used.

Cloud-managed — AWS ELB/ALB/NLB, GCP Load Balancing, Azure Load Balancer. Someone else manages the infrastructure and redundancy. You just configure rules. This is what most companies use now.

DNS load balancing — kind of a different category. Instead of a physical or software device, you configure DNS to return different IP addresses in rotation. Round Robin DNS. Cheap and simple, but DNS caching means you have no real control over where clients actually go. Not good for failover — if a server goes down, clients with a cached DNS entry will keep trying to hit it for however long the TTL is.

For an interview, if they ask “how would you set up load balancing for this service”, DNS-only is usually not the right answer for anything where reliability matters. But it can be part of a global distribution strategy when combined with proper load balancers in each region.

Global Load Balancing and GeoDNS

Okay, now we’re getting into the stuff that comes up when the interviewer wants to see if you’ve thought about global scale.

If your users are in India, the US, and Europe, you don’t want everyone hitting servers in us-east-1. A user in Mumbai talking to a server in Virginia is adding 200ms+ of network time just for the round trip.

GeoDNS solves this at the routing level. When a user in Mumbai looks up your domain, the DNS server returns the IP of your Mumbai/Singapore servers. A user in London gets routed to your Frankfurt or London region. They never even see the same servers.
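Conceptually, the DNS side is a lookup from “where is this user” to “which regional load balancer should they get”. Here’s a sketch with a made-up mapping. The IPs come from ranges reserved for documentation, and working out the user’s country from the DNS query is the hard part that I’m skipping.

```python
# Hypothetical regional load balancer addresses (documentation-only IP ranges).
REGION_LB = {
    "ap-south-1": "192.0.2.10",       # Mumbai
    "eu-central-1": "198.51.100.10",  # Frankfurt
    "us-east-1": "203.0.113.10",      # N. Virginia
}
COUNTRY_TO_REGION = {
    "IN": "ap-south-1",
    "GB": "eu-central-1",
    "DE": "eu-central-1",
    "US": "us-east-1",
}
DEFAULT_REGION = "us-east-1"


def resolve(country_code: str) -> str:
    """Return the load balancer IP for the region closest to the country."""
    region = COUNTRY_TO_REGION.get(country_code, DEFAULT_REGION)
    return REGION_LB[region]


print(resolve("IN"))  # 192.0.2.10
print(resolve("BR"))  # 203.0.113.10 (no mapping, so the default region)
```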

On top of that, each region has its own load balancers distributing traffic among servers in that region.

AWS Route 53 has latency-based routing and geolocation routing built in. Cloudflare does this too. The tricky part is data: if your users are distributed globally but your database is in one region, you’ve just pushed the latency problem from the network layer to the database layer. Global load balancing usually has to come with some conversation about read replicas, CDNs for static content, and possibly multi-region databases.

This part honestly still confuses me sometimes. The failure modes when you have multiple regions (split brain, replication lag, traffic shifting during a regional outage) add up to a lot, and I don’t think I’ve seen a design that gets all of it perfectly right. But knowing that the problem exists and being able to name the tradeoffs is usually enough for most interviews.

SSL Termination

One more thing that comes up. SSL termination at the load balancer means the load balancer handles the HTTPS decryption. The client connects to the load balancer over HTTPS. The load balancer decrypts the traffic and then communicates with your backend servers over plain HTTP (within your private network).

This is good for a few reasons. Your backend servers don’t have to spend CPU cycles on encryption/decryption. You manage SSL certificates in one place (on the load balancer) instead of on every server. When you need to update certificates, you do it once.

The concern: traffic between the load balancer and backend servers is unencrypted. If you’re in a trusted private network (a VPC, for example), this is usually acceptable. If you need end-to-end encryption — maybe compliance requires it — you can do SSL re-encryption where the load balancer decrypts and then re-encrypts to the backend. More CPU, more complexity, but it’s there if you need it.

What Interviewers Actually Want to Hear

After sitting through a few mock interviews and helping others prep, here’s what I think they’re listening for:

They want to know you understand that a load balancer is not magic. It’s a component with its own failure modes. You should talk about health checks without being prompted. You should bring up the question of what happens when the load balancer itself fails. You should know that the choice of algorithm depends on your workload.

They also want to see you connect load balancing to other concepts. Load balancing + auto-scaling together is how you handle traffic spikes. Load balancing + stateless services is how you make horizontal scaling actually work. Load balancing + health checks is the basic building block of high availability.

Don’t just describe what a load balancer does — explain the design decisions. Why Layer 7 over Layer 4 here? Why least connections over round robin for this specific case? Why are we avoiding sticky sessions and how are we handling sessions instead?

And one honest thing: interviewers vary a lot on how deep they go. Some just want the basics. Others will push you into the failure modes, the CAP theorem implications of distributed state, the cost tradeoffs of managed vs self-hosted. You can’t fully predict it. What you can do is know the fundamentals cold and then follow wherever the conversation goes.

A Quick Note on the Numbers

One thing that trips people up: load balancers add latency too. Not much, usually under 1ms for a well-configured software load balancer, but it’s not zero. In high-frequency trading systems or other latency-sensitive real-time applications, this matters. For most web applications, it doesn’t.

Also, load balancers have a capacity limit of their own. An AWS ALB can handle tens of thousands of requests per second, which is more than most companies ever need. But a traffic spike can potentially overwhelm even a load balancer. AWS handles this mostly transparently by scaling ALBs automatically, but if you’re running HAProxy on a single box, you need to think about this.

You can have twenty lanes of highway, but if the interchange itself is only two lanes, you still have a bottleneck. A load balancer has to be sized appropriately for the traffic it’s managing, not just the servers behind it.

Before You Go

Load balancing is one of those things where the basic idea is simple enough that it’s easy to think you know it fully. You don’t, until you’ve thought through the failure cases.

So if you’re studying for an interview, go through these questions on your own: What happens if all your backend servers fail health checks simultaneously? What happens if your load balancer is in a different availability zone than your servers and the network between them has a partial failure? If you’re using least connections, how does a slow server that has many open but idle connections get treated?

Not because all of those will come up. But because working through them forces you to understand the system rather than memorize it.

That’s usually the difference between a good answer and one that gets you the offer.