Latency vs Throughput

In system design and performance engineering, latency and throughput are two fundamental metrics that describe how well a system performs. While often discussed together, they measure entirely different aspects of performance. Optimizing one does not automatically improve the other, and many architectural decisions involve trading one for the other. Understanding the distinction is essential for designing systems that meet user expectations and business requirements.

What Is Latency?

Latency measures the time it takes for a single operation to complete—from request initiation to final response. It is the delay perceived by a user or a calling service. Lower latency means a faster, more responsive system.

Common sources of latency include:

  • Network transmission time – The time data travels over the network.
  • DNS lookup and TLS handshake – Initial connection setup overhead.
  • Application processing – Business logic and computation.
  • Database queries – Query execution and result fetching.
  • Disk I/O – Reading from or writing to persistent storage.
  • Cache misses – Additional round trips when data is not in cache.
  • Serialization / deserialization – Converting data formats.
  • Garbage collection pauses – Runtime memory management delays.
  • Lock contention – Waiting for shared resources.

Latency is typically reported in milliseconds (ms) or microseconds (μs). Because averages can be misleading, engineers focus on percentiles:

Percentile   | Meaning
P50 (median) | Half of requests are faster than this value.
P95          | 95% of requests are faster; 5% are slower.
P99          | 99% of requests are faster; 1% are slower.
P99.9        | 99.9% of requests are faster; 0.1% are slower.
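
As a rough illustration of why averages mislead, the sketch below computes percentiles with the nearest-rank method. The sample latencies are invented for the example, and `percentile` is a helper written here, not a library function.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 480, 950]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms, P50={p50} ms, P90={p90} ms, P99={p99} ms")
# prints: mean=153.8 ms, P50=14 ms, P90=480 ms, P99=950 ms
```

The mean (153.8 ms) describes no request that actually occurred: the typical request took about 14 ms, and the slow ones took far longer. With only ten samples P99 is simply the maximum, so real measurements need many more data points before high percentiles are meaningful.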

Tail latency (P99, P99.9) is crucial in distributed systems because a single user request often depends on multiple backend calls. Even if most calls are fast, a few slow ones can dominate the overall user experience.
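
A back-of-the-envelope sketch of this effect, assuming the backend calls are independent (a simplification, since real services often slow down together): the chance that a request touches at least one call slower than the backend's P99 is 1 − 0.99^N for a fan-out of N.

```python
def prob_any_slow(fanout, percentile=0.99):
    """Probability that at least one of `fanout` independent calls is slower than the given percentile."""
    return 1 - percentile ** fanout

for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {prob_any_slow(n):.1%} of requests include a P99-slow call")
```

Under this assumption, a fan-out of 10 exposes roughly one request in ten to a backend's slowest 1%, and a fan-out of 100 exposes well over half of all requests.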

What Is Throughput?

Throughput measures the amount of work a system can complete in a given time period. It reflects system capacity, not response speed. A system with high throughput can handle many requests per second, even if each individual request takes some time.

Throughput is expressed in units such as:

  • Requests per second (RPS)
  • Transactions per second (TPS)
  • Queries per second (QPS)
  • Messages per second
  • Data transfer rate (e.g., MB/s, Gbps)

Throughput is primarily limited by resource capacity: CPU cores, memory bandwidth, I/O bandwidth, and the ability to parallelize work. While latency focuses on how fast, throughput focuses on how many.

Latency vs Throughput

Aspect | Latency | Throughput
Definition | Time to complete one operation. | Number of operations per time unit.
Measurement | Milliseconds, microseconds. | RPS, TPS, QPS, MB/s.
User Impact | Felt as responsiveness or delay. | Felt as ability to serve many users concurrently.
Optimization Goal | Reduce response time. | Increase processing capacity.
Typical Bottlenecks | Network, serialization, lock contention, disk I/O. | CPU, memory bandwidth, database connections, queue depth.
Example Metric | P95 read latency < 100 ms. | 10,000 requests per second sustained.
Real-World Analogy | How long a car takes to travel a road. | How many cars the road can handle per hour.

They are complementary: a system with excellent latency but poor throughput can serve one user fast, but collapses under concurrent load. Conversely, a system with high throughput but high latency may process millions of jobs per day, but each job takes a long time—unacceptable for interactive applications.

Real-World Examples

Example 1: Search Engine

When a user types a query, they expect results in under a few hundred milliseconds. Latency is paramount. Even if the system can handle millions of queries per hour (high throughput), a slow response time would frustrate users immediately.

Example 2: Batch Processing Platform

A nightly data pipeline that processes terabytes of logs is measured by throughput—how many records per second it can handle. A single record taking 100 ms is fine; the goal is maximizing overall throughput.

Example 3: Video Streaming Platform

Video streaming requires balancing both. Throughput must be high enough to deliver video data without buffering for millions of concurrent viewers, but startup latency (time to first frame) must be low for a good user experience.

Example 4: Online Payment System

Both latency and throughput are critical. Payment authorization must be fast (low latency) to avoid cart abandonment, but the system must also handle peak transaction volumes during flash sales (high throughput).

Sources of Latency

End-to-end latency is the sum of many components. The following diagram illustrates a typical request path:

Each step adds latency. Reducing latency requires optimizing the entire chain, not just one component.

Factors Affecting Throughput

Throughput is influenced by how effectively a system uses its resources:

  • CPU capacity and parallelism – Multiple cores and efficient concurrency increase throughput.
  • Thread pools and connection pools – Reusing expensive resources reduces overhead.
  • Database performance – Indexing, query planning, and connection limits.
  • Queue length – Bounded queues allow graceful backpressure; unbounded queues can hide latency issues.
  • Resource contention – Lock contention or shared resources limit parallel execution.
  • Horizontal scaling – Adding more instances can increase aggregate throughput close to linearly, as long as no shared dependency (such as the database) becomes the bottleneck.

Throughput bottlenecks often appear at the database or network tier when the number of concurrent operations exceeds what a single instance can handle.
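
The bounded-queue point above can be sketched with Python's standard `queue.Queue`. The capacity of 3 and the `submit` helper are illustrative assumptions, not a recommended configuration.

```python
import queue

def submit(work_queue, item):
    """Try to enqueue without blocking; return False so the caller can shed load or retry later."""
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False

work_queue = queue.Queue(maxsize=3)  # hypothetical capacity
results = [submit(work_queue, i) for i in range(5)]
print(results)  # [True, True, True, False, False]
```

Rejecting work once the queue is full makes overload visible to the caller immediately. An unbounded queue would accept all five items, and the overload would surface only later as growing wait time.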

Latency vs Throughput Trade-offs

Improving one metric often hurts the other:

  • Batching – Processing requests in batches increases throughput (amortized overhead) but adds latency because the first request waits for others to form a batch.
  • Asynchronous messaging – Decoupling services via queues improves throughput and resilience, but end‑to‑end latency increases because of queuing and processing delays.
  • Compression – Shrinking payloads reduces transfer time on bandwidth‑constrained links but adds CPU work to compress and decompress, which can reduce throughput on CPU‑bound services.
  • Strong consistency – Synchronous replication and distributed transactions ensure correctness (reliability) but increase latency and reduce throughput compared to eventual consistency.
  • Caching – Reduces both latency and database load (improving throughput), but caching stale data can violate consistency requirements.

Architects must decide which metric is more important for a given subsystem and optimize accordingly.
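
The batching trade-off can be made concrete with a simple cost model. This is a sketch under stated assumptions: each batch pays a fixed overhead plus a per-item cost, items arrive at a steady rate, and all numbers are invented for illustration.

```python
def batch_capacity(batch_size, overhead_ms, per_item_ms):
    """Maximum items per second when each batch pays a fixed overhead plus a per-item cost."""
    batch_time_ms = overhead_ms + batch_size * per_item_ms
    return batch_size / batch_time_ms * 1000

def first_item_wait_ms(batch_size, arrival_rate_per_s):
    """Time the first item waits for the rest of its batch to arrive at a steady arrival rate."""
    return (batch_size - 1) / arrival_rate_per_s * 1000

OVERHEAD_MS = 10    # hypothetical fixed cost per batch (e.g. one round trip)
PER_ITEM_MS = 1     # hypothetical cost per item
ARRIVAL_RATE = 200  # hypothetical items arriving per second

for size in (1, 10, 100):
    capacity = batch_capacity(size, OVERHEAD_MS, PER_ITEM_MS)
    wait = first_item_wait_ms(size, ARRIVAL_RATE)
    print(f"batch={size:>3}: capacity≈{capacity:6.0f} items/s, first item waits≈{wait:5.0f} ms")
```

In this model, going from a batch size of 1 to 10 raises capacity from about 91 to 500 items per second, while the first item in each batch waits about 45 ms for the batch to fill. Larger batches keep raising capacity, but with diminishing returns and steadily growing wait.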

Little’s Law

Little’s Law is a fundamental theorem in queueing theory that relates latency, throughput, and concurrency:

L = λ × W

Where:

  • L = Average number of requests in the system (concurrent requests being processed).
  • λ = Average arrival rate in requests per second; in a stable system, where requests complete as fast as they arrive, this equals throughput.
  • W = Average time a request spends in the system (latency).

Practical Example: Suppose you observe that your application server maintains an average of 20 concurrent requests (L = 20) and the average latency is 200 ms (W = 0.2 s). The throughput is:

λ = L / W = 20 / 0.2 = 100 requests/second

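If traffic spikes to 200 req/s, the system stays stable only if it sustains about 40 concurrent requests at the same 200 ms latency (for example, by scaling out) or cuts average latency to 100 ms at the same concurrency of 20. Otherwise requests queue up and latency grows. Little’s Law helps estimate capacity and diagnose performance issues.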
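
The same arithmetic as a small sketch. The function names are illustrative, and the inputs are the example's own numbers plus the 200 req/s spike.

```python
def throughput_rps(concurrency, latency_s):
    """Little's Law solved for the rate: lambda = L / W."""
    return concurrency / latency_s

def required_concurrency(arrival_rate_rps, latency_s):
    """Little's Law as stated: L = lambda * W."""
    return arrival_rate_rps * latency_s

def required_latency_s(concurrency, arrival_rate_rps):
    """Little's Law solved for latency: W = L / lambda."""
    return concurrency / arrival_rate_rps

print(throughput_rps(20, 0.2))         # about 100 requests/second
print(required_concurrency(200, 0.2))  # about 40 in-flight requests needed at 200 req/s
print(required_latency_s(20, 200))     # about 0.1 s needed to stay at 20 in-flight requests
```

Little's Law relates long-run averages in a stable system, so these figures are estimates for capacity planning rather than guarantees for any single request.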

Performance Optimization Techniques

Reducing Latency

  • Caching – Store results in memory or at the edge (CDN) to avoid recomputation and network trips.
  • Database indexing – Speed up queries by avoiding full scans.
  • Connection pooling – Reuse network connections rather than establishing new ones per request.
  • Compression – Reduce data size over the wire (trades CPU for network latency).
  • Edge computing – Process requests close to users geographically.
  • Making fewer downstream calls – Reduce fan‑out and dependencies.
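
As a minimal sketch of the caching item, the example below memoizes a slow lookup in process memory with the standard library's `functools.lru_cache`. The function `load_display_name` and its 50 ms sleep are hypothetical stand-ins for a database or network call.

```python
import functools
import time

@functools.lru_cache(maxsize=1024)
def load_display_name(user_id):
    """Hypothetical slow lookup; the sleep stands in for a database or network call."""
    time.sleep(0.05)
    return f"user-{user_id}"

def timed_ms(fn, *args):
    """Return (result, elapsed milliseconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000

_, cold_ms = timed_ms(load_display_name, 42)  # miss: pays the full lookup cost
_, warm_ms = timed_ms(load_display_name, 42)  # hit: served from memory

print(f"cold={cold_ms:.1f} ms, warm={warm_ms:.3f} ms")
print(load_display_name.cache_info())
```

The cache never expires entries on its own, so a real service would also need an invalidation or time-to-live policy to avoid serving stale data.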

Improving Throughput

  • Horizontal scaling – Add more instances behind a load balancer.
  • Parallel processing – Split work across multiple threads or processes.
  • Asynchronous processing – Offload heavy work to background queues.
  • Message queues – Decouple services and buffer requests during spikes.
  • Batching – Process multiple items together to amortize overhead.
  • Resource tuning – Optimize thread pools, database connections, and JVM/garbage collector settings.
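
The parallel-processing item can be sketched with a thread pool. This assumes the work is I/O-bound (simulated here with a sleep); in standard CPython, CPU-bound work generally needs processes rather than threads to run in parallel. The task count and pool size are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle(item):
    """Hypothetical I/O-bound task; the sleep stands in for a remote call."""
    time.sleep(0.05)
    return item * 2

def measure(run):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start

items = list(range(8))

serial_results, serial_s = measure(lambda: [handle(i) for i in items])

with ThreadPoolExecutor(max_workers=8) as pool:
    parallel_results, parallel_s = measure(lambda: list(pool.map(handle, items)))

print(f"serial:   {len(items) / serial_s:6.1f} items/s")
print(f"parallel: {len(items) / parallel_s:6.1f} items/s")
```

Each task still takes about 50 ms, so per-item latency is unchanged. Throughput rises because the waits overlap.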

Latency in Distributed Systems

Distributed systems introduce additional latency sources:

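  • Network hops – Every inter‑service call adds a network round trip, typically from under a millisecond to a few milliseconds within a data center and tens of milliseconds across regions, on top of the callee's processing time.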
  • Cross‑region communication – Geographic distance (e.g., US to Europe) adds ~50–100 ms.
  • Serialization / deserialization – JSON, Protobuf, or Avro overhead.
  • Service discovery and load balancing – Additional lookups.
  • Fan‑out requests – Aggregating data from multiple services and waiting for the slowest.

Techniques to reduce distributed latency include using faster serialization (Protobuf), reducing call depth, batching requests, and deploying services in the same region or availability zone.

The following diagram shows how a user-facing request fans out to multiple services, contributing to latency:

The overall latency is determined by the slowest of these parallel calls plus serial dependencies.
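
A simplified model of that statement, with invented service names and timings: total latency is the sum of the serial steps plus the slowest parallel call.

```python
def request_latency_ms(serial_steps_ms, parallel_calls_ms):
    """Serial steps add up; calls issued in parallel cost only as much as the slowest one."""
    return sum(serial_steps_ms) + max(parallel_calls_ms, default=0)

# Hypothetical timings in milliseconds.
serial_steps = {"auth": 5, "aggregate_response": 10}
parallel_calls = {"profile": 20, "orders": 45, "recommendations": 120}

total = request_latency_ms(serial_steps.values(), parallel_calls.values())
print(total)  # 135

# Speeding up a call that is not the slowest does not change the total.
parallel_calls["orders"] = 10
print(request_latency_ms(serial_steps.values(), parallel_calls.values()))  # 135
```

In this model only the critical path matters: optimization effort belongs on the slowest parallel call or on the serial steps.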

Throughput in Distributed Systems

Distributed systems increase throughput by partitioning work:

  • Horizontal scaling – Run many stateless service instances behind a load balancer.
  • Data partitioning (sharding) – Split a database across nodes so each handles a subset of traffic.
  • Replication – Serve read traffic from multiple read replicas.
  • Load balancing – Distribute requests evenly.
  • Queue‑based architectures – A queue absorbs bursts, and multiple workers process messages concurrently.
  • Event‑driven systems – Producers and consumers scale independently.

These patterns allow aggregate throughput to grow nearly linearly with the number of nodes, provided the system is designed for horizontal scalability.
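
As one possible sketch of data partitioning, the example below routes keys to shards with a stable hash. The shard count and key format are assumptions. It uses `hashlib` rather than Python's built-in `hash()`, which is randomized per process for strings by default and so is unsuitable for routing.

```python
import hashlib
from collections import Counter

def shard_for(key, num_shards):
    """Map a string key to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

NUM_SHARDS = 4  # hypothetical cluster size
keys = [f"user-{i}" for i in range(10_000)]
load = Counter(shard_for(k, NUM_SHARDS) for k in keys)

for shard in range(NUM_SHARDS):
    print(f"shard {shard}: {load[shard]} keys")
```

Simple modulo routing spreads keys roughly evenly but remaps most keys when the shard count changes. Consistent hashing is a common alternative when nodes are added or removed frequently.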

Relationship with Other Foundations Topics

Latency and throughput influence and are influenced by many other system design concepts:

  • Scalability – High throughput often requires horizontal scaling; low latency often relies on vertical scaling or caching.
  • Availability – Failover mechanisms can add latency but are essential for uptime.
  • Reliability – Retries and idempotency add latency but protect against failures.
  • Consistency – Strong consistency adds latency due to coordination; eventual consistency improves latency and throughput.
  • Replication – More replicas increase read throughput but may increase write latency (synchronous replication).
  • Partitioning – Splitting data improves write throughput but can increase cross‑partition query latency.

Architecture Best Practices

  • Measure before optimizing – Profile the system to identify true bottlenecks.
  • Focus on tail latency – P95 and P99 metrics matter more than averages for user experience.
  • Eliminate bottlenecks – A single slow component limits the entire system.
  • Reduce unnecessary network calls – Batch requests, merge services, or use in‑memory caching.
  • Cache frequently accessed data – At multiple layers (browser, CDN, application, database).
  • Use asynchronous processing when appropriate – Offload long‑running tasks to background workers.
  • Scale horizontally for throughput – Stateless services scale out easily.
  • Monitor continuously – Track latency percentiles and throughput trends, set alerts on SLOs.

Common Mistakes

  • Optimizing average latency while ignoring tail latency – A few slow requests can ruin user experience.
  • Confusing throughput with concurrency – More concurrent connections do not always mean more throughput if the system is bottlenecked on CPU.
  • Ignoring network latency – Distributing a system globally without addressing cross‑region calls.
  • Overusing synchronous communication – Every synchronous call adds latency and coupling.
  • Scaling without identifying bottlenecks – Adding more instances does not help if the database is the bottleneck.
  • Benchmarking unrealistic workloads – Testing with uniform request sizes and no contention gives misleading results.

Interview Perspective

In system design interviews, latency and throughput are core concepts. Interviewers may ask:

  • What is the difference between latency and throughput?
  • Can throughput be increased without reducing latency?
  • Why does batching improve throughput?
  • What is tail latency, and why does it matter?
  • Explain Little’s Law and how you use it.
  • How do you optimize latency in a distributed system?
  • How do you improve throughput for a high‑traffic service?

Demonstrate that you understand the trade‑offs and can apply practical optimization techniques, not just recite definitions.

Summary

  • Latency is the time to complete one operation; throughput is the number of operations per time unit.
  • They are distinct metrics: you can have fast but low‑capacity systems, or slow but high‑capacity systems.
  • Latency affects user experience directly; throughput determines how many users can be served concurrently.
  • Common latency sources include network, disk, serialization, and lock contention.
  • Throughput is limited by CPU, memory, and parallelism.
  • Little’s Law (L = λ × W) ties them together and helps estimate system capacity.
  • Optimization involves trade‑offs: batching improves throughput but adds latency; strong consistency adds latency but ensures correctness.
  • Distributed systems use horizontal scaling, partitioning, replication, and caching to balance both.
  • Best practices include focusing on tail latency, eliminating bottlenecks, and continuous performance monitoring.

Balancing latency, throughput, scalability, and reliability according to business requirements is the hallmark of a well‑designed system.