Troubleshooting Hot Node Disk Issues in Elastic: When Usage Doesn’t Add Up

Learn to diagnose and prevent Elasticsearch hot node disk spikes with Hyperflex’s expert tips on ILM and cluster health.

Introduction

When managing large Elasticsearch clusters, sudden disk usage spikes on individual hot nodes can pose serious risks—from degraded performance to total write-block scenarios. The challenge? These issues often remain invisible in the cluster health API, yet threaten cluster stability.

This blog walks you through a representative real-world Elastic issue involving unexpected hot node disk growth—uncovering misaligned ILM behavior, lingering index data, and practical mitigation strategies.

The Hot Node Disk Dilemma

Elastic's architecture relies on tiered storage—hot, warm, cold and frozen—designed to optimize performance and cost. But what happens when a hot node shows 90%+ disk usage, while peers hover at 35%?

Even more concerning: the number of shards looks balanced, ILM seems correctly applied, and the cluster is green. So, what’s wrong?

Common Root Causes Behind Unexplained Disk Spikes

Unexpected hot node disk usage is usually a symptom of deeper architectural or operational issues:

🟠 Stale Index Data: Residual files not deleted after shard relocation

🟠 ILM Rollover Delays: Indices growing past max_primary_shard_size before rollover occurs

🟠 Searchable Snapshot Residues: Snapshot data incorrectly left behind in the hot tier

🟠 Shard Lock or Merge Conflicts: Background cleanup tasks blocked

🟠 Manual Overrides or API Misuse: Human errors leaving data stuck on the node

Case Study Breakdown: Diagnosis Without Red Flags

In this scenario:

  • The cluster is green
  • No shards are unassigned
  • ILM policies are applied
  • Yet, one node is using 3x more disk space than its peers

A deeper look reveals:

🔍 Residual Data Still on Disk

du/df output shows hundreds of gigabytes in index folders that no longer have assigned shards.

$ du -sh /var/lib/elasticsearch/nodes/0/indices/*
104G ./irjtyvmUT7mx2gaa-jnvfQ  ← index already moved to warm
 81G ./GiGP1vSTTZup39l1fo5cVg  ← snapshot or orphaned

These indices had successfully rolled over, but their physical data wasn’t cleared from the hot node disk.

Technical Walkthrough: How to Investigate

Here’s how to get to the root of such issues:

1. Check Disk Usage Consistency

GET _cat/allocation?v&h=shards,disk.indices,disk.used,node

Compare disk.indices (space attributed to shards on the node) with disk.used (actual disk consumption). A large gap between the two points to data on disk that the cluster no longer accounts for.
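The comparison can be scripted. Below is an illustrative helper (node names and sizes are hypothetical sample output, not from a real cluster) that parses the text returned by the allocation call above and flags nodes with a large unexplained gap:

```python
# Illustrative helper: flag nodes where disk.used far exceeds disk.indices.
# SAMPLE mimics output of GET _cat/allocation?v&h=shards,disk.indices,disk.used,node
# with hypothetical node names and sizes.

SAMPLE = """\
shards disk.indices disk.used node
   120       58.2gb   402.1gb es-hot-1
   118       60.4gb    64.9gb es-hot-2
"""

UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

def to_bytes(size: str) -> float:
    """Convert a _cat size string like '58.2gb' to bytes."""
    for suffix in ("tb", "gb", "mb", "kb", "b"):
        if size.endswith(suffix):
            return float(size[: -len(suffix)]) * UNITS[suffix]
    return float(size)

def suspicious_nodes(cat_output: str, gap_gb: float = 100.0) -> list[str]:
    """Return nodes whose used disk exceeds shard-attributed disk by > gap_gb."""
    rows = cat_output.strip().splitlines()[1:]  # skip the ?v header row
    flagged = []
    for row in rows:
        _, disk_indices, disk_used, node = row.split()
        if to_bytes(disk_used) - to_bytes(disk_indices) > gap_gb * 1024**3:
            flagged.append(node)
    return flagged

print(suspicious_nodes(SAMPLE))  # es-hot-1 carries ~344 GB not attributed to shards
```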

2. Identify Orphaned Index Folders

du -sh /var/lib/elasticsearch/nodes/0/indices/* | sort -h
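The folder names under nodes/*/indices are index UUIDs, so orphans can be found by diffing the directory listing against the UUIDs the cluster still knows about (e.g., from GET _cat/indices?h=uuid). A minimal sketch, with hypothetical UUID values standing in for real listings:

```python
# Sketch: find on-disk index folders whose UUID no longer matches any live index.
# Both sets are hypothetical; in practice, populate on_disk from the directory
# listing above and live from GET _cat/indices?h=uuid.

def orphaned_folders(on_disk: set[str], live: set[str]) -> list[str]:
    """Index folders present on disk but absent from the cluster state."""
    return sorted(on_disk - live)

on_disk = {"irjtyvmUT7mx2gaa-jnvfQ", "GiGP1vSTTZup39l1fo5cVg", "QW3rty12345demoUUID"}
live = {"QW3rty12345demoUUID"}

print(orphaned_folders(on_disk, live))
```

Folders that show up here are candidates for investigation; verify against snapshots and cluster state before deleting anything by hand.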

3. Use ILM Explain

GET <index-name>/_ilm/explain

Check whether the index has actually progressed to a later phase (warm/frozen) and, if it is still in the hot phase, which step it is waiting on and why it hasn't rolled over.
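The explain response reports the managed flag plus the current phase, action, and step. A small sketch of how to read it (the index name and payload below are a hypothetical sample, not real output): an index sitting in the hot phase on "check-rollover-ready" has not met its rollover conditions, so its data legitimately remains on the hot node.

```python
# Sketch: interpret a (simplified, hypothetical) _ilm/explain response.

explain = {
    "indices": {
        "logs-000042": {
            "managed": True,
            "phase": "hot",
            "action": "rollover",
            "step": "check-rollover-ready",
        }
    }
}

def still_in_hot(explain: dict, index: str) -> bool:
    """True if ILM manages the index and it has not left the hot phase."""
    info = explain["indices"][index]
    return info["managed"] and info["phase"] == "hot"

print(still_in_hot(explain, "logs-000042"))  # True
```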

4. Review ILM Policies

Ensure max_primary_shard_size is set and enforced. Note that ILM only evaluates rollover conditions on its polling cycle (indices.lifecycle.poll_interval), which defaults to 10 minutes, so indices can grow past the threshold between checks.
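If the default 10-minute cycle is too slow for your ingest rate, the poll interval itself is adjustable (the 1m value here is illustrative; remember that shorter intervals add some master-node overhead):

PUT _cluster/settings
{
  "persistent": {
    "indices.lifecycle.poll_interval": "1m"
  }
}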

5. Cluster Routing Workarounds

To temporarily rebalance:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "problem-node-name"
  }
}
Or tune the disk-based allocation watermarks:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "75%",
    "cluster.routing.allocation.disk.watermark.high": "85%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}

Preventative Tactics and Configurations

To prevent such disk anomalies:

✅ Set max_headroom values to maintain breathing room

✅ Reduce ILM polling interval if ingesting heavy data (e.g., from security tools)

✅ Audit du/df regularly on hot nodes

✅ Monitor shard size vs policy limits

✅ Avoid manual index movements that bypass ILM
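On the max_headroom point above: in Elasticsearch 8.x, the percentage watermarks can be paired with max_headroom settings that cap the absolute free space required on large disks. A sketch (values are illustrative, not recommendations):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.high.max_headroom": "100GB",
    "cluster.routing.allocation.disk.watermark.flood_stage.max_headroom": "50GB"
  }
}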

How Hyperflex Helps: Elastic Expertise on Demand

At Hyperflex, we’ve helped dozens of Elastic users and teams diagnose and resolve cluster-level issues like this.

We specialize in:

  • 🔧 ILM audits & policy enforcement
  • 🔄 Hot/warm/cold/frozen tier optimization
  • 🚑 Emergency troubleshooting
  • 📈 Shard sizing strategies
  • 🛡️ Ongoing Elasticsearch Consulting Services

Need expert help diagnosing your cluster? We’re just a call away.

Final Takeaway

Disk usage issues on hot nodes often require deeper analysis than surface-level cluster health stats. They can be caused by leftover index folders, delayed rollovers, or operational blind spots.

With the right tools and experience, these problems are fixable—and preventable.

Hyperflex helps teams scale Elastic fast—with confidence.

Contact us to explore how we can support your Elastic journey.