Monitoring and Diagnostics: _cat, _cluster, _nodes | Free Elasticsearch Course: From Zero to Expert

Introduction: _cat API, _cluster/health, _nodes/stats, and Hot Threads

Think of a car's dashboard. Speed, RPM, fuel, engine temperature, oil pressure: all visible at a glance. A red light means something is wrong. But when the check-engine light comes on, "the engine has a fault" is not enough information. You go to a mechanic, have an OBD scanner plugged in, and read the detailed fault code. You cannot fix a problem without knowing where it is.

Monitoring in Elasticsearch follows the same logic. _cluster/health is your dashboard: green, yellow, or red at a glance. The _cat APIs are quick status reports: which index takes up how much space, which shard lives where. _nodes/stats is the OBD scanner: it reports CPU, memory, disk, and GC details. And hot_threads is the mechanic listening to the engine: it shows exactly which thread is doing what.

1. _cluster/health: Cluster Health Status

Basic Usage

GET _cluster/health

// Sample response:
{
  "cluster_name": "production-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 3,
  "active_primary_shards": 150,
  "active_shards": 280,
  "relocating_shards": 2,
  "initializing_shards": 0,
  "unassigned_shards": 20,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 93.3
}

Status Colors

Color	Meaning	Urgency
🟢 Green	All primary and replica shards are assigned	Normal
🟡 Yellow	All primary shards are assigned, but some replicas are not	Warning
🔴 Red	Some primary shards are unassigned: part of the data is unavailable and data loss is possible	Critical
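
The active_shards_percent_as_number value in the sample response can be reproduced by hand. The sketch below is a minimal illustration, assuming the percentage is the number of active shards divided by all shard copies the cluster expects (active + initializing + unassigned). The class and method names are hypothetical and not part of any Elasticsearch client.

```java
import java.util.Locale;

/** Worked arithmetic for the sample _cluster/health response (hypothetical helper). */
final class ShardHealthMath {

    // Assumption: percent = active / (active + initializing + unassigned).
    static double activeShardsPercent(int active, int initializing, int unassigned) {
        int total = active + initializing + unassigned;
        if (total == 0) {
            return 100.0; // nothing is expected, so nothing is missing
        }
        return 100.0 * active / total;
    }

    public static void main(String[] args) {
        // Values from the sample response: 280 active, 0 initializing, 20 unassigned
        double percent = activeShardsPercent(280, 0, 20);
        System.out.println(String.format(Locale.ROOT, "%.1f", percent)); // prints 93.3
    }
}
```

With 20 of 300 expected shard copies unassigned, about 6.7% of the copies are missing, which matches the yellow status in the sample as long as all of them are replicas.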

Detailed Health (Per Index)

// Health of a specific index, broken down by shard
GET _cluster/health/products?level=shards

// Whole cluster, at index level
GET _cluster/health?level=indices

// Return only the status field of each index (all indices are listed, not just yellow/red ones)
GET _cluster/health?level=indices&filter_path=indices.*.status

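Waiting for a Health Status

// Wait until the status is yellow or better (timeout 30s)
GET _cluster/health?wait_for_status=yellow&timeout=30s

// Wait until the status is green
GET _cluster/health?wait_for_status=green&timeout=60s

This endpoint is useful in deployment scripts: the request blocks until the cluster reaches the requested status or the timeout expires. If the timeout expires first, the response contains "timed_out": true, so the script should check that field.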

2. _cat API: Quick Status Reports

The _cat (Compact and Aligned Text) APIs return terminal-friendly, human-readable output: tables instead of JSON. They are ideal for quick checks.

_cat/indices: Index List

GET _cat/indices?v&s=store.size:desc

// Output:
// health status index              pri rep docs.count store.size
// green  open   logs-2024.01       5   1   25000000   45.2gb
// green  open   logs-2024.02       5   1   22000000   41.8gb
// yellow open   products           3   1   500000     2.1gb
// green  open   users              1   1   100000     150mb

Frequently used parameters:

// Only specific columns
GET _cat/indices?v&h=index,docs.count,store.size,health

// Only indices matching a pattern
GET _cat/indices/logs-*?v&s=index

// Report sizes in megabytes
GET _cat/indices?v&bytes=mb
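
When the tabular output is consumed by a small tool instead of a person, the header row produced by ?v can be used to locate columns by name. The following sketch is only an illustration under that assumption: it takes the text of a _cat/indices?v response as a string and returns the indices whose health is not green. The class name CatIndicesFilter is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Filters the text output of _cat/indices?v (hypothetical helper). */
final class CatIndicesFilter {

    /** Returns the names of indices whose health column is not "green". */
    static List<String> nonGreenIndices(String catOutput) {
        String[] lines = catOutput.trim().split("\\R");
        // With ?v the first line is the header; find the columns by name.
        List<String> header = Arrays.asList(lines[0].trim().split("\\s+"));
        int healthCol = header.indexOf("health");
        int indexCol = header.indexOf("index");
        if (healthCol < 0 || indexCol < 0) {
            throw new IllegalArgumentException("expected 'health' and 'index' columns");
        }
        List<String> result = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String[] cells = lines[i].trim().split("\\s+");
            if (cells.length <= Math.max(healthCol, indexCol)) {
                continue; // skip blank or truncated rows
            }
            if (!"green".equals(cells[healthCol])) {
                result.add(cells[indexCol]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample =
              "health status index        pri rep docs.count store.size\n"
            + "green  open   logs-2024.01 5   1   25000000   45.2gb\n"
            + "yellow open   products     3   1   500000     2.1gb\n";
        System.out.println(nonGreenIndices(sample)); // prints [products]
    }
}
```

Splitting on whitespace works for the columns shown here; for anything beyond quick checks, requesting the same data as JSON is likely the more robust choice.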

_cat/shards: Shard Distribution

GET _cat/shards?v&s=store:desc

// Output:
// index          shard prirep state   docs  store node
// logs-2024.01   0     p      STARTED 5000K 9.2gb node-1
// logs-2024.01   0     r      STARTED 5000K 9.2gb node-3
// logs-2024.01   1     p      STARTED 5100K 9.5gb node-2
// products       0     p      STARTED 250K  1.1gb node-1
// products       0     r      UNASSIGNED

// Show each shard's state and unassigned reason, sorted by state
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state

// List shards sorted by node
GET _cat/shards?v&h=index,shard,prirep,state,store,node&s=node

_cat/nodes: Node Status

GET _cat/nodes?v&h=name,ip,role,heap.percent,ram.percent,cpu,load_1m,disk.used_percent,master

// Output:
// name    ip           role  heap.percent ram.percent cpu load_1m disk.used_percent master
// node-1  10.0.1.10    cdfhimrstw  65    85         45  2.30    72                *
// node-2  10.0.1.11    cdfhimrstw  58    80         38  1.85    68
// node-3  10.0.1.12    cdfhimrstw  72    88         52  3.10    75

Role codes:

Code	Role
`m`	master-eligible
`d`	data
`i`	ingest
`r`	remote cluster client
`s`	data_content
`h`	data_hot
`w`	data_warm
`c`	data_cold
`f`	data_frozen
`t`	transform
`-`	coordinating only (no other roles)
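
A role string such as cdfhimrstw is easier to read once it is expanded. The sketch below shows one way to do that, assuming the single-letter abbreviations that the _cat/nodes documentation lists for the role column. NodeRoleDecoder is a hypothetical helper, and the map may need updating for the Elasticsearch version in use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Expands the role column of _cat/nodes into role names (hypothetical helper). */
final class NodeRoleDecoder {

    // Assumption: abbreviations as documented for the _cat/nodes role column.
    private static final Map<Character, String> ROLES = new LinkedHashMap<>();
    static {
        ROLES.put('c', "data_cold");
        ROLES.put('d', "data");
        ROLES.put('f', "data_frozen");
        ROLES.put('h', "data_hot");
        ROLES.put('i', "ingest");
        ROLES.put('l', "ml");
        ROLES.put('m', "master");
        ROLES.put('r', "remote_cluster_client");
        ROLES.put('s', "data_content");
        ROLES.put('t', "transform");
        ROLES.put('v', "voting_only");
        ROLES.put('w', "data_warm");
    }

    static List<String> decode(String roleColumn) {
        List<String> names = new ArrayList<>();
        if ("-".equals(roleColumn)) {
            names.add("coordinating_only"); // a node with no roles at all
            return names;
        }
        for (char code : roleColumn.toCharArray()) {
            names.add(ROLES.getOrDefault(code, "unknown(" + code + ")"));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(decode("dim")); // prints [data, ingest, master]
        System.out.println(decode("-"));   // prints [coordinating_only]
    }
}
```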

_cat/allocation: Disk Allocation

GET _cat/allocation?v

// Output:
// shards disk.indices disk.used disk.avail disk.total disk.percent host       node
// 85     120gb        180gb     320gb      500gb      36           10.0.1.10  node-1
// 78     110gb        165gb     335gb      500gb      33           10.0.1.11  node-2
// 87     125gb        190gb     310gb      500gb      38           10.0.1.12  node-3

_cat/tasks: Running Tasks

GET _cat/tasks?v&detailed

// Longest-running tasks first
GET _cat/tasks?v&h=action,type,running_time,node&s=running_time:desc

_cat/recovery: Shard Recovery Status

GET _cat/recovery?v&active_only&h=index,shard,time,type,stage,source_node,target_node,bytes_percent

// Output:
// index        shard time  type      stage source_node target_node bytes_percent
// logs-2024.01 2     3.5m  peer      done  node-1      node-3      100%
// products     0     45s   peer      index node-2      node-1      67.3%

_cat/thread_pool: Thread Pool Status

GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected,completed

// Important thread pools:
// - search: search requests
// - write: write operations (index, update, delete, bulk)
// - get: get-by-ID requests
// - management: cluster management

// Only the search and write pools
GET _cat/thread_pool/search,write?v

The rejected column is a cumulative counter since the node started. If it keeps growing, the cluster cannot keep up with the load: reduce the request rate or scale out.
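
Whether the rejected counter is growing can only be judged by comparing two readings taken some time apart. The following sketch illustrates that comparison, assuming the counter is cumulative per node and starts again from zero when the node restarts. RejectionRate and its parameters are hypothetical names, and the readings are made-up numbers.

```java
/** Turns two readings of a cumulative rejected counter into a rate (hypothetical helper). */
final class RejectionRate {

    /** Rejections per second between an earlier and a later reading. */
    static double perSecond(long earlier, long later, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        long delta = later - earlier;
        if (delta < 0) {
            // Assumption: a smaller value means the node restarted and the counter reset.
            delta = later;
        }
        return delta * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Illustrative readings taken 30 seconds apart
        System.out.println(perSecond(120, 180, 30_000)); // prints 2.0
    }
}
```

A rate of zero over several intervals suggests that old rejections were a past burst; a steadily positive rate points to sustained overload.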

3. _nodes/stats: Detailed Node Statistics

The _cat APIs were for a quick look. _nodes/stats is for in-depth analysis.

Basic Usage

// All statistics for all nodes (large response!)
GET _nodes/stats

// Specific metrics
GET _nodes/stats/jvm,os,fs,indices

// Specific node
GET _nodes/node-1/stats/jvm

JVM Statistics

GET _nodes/stats/jvm

// Key fields:
{
  "nodes": {
    "node-id-1": {
      "jvm": {
        "mem": {
          "heap_used_in_bytes": 4294967296,
          "heap_used_percent": 66,
          "heap_max_in_bytes": 6442450944
        },
        "gc": {
          "collectors": {
            "young": {
              "collection_count": 15234,
              "collection_time_in_millis": 125000
            },
            "old": {
              "collection_count": 12,
              "collection_time_in_millis": 8500
            }
          }
        }
      }
    }
  }
}

Critical JVM metrics:

Metric	Healthy Range	Alarm
`heap_used_percent`	< 75%	> 85%
Young GC time	< 50 ms per collection	> 100 ms
Old GC time	< 1 s per collection	> 5 s
Old GC frequency	< 1 per minute	> 5 per minute

💡 Tip: If heap_used_percent stays above 85%, increase the heap or add more nodes. But do not set the heap above 31 GB: beyond roughly 32 GB the JVM disables compressed oops (compressed object pointers), every object reference doubles in size, and memory efficiency drops.
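
The GC fields in the response are totals, so the per-collection figures in the table have to be derived. The sketch below shows that arithmetic with the numbers from the sample response and applies the heap thresholds from the table. It is an illustration only: GcPauseCheck is a hypothetical name, and the averages cover the whole JVM uptime, so recent pauses may differ.

```java
import java.util.Locale;

/** Derives per-collection GC pause averages and a heap rating (hypothetical helper). */
final class GcPauseCheck {

    /** Average pause per collection, from the cumulative count and time. */
    static double averagePauseMillis(long collectionCount, long collectionTimeMillis) {
        return collectionCount == 0 ? 0.0 : (double) collectionTimeMillis / collectionCount;
    }

    /** Rates heap_used_percent against the thresholds in the table above. */
    static String classifyHeap(int heapUsedPercent) {
        if (heapUsedPercent > 85) {
            return "alarm";
        }
        if (heapUsedPercent < 75) {
            return "healthy";
        }
        return "watch"; // between the healthy range and the alarm threshold
    }

    public static void main(String[] args) {
        // Values from the sample response
        double young = averagePauseMillis(15234, 125000);
        double old = averagePauseMillis(12, 8500);
        System.out.println(String.format(Locale.ROOT, "young=%.1fms old=%.1fms", young, old));
        // prints young=8.2ms old=708.3ms
        System.out.println(classifyHeap(66)); // prints healthy
    }
}
```

Both sample averages fall inside the healthy ranges: about 8 ms per young collection and about 0.7 s per old collection.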

OS Statistics

GET _nodes/stats/os

// Key fields:
{
  "os": {
    "cpu": {
      "percent": 45,
      "load_average": {
        "1m": 2.30,
        "5m": 1.85,
        "15m": 1.50
      }
    },
    "mem": {
      "total_in_bytes": 34359738368,
      "free_in_bytes": 5368709120,
      "used_in_bytes": 28991029248,
      "free_percent": 16,
      "used_percent": 84
    },
    "swap": {
      "total_in_bytes": 0,
      "free_in_bytes": 0,
      "used_in_bytes": 0
    }
  }
}

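⚠️ Warning: Swap should be disabled on Elasticsearch nodes! When parts of the heap are swapped out to disk, garbage collection pauses can stretch from milliseconds to minutes, and the node may become unresponsive and drop out of the cluster.

# Disable swap until the next reboot
sudo swapoff -a

# Or lock the process memory in elasticsearch.yml
bootstrap.memory_lock: true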

Filesystem Statistics

GET _nodes/stats/fs

// Key fields:
{
  "fs": {
    "total": {
      "total_in_bytes": 536870912000,
      "free_in_bytes": 214748364800,
      "available_in_bytes": 190748364800
    },
    "data": [
      {
        "path": "/var/data/elasticsearch",
        "total_in_bytes": 536870912000,
        "free_in_bytes": 214748364800,
        "available_in_bytes": 190748364800
      }
    ]
  }
}

Disk watermarks:

Watermark	Default	Effect
Low	85%	No new shards are allocated to the node
High	90%	Shards are relocated away from the node
Flood stage	95%	Indices with a shard on the node become read-only (writes are blocked, deletes are still allowed)
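
To see where a node stands relative to these thresholds, the used percentage can be computed from total_in_bytes and available_in_bytes. The sketch below is a minimal illustration that assumes the default thresholds of 85%, 90%, and 95%; clusters with customized watermarks would need different constants. DiskWatermarkCheck is a hypothetical name.

```java
import java.util.Locale;

/** Compares disk usage with the default watermarks (hypothetical helper). */
final class DiskWatermarkCheck {

    static double usedPercent(long totalBytes, long availableBytes) {
        if (totalBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be positive");
        }
        return 100.0 * (totalBytes - availableBytes) / totalBytes;
    }

    // Assumption: default thresholds of 85% (low), 90% (high), 95% (flood stage).
    static String stage(double usedPercent) {
        if (usedPercent >= 95.0) {
            return "flood_stage";
        }
        if (usedPercent >= 90.0) {
            return "high";
        }
        if (usedPercent >= 85.0) {
            return "low";
        }
        return "ok";
    }

    public static void main(String[] args) {
        // Values from the sample fs response
        double used = usedPercent(536870912000L, 190748364800L);
        System.out.println(String.format(Locale.ROOT, "%.1f%% -> %s", used, stage(used)));
        // prints 64.5% -> ok
    }
}
```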

Index Statistics

GET _nodes/stats/indices/search,indexing,merge

// Search, indexing, and merge performance:
{
  "indices": {
    "search": {
      "query_total": 1500000,
      "query_time_in_millis": 75000000,
      "query_current": 5,
      "fetch_total": 1500000,
      "fetch_time_in_millis": 15000000,
      "fetch_current": 2
    },
    "indexing": {
      "index_total": 5000000,
      "index_time_in_millis": 250000,
      "index_current": 10,
      "index_failed": 25
    },
    "merges": {
      "current": 2,
      "current_docs": 500000,
      "total_size_in_bytes": 2147483648,
      "total_time_in_millis": 3600000
    }
  }
}
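
The counters in this response are cumulative, so dividing time by count gives a lifetime average that can hide a recent slowdown. The sketch below contrasts the lifetime average with an average over the window between two readings. The first reading uses the sample values above; the second reading is a made-up example, and SearchLatency is a hypothetical helper.

```java
/** Average query latency from cumulative search counters (hypothetical helper). */
final class SearchLatency {

    /** Average over everything since the node started. */
    static double lifetimeAverageMillis(long queryTotal, long queryTimeMillis) {
        return queryTotal == 0 ? 0.0 : (double) queryTimeMillis / queryTotal;
    }

    /** Average over the queries that ran between two readings. */
    static double windowAverageMillis(long prevTotal, long prevTimeMillis,
                                      long currTotal, long currTimeMillis) {
        long queries = currTotal - prevTotal;
        if (queries <= 0) {
            return 0.0; // no new queries, or the counters were reset
        }
        return (double) (currTimeMillis - prevTimeMillis) / queries;
    }

    public static void main(String[] args) {
        // First reading: the sample response
        System.out.println(lifetimeAverageMillis(1_500_000, 75_000_000)); // prints 50.0
        // Second reading: illustrative numbers, 1000 queries later
        System.out.println(windowAverageMillis(1_500_000, 75_000_000,
                                               1_501_000, 75_200_000)); // prints 200.0
    }
}
```

In this made-up case the lifetime average barely moves from 50 ms while the most recent queries average 200 ms, which is why sampling deltas is usually more informative than reading the totals once.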

4. _cluster/stats: Cluster-Wide Statistics

GET _cluster/stats

// Key fields:
{
  "indices": {
    "count": 45,
    "shards": {
      "total": 250,
      "primaries": 125
    },
    "docs": {
      "count": 150000000,
      "deleted": 5000000
    },
    "store": {
      "size_in_bytes": 536870912000
    }
  },
  "nodes": {
    "count": {
      "total": 5,
      "data": 3,
      "master": 3,
      "ingest": 2,
      "coordinating_only": 1
    },
    "jvm": {
      "max_uptime_in_millis": 8640000000,
      "mem": {
        "heap_used_in_bytes": 12884901888,
        "heap_max_in_bytes": 19327352832
      }
    }
  }
}

5. Hot Threads: Diagnosing CPU Problems

When the cluster slows down, this answers the question "what is eating the CPU?":

GET _nodes/hot_threads

// Specific node
GET _nodes/node-1/hot_threads

// Parameters
GET _nodes/hot_threads?threads=5&interval=500ms&type=cpu

Reading the Hot Threads Output

::: {node-1}{abc123}{10.0.1.10}{10.0.1.10:9300}
   Hot threads at 2024-01-15T14:30:00.000Z, interval=500ms, busiestThreads=3, ignoreIdleSince=-1:
   
   98.5% (492.7ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#3]'
     10/10 snapshots sharing following 15 elements
       java.base/sun.misc.Unsafe.park(Native Method)
       org.apache.lucene.search.BooleanScorer.score(BooleanScorer.java:253)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
       org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:192)
       ...

From this output we can see:

The search thread was on the CPU for 98.5% of the 500 ms sampling interval
BooleanScorer is doing heavy work, most likely an expensive bool query
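
When many nodes report hot threads, it can help to pull the percentage and the thread pool name out of each thread header line. The sketch below does this with a regular expression. It assumes the header format shown in the sample output above, which may differ between Elasticsearch versions; HotThreadHeader is a hypothetical helper.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts the key fields from a hot threads header line (hypothetical helper). */
final class HotThreadHeader {

    // Assumption: header lines look like the one in the sample output, e.g.
    // 98.5% (492.7ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#3]'
    private static final Pattern HEADER = Pattern.compile(
        "([0-9.]+)% \\(.*?\\) (\\w+) usage by thread 'elasticsearch\\[([^\\]]+)\\]\\[([^\\]]+)\\]");

    /** Returns {percent, type, node, pool}, or null if the line is not a thread header. */
    static String[] parse(String line) {
        Matcher m = HEADER.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String line = "   98.5% (492.7ms out of 500ms) cpu usage by thread "
            + "'elasticsearch[node-1][search][T#3]'";
        String[] parts = parse(line);
        System.out.println(parts[0] + "% " + parts[1] + " on " + parts[2] + ", pool " + parts[3]);
        // prints 98.5% cpu on node-1, pool search
    }
}
```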

Type Parameter

// CPU-intensive threads
GET _nodes/hot_threads?type=cpu

// Waiting threads (possibly waiting on I/O)
GET _nodes/hot_threads?type=wait

// Blocked threads (lock contention)
GET _nodes/hot_threads?type=block

6. Checking Cluster Settings

Viewing Current Settings

// All cluster settings, including defaults
GET _cluster/settings?include_defaults=true&flat_settings=true

// Only explicitly configured settings
GET _cluster/settings?flat_settings=true

Changing Settings Dynamically

// Transient (lost after a full cluster restart)
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all",
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}

// Persistent (survives restarts)
PUT _cluster/settings
{
  "persistent": {
    "action.destructive_requires_name": true,
    "cluster.max_shards_per_node": 1500
  }
}

💡 Tip: Keep action.destructive_requires_name: true enabled at all times. It prevents accidentally deleting all indices with a request such as DELETE *.

7. Pending Tasks and Task Management

Pending Tasks

Cluster state update tasks queued for processing by the master node:

GET _cluster/pending_tasks

// Response:
{
  "tasks": [
    {
      "insert_order": 101,
      "priority": "URGENT",
      "source": "create-index [logs-2024.01-000042]",
      "executing": true,
      "time_in_queue_millis": 50,
      "time_in_queue": "50ms"
    }
  ]
}

If pending tasks pile up, the master node cannot keep up. Typical causes are excessive index creation, frequent mapping changes, or heavy shard reallocation.

Task Management API

// All running tasks
GET _tasks

// Detailed view of a specific action
GET _tasks?detailed=true&actions=*reindex

// Group child tasks under their parent tasks (check running_time_in_nanos to spot long-running ones)
GET _tasks?detailed=true&group_by=parents

// Cancel a task (the ID has the form <node_id>:<task_number>)
POST _tasks/node-1:12345/_cancel

8. Index-Level Diagnostics

Index Stats

GET products/_stats

// Specific metrics
GET products/_stats/search,indexing,merge,refresh,flush,segments

// Key output fields:
{
  "_all": {
    "primaries": {
      "search": {
        "query_total": 500000,
        "query_time_in_millis": 25000000
      },
      "segments": {
        "count": 45,
        "memory_in_bytes": 52428800,
        "max_unsafe_auto_id_timestamp": -1
      }
    }
  }
}

Segment Information

GET _cat/segments/products?v&h=index,shard,segment,generation,docs.count,size,compound

// Output:
// index    shard segment generation docs.count size    compound
// products 0     _0      0          250000     1.1gb   false
// products 0     _1      1          50000      220mb   false
// products 0     _2      2          5000       22mb    true

Shard Store Status

// On-disk state of the shard copies
GET products/_shard_stores

// Only shards with an unassigned primary (red) or at least one unassigned replica (yellow)
GET products/_shard_stores?status=red,yellow

9. Monitoring with Java

Cluster Health Check

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.HealthStatus;
import co.elastic.clients.elasticsearch.cluster.HealthResponse;

public class ClusterMonitor {
    private final ElasticsearchClient client;

    public ClusterMonitor(ElasticsearchClient client) {
        this.client = client;
    }

    public void checkHealth() throws Exception {
        HealthResponse health = client.cluster().health();

        System.out.println("Cluster: " + health.clusterName());
        System.out.println("Status: " + health.status());
        System.out.println("Nodes: " + health.numberOfNodes());
        System.out.println("Active Shards: " + health.activeShards());
        System.out.println("Unassigned: " + health.unassignedShards());
        System.out.println("Pending Tasks: " + health.numberOfPendingTasks());

        if (health.status() == HealthStatus.Red) {
            System.err.println("🔴 CRITICAL: Cluster is RED!");
            System.err.println("Unassigned shards: " + health.unassignedShards());
        } else if (health.status() == HealthStatus.Yellow) {
            System.out.println("🟡 WARNING: Some replicas unassigned");
        } else {
            System.out.println("🟢 OK: Cluster is healthy");
        }
    }

    public void waitForGreen(int timeoutSeconds) throws Exception {
        HealthResponse health = client.cluster().health(h -> h
            .waitForStatus(HealthStatus.Green)
            .timeout(t -> t.time(timeoutSeconds + "s"))
        );

        if (health.timedOut()) {
            System.err.println("Timeout! Cluster did not reach green status");
        }
    }
}

Node Stats Dashboard

import co.elastic.clients.elasticsearch.nodes.NodesStatsResponse;

// Add these methods to the ClusterMonitor class above; they use its client field.
public void printNodeDashboard() throws Exception {
    NodesStatsResponse stats = client.nodes().stats(s -> s
        .metric("jvm", "os", "fs", "indices")
    );

    stats.nodes().forEach((nodeId, nodeStats) -> {
        System.out.println("=== " + nodeStats.name() + " ===");

        // JVM
        var jvm = nodeStats.jvm();
        if (jvm != null && jvm.mem() != null) {
            System.out.printf("  Heap: %d%% (%s / %s)%n",
                jvm.mem().heapUsedPercent(),
                formatBytes(jvm.mem().heapUsedInBytes()),
                formatBytes(jvm.mem().heapMaxInBytes()));
        }

        // OS
        var os = nodeStats.os();
        if (os != null && os.cpu() != null) {
            System.out.printf("  CPU: %d%%  Load: %.2f%n",
                os.cpu().percent(),
                os.cpu().loadAverage().get("1m"));
        }

        // Disk
        var fs = nodeStats.fs();
        if (fs != null && fs.total() != null) {
            long total = fs.total().totalInBytes();
            long avail = fs.total().availableInBytes();
            long usedPercent = ((total - avail) * 100) / total;
            System.out.printf("  Disk: %d%% used (%s available)%n",
                usedPercent, formatBytes(avail));
        }

        // Search stats
        var indices = nodeStats.indices();
        if (indices != null && indices.search() != null) {
            var search = indices.search();
            long avgQueryTime = search.queryTotal() > 0
                ? search.queryTimeInMillis() / search.queryTotal()
                : 0;
            System.out.printf("  Search: %d queries, avg %dms%n",
                search.queryTotal(), avgQueryTime);
        }
    });
}

private String formatBytes(long bytes) {
    if (bytes < 1024) return bytes + "B";
    if (bytes < 1048576) return (bytes / 1024) + "KB";
    if (bytes < 1073741824) return (bytes / 1048576) + "MB";
    return String.format("%.1fGB", bytes / 1073741824.0);
}

10. Monitoring Stack (Elastic Stack Monitoring)

Setting Up Self-Monitoring

// Write monitoring data to the local cluster
PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true,
    "xpack.monitoring.collection.interval": "10s",
    "xpack.monitoring.elasticsearch.collection.enabled": true
  }
}

Separate Monitoring Cluster (Recommended)

Sending the production cluster's monitoring data to a separate cluster is a best practice:

# elasticsearch.yml (production node)
xpack.monitoring.exporters:
  monitoring_cluster:
    type: http
    host: ["https://monitoring-es:9200"]
    auth.username: monitoring_user
    auth.password: changeme
    ssl.certificate_authorities: ["/path/to/ca.crt"]

Monitoring with Metricbeat (Newer Method)

In Elastic Stack 7.x and later, collecting monitoring data with Metricbeat is the recommended approach.

# metricbeat.yml
metricbeat.modules:
- module: elasticsearch
  metricsets:
    - node
    - node_stats
    - index
    - index_recovery
    - index_summary
    - shard
    - cluster_stats
  period: 10s
  hosts: ["https://localhost:9200"]
  username: "monitoring_user"
  password: "changeme"
  ssl.certificate_authorities: ["/path/to/ca.crt"]

output.elasticsearch:
  hosts: ["https://monitoring-cluster:9200"]
  username: "monitoring_writer"
  password: "changeme"

11. Alerting: Automated Notifications

Alerting with Watcher (Built into Elasticsearch)

PUT _watcher/watch/cluster-health-alert
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9200,
        "path": "/_cluster/health",
        "scheme": "https",
        "auth": {
          "basic": {
            "username": "elastic",
            "password": "changeme"
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.status": {
        "eq": "red"
      }
    }
  },
  "actions": {
    "notify_ops": {
      "webhook": {
        "method": "POST",
        "url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
        "body": "{\"text\": \"🔴 ALERT: Elasticsearch cluster is RED! Unassigned shards: {{ctx.payload.unassigned_shards}}\"}"
      }
    }
  }
}

Disk Watermark Alert

PUT _watcher/watch/disk-space-alert
{
  "trigger": {
    "schedule": { "interval": "5m" }
  },
  "input": {
    "http": {
      "request": {
        "host": "localhost",
        "port": 9200,
        "path": "/_nodes/stats/fs",
        "scheme": "https",
        "auth": {
          "basic": {
            "username": "elastic",
            "password": "changeme"
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": """
        for (entry in ctx.payload.nodes.entrySet()) {
          def fs = entry.getValue().fs.total;
          def usedPercent = ((fs.total_in_bytes - fs.available_in_bytes) * 100) / fs.total_in_bytes;
          if (usedPercent > 80) return true;
        }
        return false;
      """
    }
  },
  "actions": {
    "notify_ops": {
      "webhook": {
        "method": "POST",
        "url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
        "body": "{\"text\": \"⚠️ Disk usage above 80% on one or more nodes!\"}"
      }
    }
  }
}

12. Best Practices

✅ Do

Topic	Recommendation
`_cluster/health`	Check after every deploy
Slow log	Always enabled in production
Monitoring cluster	Keep it separate from the production cluster
Disk watermark	Alert at 80%, take action at 85%
Heap monitoring	Alarm above 75%, critical above 85%
Hot threads	Check during CPU spikes
`_cat` APIs	For quick daily checks
Alerting	Automated alerts for cluster health, disk, and heap

❌ Don't

Topic	Reason
Storing monitoring data in the same cluster	When the cluster has problems, monitoring goes down with it
Ignoring yellow status	Without replicas, losing a node makes shards unavailable and can lose data
Leaving swap enabled	GC thrashing and node timeouts
Heap > 31 GB	Compressed oops are disabled and memory is used inefficiently
No alerting	The cluster goes RED at 3 a.m. and you find out in the morning

13. Common Errors and Their Solutions

Error 1: "Cluster health is RED, but why?"

# Step 1: Which index is red?
GET _cluster/health?level=indices&filter_path=indices.*.status

# Step 2: Which shards are unassigned?
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state:asc

# Step 3: Why are they unassigned?
GET _cluster/allocation/explain
{
  "index": "problematic-index",
  "shard": 0,
  "primary": true
}

Error 2: "Heap usage is constantly high"

# Step 1: GC metrics
GET _nodes/stats/jvm?filter_path=nodes.*.jvm.gc

# Step 2: Which indices use the most segment memory?
GET _cat/indices?v&h=index,segments.count,segments.memory&s=segments.memory:desc

# Step 3: Fielddata usage
GET _nodes/stats/indices/fielddata?fields=*

# Short-term fix: clear the fielddata cache (it fills up again if the same queries keep running)
POST _cache/clear?fielddata=true

Error 3: "Thread pool rejected"

# Problem: write or search requests are being rejected
GET _cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected

# If rejected keeps increasing:
# 1. Reduce the write rate (throttle)
# 2. Increase the queue size (with care: larger queues consume memory)
# 3. Add nodes (the proper fix)

Error 4: "Disk watermark flood stage"

# Problem: the index has become read-only
# "cluster_block_exception: index [products] blocked by: [FORBIDDEN/12/index read-only / allow delete]"

# Solution 1: Free up space
# (if action.destructive_requires_name is true, wildcards are rejected; list the index names explicitly)
DELETE old-unnecessary-index-*

# Solution 2: Remove the block (after freeing up space)
PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}

# Solution 3: Temporarily raise the watermark threshold
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}

14. Quick Reference: Daily Check Commands

# Morning check: cluster status in 5 commands
# 1. Cluster health
curl -s 'localhost:9200/_cluster/health?pretty' | jq '.status,.unassigned_shards'

# 2. Node status
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu,disk.used_percent'

# 3. Index status
curl -s 'localhost:9200/_cat/indices?v&h=health,index,docs.count,store.size&s=store.size:desc' | head -20

# 4. Thread pool rejections
curl -s 'localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,rejected'

# 5. Pending tasks
curl -s 'localhost:9200/_cluster/pending_tasks?pretty'

Summary

`_cluster/health` is your cluster's dashboard: Green/Yellow/Red shows the state at a glance. Check it after every deploy and every morning.
The `_cat` APIs are quick, terminal-friendly status reports: _cat/indices, _cat/shards, _cat/nodes, and _cat/allocation are the most frequently used.
`_nodes/stats` is the in-depth analysis tool: JVM heap, GC, CPU, disk, and search/indexing statistics. It helps you find where the problem is.
Hot threads is for diagnosing CPU problems: it shows what each busy thread is doing. It is the first place to look during CPU spikes.
Monitoring data should live in a separate cluster, so that monitoring is not affected when the production cluster has problems.
Alerting is a must: set up automated alerts for cluster health, disk usage, heap usage, and thread pool rejections. An alarm going off at 3 a.m. is better than finding out at 9 a.m.