Silky Microservice Framework

Endpoint Health Monitor

Overview

Silky continuously tracks the health status of each RPC endpoint (ISilkyEndpoint) at runtime. When an endpoint's consecutive communication failures exceed a configured threshold, the framework automatically removes it from the available endpoint list, preventing the load balancer from routing further requests to the unhealthy instance.


Core Components

| Component | Type | Responsibility |
| --- | --- | --- |
| IRpcEndpointMonitor / DefaultRpcEndpointMonitor | Singleton | Maintains health state (IsEnable + UnHealthTimes) for all endpoints; emits state-change events |
| IInvokeMonitor / DefaultInvokeMonitor | Singleton (when monitoring enabled) | Tracks per-call success/failure on the client side; updates IRpcEndpointMonitor |
| IServerHandleMonitor / DefaultServerHandleMonitor | Singleton (when monitoring enabled) | Tracks per-request success/failure on the server side |
| IRpcEndpointSelector implementations | Singleton | Subscribe to endpoint change events; invalidate cached endpoint lists when endpoints are removed |

Endpoint Health State

DefaultRpcEndpointMonitor maintains a concurrent dictionary keyed by endpoint:

```csharp
private ConcurrentDictionary<ISilkyEndpoint, CheckModel> m_checkEndpoints = new();

private class CheckModel
{
    public bool IsEnable { get; set; }      // Whether the endpoint is currently available
    public int UnHealthTimes { get; set; }  // Consecutive unhealthy count
}
```

Endpoints are added with IsEnable = true, UnHealthTimes = 0 when first registered, triggering an OnAddMonitor event.


Complete Failure → Removal Chain

Step 1: RPC Call Fails

DefaultRemoteCaller.InvokeAsync() calls InvokeMonitor.ExecFail() in the catch block:

```csharp
catch (Exception ex)
{
    invokeMonitor?.ExecFail(
        (remoteInvokeMessage.ServiceEntryId, selectedRpcEndpoint),
        elapsedMs,
        clientInvokeInfo);
    throw;
}
```

Step 2: InvokeMonitor Updates Statistics and Changes Endpoint State

DefaultInvokeMonitor.ExecFail() does two things:

  1. Updates instance-level statistics (FaultInvokeCount++, ConcurrentCount--, AET recalculation)
  2. Calls IRpcEndpointMonitor.ChangeStatus(endpoint, isEnable: false, unHealthCeilingTimes) to mark the endpoint unhealthy
```csharp
public void ExecFail((string, ISilkyEndpoint) item, double elapsedMs, ClientInvokeInfo info)
{
    // Update instance stats
    lock (_monitorProvider.InstanceInvokeInfo)
    {
        _monitorProvider.InstanceInvokeInfo.FaultInvokeCount++;
        _monitorProvider.InstanceInvokeInfo.ConcurrentCount--;
    }

    // Update endpoint health
    _rpcEndpointMonitor.ChangeStatus(item.Item2, isEnable: false,
        unHealthCeilingTimes: _governanceOptions.UnHealthAddressTimesAllowedBeforeRemoving);
}
```

Step 3: RpcEndpointMonitor Changes State and Emits Event

```csharp
public void ChangeStatus(ISilkyEndpoint endpoint, bool isEnable, int unHealthCeilingTimes = 0)
{
    var checkModel = m_checkEndpoints[endpoint];
    checkModel.IsEnable = isEnable;

    if (!isEnable)
    {
        checkModel.UnHealthTimes++;

        if (checkModel.UnHealthTimes >= unHealthCeilingTimes)
        {
            // Emit: endpoint is too unhealthy → remove from service list
            OnRemoveInvalidEndpoint?.Invoke(endpoint);
        }
        else
        {
            // Emit: endpoint is temporarily unhealthy → sleep for AddressFuseSleepDurationSeconds
            OnEndpointDisable?.Invoke(endpoint);
        }
    }
    else
    {
        checkModel.UnHealthTimes = 0;
        // Emit: endpoint recovered
        OnEndpointEnable?.Invoke(endpoint);
    }
}
```
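This branch logic can be exercised in isolation. The sketch below is a minimal, self-contained model of the monitor's state machine, not Silky's actual type: the endpoint is reduced to a string key and only the pieces discussed above are reproduced, while the event names follow the source.

```csharp
using System;
using System.Collections.Concurrent;

// Minimal model of the health state machine described above (illustrative only).
public class HealthMonitorSketch
{
    private class CheckModel
    {
        public bool IsEnable { get; set; } = true;
        public int UnHealthTimes { get; set; }
    }

    private readonly ConcurrentDictionary<string, CheckModel> m_checkEndpoints = new();

    public event Action<string>? OnRemoveInvalidEndpoint;
    public event Action<string>? OnEndpointDisable;
    public event Action<string>? OnEndpointEnable;

    public void ChangeStatus(string endpoint, bool isEnable, int unHealthCeilingTimes = 0)
    {
        var checkModel = m_checkEndpoints.GetOrAdd(endpoint, _ => new CheckModel());
        checkModel.IsEnable = isEnable;

        if (!isEnable)
        {
            checkModel.UnHealthTimes++;
            if (checkModel.UnHealthTimes >= unHealthCeilingTimes)
                OnRemoveInvalidEndpoint?.Invoke(endpoint); // too many consecutive failures: drop it
            else
                OnEndpointDisable?.Invoke(endpoint);       // fuse: temporarily disabled
        }
        else
        {
            checkModel.UnHealthTimes = 0;                  // a success resets the counter
            OnEndpointEnable?.Invoke(endpoint);
        }
    }
}
```

With unHealthCeilingTimes = 3, the first two failures fire OnEndpointDisable and the third fires OnRemoveInvalidEndpoint; a single success in between resets UnHealthTimes to 0, which is why only consecutive failures lead to removal.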

Step 4: Endpoint Selectors React to Events

Each IRpcEndpointSelector implementation subscribes to OnRemoveInvalidEndpoint and clears its cached endpoint list:

```csharp
// Example: PollingRpcEndpointSelector
_rpcEndpointMonitor.OnRemoveInvalidEndpoint += endpoint =>
{
    // Invalidate local cache for all service entries using this endpoint
    ClearEndpointCache(endpoint);
};
```

After invalidation, the next selection rebuilds the list from the current healthy endpoints.
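The selector side can be modeled the same way. The sketch below is an assumed, simplified cache, not the framework's actual fields: `ClearEndpointCache` drops every cached per-service-entry endpoint list that still references the removed endpoint, so the next selection is forced to rebuild from the healthy set.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Illustrative selector-side cache: serviceEntryId -> cached endpoint list.
public class SelectorCacheSketch
{
    private readonly ConcurrentDictionary<string, List<string>> _cache = new();

    public void CacheEndpoints(string serviceEntryId, IEnumerable<string> endpoints)
        => _cache[serviceEntryId] = endpoints.ToList();

    public bool TryGetCached(string serviceEntryId, out List<string>? endpoints)
        => _cache.TryGetValue(serviceEntryId, out endpoints);

    // What an OnRemoveInvalidEndpoint handler would do: invalidate every
    // cached list that contains the unhealthy endpoint.
    public void ClearEndpointCache(string endpoint)
    {
        foreach (var entry in _cache.Where(kv => kv.Value.Contains(endpoint)).ToList())
            _cache.TryRemove(entry.Key, out _);
    }
}
```

Invalidating lazily (clear now, rebuild on next select) keeps the event handler cheap and avoids holding a lock across registry lookups.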


Endpoint Recovery

When a previously unhealthy endpoint becomes reachable again (e.g., it restarts and re-registers with the registry):

  1. The registry change event propagates to all nodes
  2. IRpcEndpointMonitor.Monitor(endpoint) is called for the new/recovered endpoint
  3. IsEnable = true, UnHealthTimes = 0
  4. OnEndpointEnable event fires → selectors add the endpoint back to their pools

Monitoring Configuration

| Config | Default | Description |
| --- | --- | --- |
| Rpc:EnableMonitor | true | Enable RPC call monitoring |
| Rpc:CollectMonitorInfoIntervalSeconds | 30 | Interval for collecting and aggregating monitor data |
| Governance:AddressFuseSleepDurationSeconds | 60 | Seconds an endpoint is disabled before being retested |
| Governance:UnHealthAddressTimesAllowedBeforeRemoving | 3 | Consecutive unhealthy counts before removal |
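Expressed as an appsettings.json fragment, the keys above map to the following sections (values shown are the defaults from the table; the surrounding structure is the standard .NET configuration layout):

```json
{
  "Rpc": {
    "EnableMonitor": true,
    "CollectMonitorInfoIntervalSeconds": 30
  },
  "Governance": {
    "AddressFuseSleepDurationSeconds": 60,
    "UnHealthAddressTimesAllowedBeforeRemoving": 3
  }
}
```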

Monitoring Data (InvokeInfo)

DefaultInvokeMonitor collects call statistics per service entry per endpoint:

| Metric | Description |
| --- | --- |
| TotalInvokeCount | Total invocation count |
| FaultInvokeCount | Failed invocation count |
| ConcurrentCount | Current concurrent count |
| AET (Average Elapsed Time) | Rolling average response time in milliseconds |
| MaxConcurrentCount | Peak concurrent count observed |

These metrics are exposed via Silky.Rpc.Monitor and can be viewed on the Silky Dashboard.
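AET is a rolling average. Silky's exact update formula is not shown in this section; as an illustration only, a standard incremental average updates in O(1) per call without storing past samples:

```csharp
using System;

// Incremental (running) average: after n samples,
//   avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n.
// One common way to maintain a rolling AET; the actual Silky formula may differ.
public static class RollingAverage
{
    public static double Update(double currentAvg, double elapsedMs, long totalInvokeCount)
        => currentAvg + (elapsedMs - currentAvg) / totalInvokeCount;
}
```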
