## Overview

Silky continuously tracks the health of each RPC endpoint (`ISilkyEndpoint`) at runtime. When an endpoint accumulates consecutive communication failures beyond a configurable threshold, the framework automatically removes it from the available endpoint list, preventing the load balancer from routing further requests to an unhealthy instance.
## Core Components

| Component | Type | Responsibility |
|---|---|---|
| `IRpcEndpointMonitor` / `DefaultRpcEndpointMonitor` | Singleton | Maintains health state (`IsEnable` + `UnHealthTimes`) for all endpoints; emits state-change events |
| `IInvokeMonitor` / `DefaultInvokeMonitor` | Singleton (when monitoring enabled) | Tracks per-call success/failure on the client side; updates `IRpcEndpointMonitor` |
| `IServerHandleMonitor` / `DefaultServerHandleMonitor` | Singleton (when monitoring enabled) | Tracks per-request success/failure on the server side |
| `IRpcEndpointSelector` implementations | Singleton | Subscribe to endpoint change events; invalidate cached endpoint lists when endpoints are removed |
## Endpoint Health State

`DefaultRpcEndpointMonitor` maintains a concurrent dictionary keyed by endpoint:

```csharp
private ConcurrentDictionary<ISilkyEndpoint, CheckModel> m_checkEndpoints = new();

private class CheckModel
{
    public bool IsEnable { get; set; }      // Whether the endpoint is currently available
    public int UnHealthTimes { get; set; }  // Consecutive unhealthy count
}
```
Endpoints are added with `IsEnable = true`, `UnHealthTimes = 0` when first registered, triggering an `OnAddMonitor` event.
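Registration can be pictured with a minimal, self-contained sketch. The class and member names below are illustrative stand-ins (a `string` key instead of `ISilkyEndpoint`, a tuple instead of `CheckModel`), not Silky's actual types:

```csharp
using System;
using System.Collections.Concurrent;

class EndpointMonitorSketch
{
    // Mirrors the m_checkEndpoints field above, keyed by a string
    // endpoint for simplicity instead of ISilkyEndpoint.
    private readonly ConcurrentDictionary<string, (bool IsEnable, int UnHealthTimes)> _checkEndpoints = new();

    public event Action<string>? OnAddMonitor;

    public void Monitor(string endpoint)
    {
        // First registration only: healthy, zero consecutive failures.
        if (_checkEndpoints.TryAdd(endpoint, (true, 0)))
            OnAddMonitor?.Invoke(endpoint);
    }
}
```

`TryAdd` makes re-registering an already-known endpoint a no-op, so the add event fires at most once per endpoint.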
## Complete Failure → Removal Chain

### Step 1: RPC Call Fails

`DefaultRemoteCaller.InvokeAsync()` calls `InvokeMonitor.ExecFail()` in the catch block:

```csharp
catch (Exception ex)
{
    invokeMonitor?.ExecFail(
        (remoteInvokeMessage.ServiceEntryId, selectedRpcEndpoint),
        elapsedMs,
        clientInvokeInfo);
    throw;
}
```
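The surrounding call pattern (timing the call, reporting the failure, then rethrowing) can be sketched in isolation. The helper name and delegate parameters below are illustrative, not Silky's API:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class MonitoredInvoker
{
    // Sketch: time the call, report elapsed ms on failure, rethrow.
    public static async Task<T> InvokeWithMonitoringAsync<T>(
        Func<Task<T>> callAsync,        // stands in for the actual RPC transport call
        Action<double> reportFailure)   // stands in for invokeMonitor?.ExecFail(...)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return await callAsync();
        }
        catch
        {
            // Report elapsed milliseconds so per-endpoint stats stay accurate,
            // then rethrow so callers and failover policies still observe the error.
            reportFailure(sw.Elapsed.TotalMilliseconds);
            throw;
        }
    }
}
```

The rethrow matters: the monitor only records the failure; error handling and failover remain the caller's responsibility.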
### Step 2: InvokeMonitor Updates Statistics and Changes Endpoint State

`DefaultInvokeMonitor.ExecFail()` does two things:

- Updates instance-level statistics (`FaultInvokeCount++`, AET calculation)
- Calls `IRpcEndpointMonitor.ChangeStatus(endpoint, isEnable: false, unHealthCeilingTimes)`

```csharp
public void ExecFail((string, ISilkyEndpoint) item, double elapsedMs, ClientInvokeInfo info)
{
    // Update instance stats
    lock (_monitorProvider.InstanceInvokeInfo)
    {
        _monitorProvider.InstanceInvokeInfo.FaultInvokeCount++;
        _monitorProvider.InstanceInvokeInfo.ConcurrentCount--;
    }

    // Update endpoint health
    _rpcEndpointMonitor.ChangeStatus(item.Item2, isEnable: false,
        unHealthCeilingTimes: _governanceOptions.UnHealthAddressTimesAllowedBeforeRemoving);
}
```
### Step 3: RpcEndpointMonitor Changes State and Emits Event

```csharp
public void ChangeStatus(ISilkyEndpoint endpoint, bool isEnable, int unHealthCeilingTimes = 0)
{
    var checkModel = m_checkEndpoints[endpoint];
    checkModel.IsEnable = isEnable;
    if (!isEnable)
    {
        checkModel.UnHealthTimes++;
        if (checkModel.UnHealthTimes >= unHealthCeilingTimes)
        {
            // Emit: endpoint is too unhealthy → remove from service list
            OnRemoveInvalidEndpoint?.Invoke(endpoint);
        }
        else
        {
            // Emit: endpoint is temporarily unhealthy → sleep for AddressFuseSleepDurationSeconds
            OnEndpointDisable?.Invoke(endpoint);
        }
    }
    else
    {
        checkModel.UnHealthTimes = 0;
        // Emit: endpoint recovered
        OnEndpointEnable?.Invoke(endpoint);
    }
}
```
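The threshold behavior above can be exercised in isolation. The following is a minimal, runnable sketch of the same state machine; `HealthTracker`, string endpoints, and the event wiring are illustrative, not Silky's actual types:

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in for DefaultRpcEndpointMonitor's threshold logic.
class HealthTracker
{
    private readonly Dictionary<string, int> _unHealthTimes = new();

    public event Action<string>? OnRemoveInvalidEndpoint;
    public event Action<string>? OnEndpointDisable;
    public event Action<string>? OnEndpointEnable;

    public void ChangeStatus(string endpoint, bool isEnable, int unHealthCeilingTimes)
    {
        if (!isEnable)
        {
            // Consecutive failure: bump the counter, then decide between
            // "temporarily disabled" and "remove from the service list".
            var times = _unHealthTimes.GetValueOrDefault(endpoint) + 1;
            _unHealthTimes[endpoint] = times;
            if (times >= unHealthCeilingTimes)
                OnRemoveInvalidEndpoint?.Invoke(endpoint);
            else
                OnEndpointDisable?.Invoke(endpoint);
        }
        else
        {
            // Any success resets the consecutive-failure counter.
            _unHealthTimes[endpoint] = 0;
            OnEndpointEnable?.Invoke(endpoint);
        }
    }
}
```

With a ceiling of 3, the first two failures fire the disable event and the third fires the removal event; a single success in between resets the count to zero, which is why only *consecutive* failures trigger removal.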
### Step 4: Endpoint Selectors React to Events

Each `IRpcEndpointSelector` implementation subscribes to `OnRemoveInvalidEndpoint` and clears its cached endpoint list:

```csharp
// Example: PollingRpcEndpointSelector
_rpcEndpointMonitor.OnRemoveInvalidEndpoint += endpoint =>
{
    // Invalidate local cache for all service entries using this endpoint
    ClearEndpointCache(endpoint);
};
```
After invalidation, the next selection rebuilds the list from the current healthy endpoints.
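The invalidate-then-rebuild pattern can be sketched with a minimal round-robin selector. `PollingSelectorSketch` and its members are illustrative stand-ins, not Silky's `IRpcEndpointSelector` API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PollingSelectorSketch
{
    private readonly Func<IReadOnlyList<string>> _resolveHealthy; // queries the current healthy set
    private List<string>? _cache;                                 // invalidated on removal events
    private int _index;

    public PollingSelectorSketch(Func<IReadOnlyList<string>> resolveHealthy)
        => _resolveHealthy = resolveHealthy;

    // Wired to OnRemoveInvalidEndpoint: drop the cache so the next
    // selection rebuilds from the currently healthy endpoints.
    public void ClearEndpointCache(string removedEndpoint) => _cache = null;

    public string Select()
    {
        _cache ??= _resolveHealthy().ToList(); // lazy rebuild after invalidation
        var endpoint = _cache[_index % _cache.Count];
        _index++;
        return endpoint;
    }
}
```

Dropping the cache instead of surgically editing it keeps the selector simple: correctness comes from rebuilding against the source of truth, not from patching a copy.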
## Endpoint Recovery

When a previously unhealthy endpoint becomes reachable again (e.g., it restarts and re-registers with the registry):

- The registry change event propagates to all nodes
- `IRpcEndpointMonitor.Monitor(endpoint)` is called for the new/recovered endpoint, resetting its state to `IsEnable = true`, `UnHealthTimes = 0`
- The `OnEndpointEnable` event fires → selectors add the endpoint back to their pools
## Monitoring Configuration

| Config | Default | Description |
|---|---|---|
| `Rpc:EnableMonitor` | `true` | Enable RPC call monitoring |
| `Rpc:CollectMonitorInfoIntervalSeconds` | `30` | Interval (seconds) for collecting and aggregating monitor data |
| `Governance:AddressFuseSleepDurationSeconds` | `60` | Seconds an endpoint stays disabled before being retested |
| `Governance:UnHealthAddressTimesAllowedBeforeRemoving` | `3` | Consecutive unhealthy counts allowed before removal from the endpoint list |
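In .NET configuration terms, the colon-delimited keys above correspond to nested sections. A sketch of the equivalent `appsettings.json` shape, inferred from the key names rather than quoted from Silky's documentation:

```json
{
  "Rpc": {
    "EnableMonitor": true,
    "CollectMonitorInfoIntervalSeconds": 30
  },
  "Governance": {
    "AddressFuseSleepDurationSeconds": 60,
    "UnHealthAddressTimesAllowedBeforeRemoving": 3
  }
}
```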
## Monitoring Data (InvokeInfo)

`DefaultInvokeMonitor` collects call statistics per service entry per endpoint:

| Metric | Description |
|---|---|
| `TotalInvokeCount` | Total invocation count |
| `FaultInvokeCount` | Failed invocation count |
| `ConcurrentCount` | Current concurrent count |
| `AET` (Average Elapsed Time) | Rolling average response time in milliseconds |
| `MaxConcurrentCount` | Peak concurrent count observed |
These metrics are exposed via `Silky.Rpc.Monitor` and can be viewed on the Silky Dashboard.
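A rolling average such as AET can be maintained incrementally, without storing every sample. The update rule below is the standard incremental mean, assumed here as one plausible formulation rather than taken from Silky's source:

```csharp
// Incremental mean: newAvg = oldAvg + (sample - oldAvg) / n.
// O(1) memory regardless of how many calls have been observed.
class RollingAverage
{
    private long _count;

    public double Value { get; private set; }

    public void Add(double elapsedMs)
    {
        _count++;
        Value += (elapsedMs - Value) / _count;
    }
}
```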
