## Overview

Silky continuously tracks the health of each RPC endpoint (`ISilkyEndpoint`) at runtime. When an endpoint accumulates consecutive communication failures beyond a configurable threshold, the framework automatically removes it from the available endpoint list, preventing the load balancer from routing further requests to an unhealthy instance.
## Core Components

| Component | Type | Responsibility |
|---|---|---|
| `IRpcEndpointMonitor` / `DefaultRpcEndpointMonitor` | Singleton | Maintains health state (`IsEnable` + `UnHealthTimes`) for all endpoints; emits state-change events |
| `IInvokeMonitor` / `DefaultInvokeMonitor` | Singleton (when monitoring enabled) | Tracks per-call success/failure on the client side; updates `IRpcEndpointMonitor` |
| `IServerHandleMonitor` / `DefaultServerHandleMonitor` | Singleton (when monitoring enabled) | Tracks per-request success/failure on the server side |
| `IRpcEndpointSelector` implementations | Singleton | Subscribe to endpoint change events; invalidate cached endpoint lists when endpoints are removed |
## Endpoint Health State

`DefaultRpcEndpointMonitor` maintains a concurrent dictionary keyed by endpoint:

```csharp
private ConcurrentDictionary<ISilkyEndpoint, CheckModel> m_checkEndpoints = new();

private class CheckModel
{
    public bool IsEnable { get; set; }      // Whether the endpoint is currently available
    public int UnHealthTimes { get; set; }  // Consecutive unhealthy count
}
```
Endpoints are added with `IsEnable = true`, `UnHealthTimes = 0` when first registered, triggering an `OnAddMonitor` event.
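Registration can be pictured with a minimal, self-contained sketch. The class and member names below are illustrative stand-ins (a `string` key instead of `ISilkyEndpoint`, a tuple instead of `CheckModel`), not Silky's actual types:

```csharp
using System;
using System.Collections.Concurrent;

class EndpointMonitorSketch
{
    // Mirrors the m_checkEndpoints field above, keyed by a string
    // endpoint for simplicity instead of ISilkyEndpoint.
    private readonly ConcurrentDictionary<string, (bool IsEnable, int UnHealthTimes)> _checkEndpoints = new();

    public event Action<string>? OnAddMonitor;

    public void Monitor(string endpoint)
    {
        // First registration only: healthy, zero consecutive failures.
        if (_checkEndpoints.TryAdd(endpoint, (true, 0)))
            OnAddMonitor?.Invoke(endpoint);
    }
}
```

`TryAdd` makes re-registering an already-known endpoint a no-op, so the add event fires at most once per endpoint.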
## Complete Failure → Removal Chain

### Step 1: RPC Call Fails

`DefaultRemoteCaller.InvokeAsync()` calls `InvokeMonitor.ExecFail()` in the catch block:

```csharp
catch (Exception ex)
{
    invokeMonitor?.ExecFail(
        (remoteInvokeMessage.ServiceEntryId, selectedRpcEndpoint),
        elapsedMs,
        clientInvokeInfo);
    throw;
}
```
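The surrounding call pattern (timing the call, reporting the failure, then rethrowing) can be sketched in isolation. The helper name and delegate parameters below are illustrative, not Silky's API:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class MonitoredInvoker
{
    // Sketch: time the call, report elapsed ms on failure, rethrow.
    public static async Task<T> InvokeWithMonitoringAsync<T>(
        Func<Task<T>> callAsync,        // stands in for the actual RPC transport call
        Action<double> reportFailure)   // stands in for invokeMonitor?.ExecFail(...)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return await callAsync();
        }
        catch
        {
            // Report elapsed milliseconds so per-endpoint stats stay accurate,
            // then rethrow so callers and failover policies still observe the error.
            reportFailure(sw.Elapsed.TotalMilliseconds);
            throw;
        }
    }
}
```

The rethrow matters: the monitor only records the failure; error handling and failover remain the caller's responsibility.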
### Step 2: InvokeMonitor Updates Statistics and Changes Endpoint State

`DefaultInvokeMonitor.ExecFail()` does two things:

- Updates instance-level statistics (`FaultInvokeCount++`, AET calculation)
- Calls `IRpcEndpointMonitor.ChangeStatus(endpoint, isEnable: false, unHealthCeilingTimes)`

```csharp
public void ExecFail((string, ISilkyEndpoint) item, double elapsedMs, ClientInvokeInfo info)
{
    // Update instance stats
    lock (_monitorProvider.InstanceInvokeInfo)
    {
        _monitorProvider.InstanceInvokeInfo.FaultInvokeCount++;
        _monitorProvider.InstanceInvokeInfo.ConcurrentCount--;
    }

    // Update endpoint health
    _rpcEndpointMonitor.ChangeStatus(item.Item2, isEnable: false,
        unHealthCeilingTimes: _governanceOptions.UnHealthAddressTimesAllowedBeforeRemoving);
}
```
### Step 3: RpcEndpointMonitor Changes State and Emits Event

```csharp
public void ChangeStatus(ISilkyEndpoint endpoint, bool isEnable, int unHealthCeilingTimes = 0)
{
    var checkModel = m_checkEndpoints[endpoint];
    checkModel.IsEnable = isEnable;
    if (!isEnable)
    {
        checkModel.UnHealthTimes++;
        if (checkModel.UnHealthTimes >= unHealthCeilingTimes)
        {
            // Emit: endpoint is too unhealthy → remove from service list
            OnRemoveInvalidEndpoint?.Invoke(endpoint);
        }
        else
        {
            // Emit: endpoint is temporarily unhealthy → sleep for AddressFuseSleepDurationSeconds
            OnEndpointDisable?.Invoke(endpoint);
        }
    }
    else
    {
        checkModel.UnHealthTimes = 0;
        // Emit: endpoint recovered
        OnEndpointEnable?.Invoke(endpoint);
    }
}
```
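The threshold behavior above can be exercised in isolation. The following is a minimal, runnable sketch of the same state machine; `HealthTracker`, string endpoints, and the event wiring are illustrative, not Silky's actual types:

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in for DefaultRpcEndpointMonitor's threshold logic.
class HealthTracker
{
    private readonly Dictionary<string, int> _unHealthTimes = new();

    public event Action<string>? OnRemoveInvalidEndpoint;
    public event Action<string>? OnEndpointDisable;
    public event Action<string>? OnEndpointEnable;

    public void ChangeStatus(string endpoint, bool isEnable, int unHealthCeilingTimes)
    {
        if (!isEnable)
        {
            // Consecutive failure: bump the counter, then decide between
            // "temporarily disabled" and "remove from the service list".
            var times = _unHealthTimes.GetValueOrDefault(endpoint) + 1;
            _unHealthTimes[endpoint] = times;
            if (times >= unHealthCeilingTimes)
                OnRemoveInvalidEndpoint?.Invoke(endpoint);
            else
                OnEndpointDisable?.Invoke(endpoint);
        }
        else
        {
            // Any success resets the consecutive-failure counter.
            _unHealthTimes[endpoint] = 0;
            OnEndpointEnable?.Invoke(endpoint);
        }
    }
}
```

With a ceiling of 3, the first two failures fire the disable event and the third fires the removal event; a single success in between resets the count to zero, which is why only *consecutive* failures trigger removal.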
### Step 4: Endpoint Selectors React to Events

Each `IRpcEndpointSelector` implementation subscribes to `OnRemoveInvalidEndpoint` and clears its cached endpoint list:

```csharp
// Example: PollingRpcEndpointSelector
_rpcEndpointMonitor.OnRemoveInvalidEndpoint += endpoint =>
{
    // Invalidate local cache for all service entries using this endpoint
    ClearEndpointCache(endpoint);
};
```
After invalidation, the next selection rebuilds the list from the current healthy endpoints.
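The invalidate-then-rebuild pattern can be sketched with a minimal round-robin selector. `PollingSelectorSketch` and its members are illustrative stand-ins, not Silky's `IRpcEndpointSelector` API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PollingSelectorSketch
{
    private readonly Func<IReadOnlyList<string>> _resolveHealthy; // queries the current healthy set
    private List<string>? _cache;                                 // invalidated on removal events
    private int _index;

    public PollingSelectorSketch(Func<IReadOnlyList<string>> resolveHealthy)
        => _resolveHealthy = resolveHealthy;

    // Wired to OnRemoveInvalidEndpoint: drop the cache so the next
    // selection rebuilds from the currently healthy endpoints.
    public void ClearEndpointCache(string removedEndpoint) => _cache = null;

    public string Select()
    {
        _cache ??= _resolveHealthy().ToList(); // lazy rebuild after invalidation
        var endpoint = _cache[_index % _cache.Count];
        _index++;
        return endpoint;
    }
}
```

Dropping the cache instead of surgically editing it keeps the selector simple: correctness comes from rebuilding against the source of truth, not from patching a copy.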
## Endpoint Recovery

When a previously unhealthy endpoint becomes reachable again (e.g., it restarts and re-registers with the registry):

- The registry change event propagates to all nodes
- `IRpcEndpointMonitor.Monitor(endpoint)` is called for the new/recovered endpoint, resetting its state to `IsEnable = true`, `UnHealthTimes = 0`
- The `OnEndpointEnable` event fires → selectors add the endpoint back to their pools
## Monitoring Configuration

| Config | Default | Description |
|---|---|---|
| `Rpc:EnableMonitor` | `true` | Enable RPC call monitoring |
| `Rpc:CollectMonitorInfoIntervalSeconds` | `30` | Interval (seconds) for collecting and aggregating monitor data |
| `Governance:AddressFuseSleepDurationSeconds` | `60` | Seconds an endpoint stays disabled before being retested |
| `Governance:UnHealthAddressTimesAllowedBeforeRemoving` | `3` | Consecutive unhealthy counts allowed before removal from the endpoint list |
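In .NET configuration terms, the colon-delimited keys above correspond to nested sections. A sketch of the equivalent `appsettings.json` shape, inferred from the key names rather than quoted from Silky's documentation:

```json
{
  "Rpc": {
    "EnableMonitor": true,
    "CollectMonitorInfoIntervalSeconds": 30
  },
  "Governance": {
    "AddressFuseSleepDurationSeconds": 60,
    "UnHealthAddressTimesAllowedBeforeRemoving": 3
  }
}
```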
## Monitoring Data (InvokeInfo)

`DefaultInvokeMonitor` collects call statistics per service entry per endpoint:

| Metric | Description |
|---|---|
| `TotalInvokeCount` | Total invocation count |
| `FaultInvokeCount` | Failed invocation count |
| `ConcurrentCount` | Current concurrent count |
| `AET` (Average Elapsed Time) | Rolling average response time in milliseconds |
| `MaxConcurrentCount` | Peak concurrent count observed |
These metrics are exposed via `Silky.Rpc.Monitor` and can be viewed on the Silky Dashboard.
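A rolling average such as AET can be maintained incrementally, without storing every sample. The update rule below is the standard incremental mean, assumed here as one plausible formulation rather than taken from Silky's source:

```csharp
// Incremental mean: newAvg = oldAvg + (sample - oldAvg) / n.
// O(1) memory regardless of how many calls have been observed.
class RollingAverage
{
    private long _count;

    public double Value { get; private set; }

    public void Add(double elapsedMs)
    {
        _count++;
        Value += (elapsedMs - Value) / _count;
    }
}
```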
