Monitoring¶
Each module internally generates its own list of counters using the ThreadData
library (which has supports for stats like SUM/RATE/COUNT over various
durations). These counters are periodically submitted to the Monitor
store
(singleton) via replicate queue called LogSampleQueue
. A Monitoring agent on
the node or remotely can connect to Monitor
and query the counters to perform
monitoring on the top of these counters or build interactive dashboards.
Further, each module logs important events like IFACE_UP
or NEIGHBOR_DOWN
in
a structured fashion via Monitor which can be logged to data stores like Druid
to do real-time monitoring of log events across the fleet.
Understanding Counters¶
Counters are exported as key-value pairs where the key is string and value is a 64-bit integer. Keys represent either a plain counter like a number of neighbors or statistic like the number of SPF computations in last one minute. Stats are encoded with the following method:
<module-name>.<counter-name>.<stat-type>.<seconds>
For example:
kvstore.received_key_vals.sum.60
represents number of key-value updates that happened in past one minutedecision.spf_runs.count.3600
represents number of SPF runs in last one hourdecision.spf.multipath_ms.avg.0
represents the average execution time across all SPF calculations
Important Counters to Monitor¶
Here are some important counters to monitor for a production system and alert
engineers quickly if something goes wrong. breeze monitor counters
will list a
lot of other counters as well. Most names are self-explanatory.
KvStore Counters¶
kvstore.num_keys
=> This counter shouldn’t exceed a certain threshold and must have some max limit for a given network. If the number of keys keeps increasing in KvStore that means the system will soon run out of memorykvstore.peers
=> Usually every node in a network must have at least one peerkvstore.pending_full_sync
=> Pending full sync request to a neighbor, this counter should be 0 most of time
Spark Counters¶
spark.num_tracked_interfaces
=> Indicates the number of interfaces learned by OpenRspark.num_adjacent_neighbors
must match withspark.num_tracked_neighbors
as eventually adjacency must form between all connected nodes in the domain
Decision Counters¶
decision.adj_db_update.count.60
shouldn’t exceed a certain threshold. A higher number indicates an instable networkdecision.prefix_db_update.count.60
shouldn’t exceed a certain threshold. Higher number indicates a lot of churn in route advertisementdecision.spf_runs.count.60
a higher number indicates a lot of network churn (corresponds to adj_db_update).
Fib Counters¶
fib.convergence_time_ms.avg.60
indicates average convergece time for all events in last one minute.fib.num_routes
should correspond to number of unique advertised prefixes across all nodes
Link Monitor Counters¶
link_monitor.advertise_adjacencies.sum.60
=> higher number indicates a lot of adjacency flappinglink_monitor.advertise_links.sum.60
=> higher number indicates a lot of link flapping on system
Log Events¶
Along with counters OpenR also publishes certain log events. Each log event is a json sample described as dictionary of
<value-type> => map<key, value>
Each sample has following keys along with more information
int
key namedtime
indicating timestamp since epoch in number of seconds.string
key nameddomain
indicating network namestring
key namedname
indicating name of the nodestring
key namedevent
indicating event name
Some important evens names are:
ROUTE_CALC
ROUTE_UPDATE
NB_RTT_CHANGE
KEY_EXPIRE
IFACE_UPDATE
IFACE_DOWN
IFACE_UP
DECISION_DEBOUNCE
ROUTE_CONVERGENCE
NB_UP
NB_DOWN
NB_RESTART
ADD_PEER
DEL_PEER