Monitoring¶
Each module internally generates its own list of counters using the ThreadData
library (which supports stats like SUM/RATE/COUNT over various
durations). These counters are periodically submitted to the Monitor store
(a singleton) via a replicate queue called LogSampleQueue. A monitoring agent, running on
the node or remotely, can connect to Monitor and query the counters to perform
monitoring on top of them or to build interactive dashboards.
Further, each module logs important events like IFACE_UP or NEIGHBOR_DOWN in
a structured fashion via Monitor. These events can be forwarded to data stores like Druid
for real-time monitoring of log events across the fleet.
Understanding Counters¶
Counters are exported as key-value pairs where the key is a string and the value is a 64-bit integer. A key represents either a plain counter, such as the number of neighbors, or a statistic, such as the number of SPF computations in the last one minute. Stats are encoded in the following format:
<module-name>.<counter-name>.<stat-type>.<seconds>
For example:
- kvstore.received_key_vals.sum.60 represents the number of key-value updates that happened in the past one minute
- decision.spf_runs.count.3600 represents the number of SPF runs in the last one hour
- decision.spf.multipath_ms.avg.0 represents the average execution time across all SPF calculations
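The naming scheme above can be split back into its parts mechanically. The following is a minimal sketch, not part of OpenR: `CounterKey`, `STAT_TYPES` and `parse_counter_key` are hypothetical names, and the set of stat types is assumed from the ones mentioned in this document.

```python
from typing import NamedTuple, Optional


class CounterKey(NamedTuple):
    module: str
    name: str
    stat_type: Optional[str]  # e.g. "sum", "count", "avg"; None for plain counters
    seconds: Optional[int]    # window length in seconds; None for plain counters


# Stat types mentioned in this document; the real set may be larger (assumption).
STAT_TYPES = {"sum", "rate", "count", "avg"}


def parse_counter_key(key: str) -> CounterKey:
    """Split a counter key into module, counter name, stat type and window."""
    parts = key.split(".")
    if len(parts) < 2:
        raise ValueError(f"not a counter key: {key!r}")
    # A stat key ends in <stat-type>.<seconds>; anything else is a plain counter.
    if len(parts) >= 4 and parts[-2] in STAT_TYPES and parts[-1].isdigit():
        name = ".".join(parts[1:-2])
        return CounterKey(parts[0], name, parts[-2], int(parts[-1]))
    return CounterKey(parts[0], ".".join(parts[1:]), None, None)
```

Note that the counter name itself may contain dots (as in decision.spf.multipath_ms.avg.0), which is why the sketch reads the stat type and window from the end of the key rather than from fixed positions.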
Important Counters to Monitor¶
The following counters are worth monitoring in a production system so that
engineers can be alerted quickly if something goes wrong. The command breeze monitor counters lists
many other counters as well. Most names are self-explanatory.
KvStore Counters¶
- kvstore.num_keys => This counter should stay below a maximum limit chosen for the given network. If the number of keys in KvStore keeps increasing, the system will eventually run out of memory
- kvstore.peers => Usually every node in a network must have at least one peer
- kvstore.pending_full_sync => Number of pending full sync requests to neighbors; this counter should be 0 most of the time
Spark Counters¶
- spark.num_tracked_interfaces => Indicates the number of interfaces learned by OpenR
- spark.num_adjacent_neighbors must match spark.num_tracked_neighbors, as eventually an adjacency must form between all connected nodes in the domain
Decision Counters¶
- decision.adj_db_update.count.60 shouldn’t exceed a certain threshold. A higher number indicates an unstable network
- decision.prefix_db_update.count.60 shouldn’t exceed a certain threshold. A higher number indicates a lot of churn in route advertisements
- decision.spf_runs.count.60: a higher number indicates a lot of network churn (corresponds to adj_db_update)
Fib Counters¶
- fib.convergence_time_ms.avg.60 indicates the average convergence time for all events in the last one minute
- fib.num_routes should correspond to the number of unique advertised prefixes across all nodes
Link Monitor Counters¶
- link_monitor.advertise_adjacencies.sum.60 => A higher number indicates a lot of adjacency flapping
- link_monitor.advertise_links.sum.60 => A higher number indicates a lot of link flapping on the system
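The checks described in this section could be applied to one node's counters roughly as follows. This is a sketch only: it assumes the counters have already been fetched into a plain dictionary, `check_counters` and `MAX_LIMITS` are hypothetical names, and every limit value is a placeholder that has to be chosen per network.

```python
from typing import Dict, List

# Hypothetical limits; none of these values come from OpenR. Tune per network.
MAX_LIMITS = {
    "kvstore.num_keys": 10_000,
    "decision.adj_db_update.count.60": 20,
    "decision.prefix_db_update.count.60": 100,
    "decision.spf_runs.count.60": 20,
    "link_monitor.advertise_adjacencies.sum.60": 10,
    "link_monitor.advertise_links.sum.60": 10,
}


def check_counters(counters: Dict[str, int]) -> List[str]:
    """Return human-readable alerts for one node's counter snapshot."""
    alerts = []
    # Counters that should stay below a network-specific limit.
    for key, limit in MAX_LIMITS.items():
        value = counters.get(key)
        if value is not None and value > limit:
            alerts.append(f"{key}={value} exceeds limit {limit}")
    # Every node should usually have at least one KvStore peer.
    peers = counters.get("kvstore.peers")
    if peers is not None and peers < 1:
        alerts.append("kvstore.peers: node has no KvStore peer")
    # Should be 0 most of the time; a real check would require the
    # non-zero value to persist across several samples before alerting.
    pending = counters.get("kvstore.pending_full_sync")
    if pending:
        alerts.append(f"kvstore.pending_full_sync={pending} is non-zero")
    # Adjacent and tracked neighbor counts should eventually match.
    adjacent = counters.get("spark.num_adjacent_neighbors")
    tracked = counters.get("spark.num_tracked_neighbors")
    if adjacent is not None and tracked is not None and adjacent != tracked:
        alerts.append(f"spark: {adjacent} adjacent vs {tracked} tracked neighbors")
    return alerts
```

Counters missing from the snapshot are skipped rather than treated as zero, so a partial query does not raise false alarms. Comparing fib.num_routes against the number of advertised prefixes needs data from all nodes and is left out of this per-node sketch.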
Log Events¶
Along with counters, OpenR also publishes certain log events. Each log event is a JSON sample described as a dictionary of
<value-type> => map<key, value>
Each sample has the following keys, along with additional information:
- int key named time indicating the timestamp in seconds since the epoch
- string key named domain indicating the network name
- string key named name indicating the name of the node
- string key named event indicating the event name
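A consumer of these samples might flatten the per-type maps and verify that the four common keys are present. The sketch below is illustrative only: `flatten_sample` is a hypothetical helper, and the value-type labels used in the usage example ("int" and "string") are assumptions, since the exact labels are not stated here. The code itself does not depend on them.

```python
import json
from typing import Any, Dict

# The four keys every sample carries according to the list above.
REQUIRED_KEYS = {"time": int, "domain": str, "name": str, "event": str}


def flatten_sample(raw: str) -> Dict[str, Any]:
    """Merge the <value-type> => map<key, value> layers into one dict."""
    sample = json.loads(raw)
    flat: Dict[str, Any] = {}
    for values in sample.values():
        flat.update(values)
    # Reject samples that lack a common key or carry it with the wrong type.
    for key, expected in REQUIRED_KEYS.items():
        if not isinstance(flat.get(key), expected):
            raise ValueError(f"sample is missing {expected.__name__} key {key!r}")
    return flat
```

For example, with made-up values:

```python
raw = ('{"int": {"time": 1500000000}, '
       '"string": {"domain": "example", "name": "node1", "event": "NB_UP"}}')
event = flatten_sample(raw)
print(event["event"], event["name"])
```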
Some important event names are:
- ROUTE_CALC
- ROUTE_UPDATE
- NB_RTT_CHANGE
- KEY_EXPIRE
- IFACE_UPDATE
- IFACE_DOWN
- IFACE_UP
- DECISION_DEBOUNCE
- ROUTE_CONVERGENCE
- NB_UP
- NB_DOWN
- NB_RESTART
- ADD_PEER
- DEL_PEER