Each module internally generates its own list of counters using the
library (which supports stats like SUM/RATE/COUNT over various
durations). These counters are periodically submitted to the Monitor
(singleton) via a replicate queue called LogSampleQueue. A monitoring
agent, on the node or remote, can connect to Monitor and query the
counters to perform monitoring on top of them or to build interactive
dashboards.
Further, each module logs important events in a structured fashion via
Monitor. These events can be written to data stores for real-time
monitoring of log events across the fleet.
Counters are exported as key-value pairs where the key is a string and the value is a 64-bit integer. A key represents either a plain counter, such as the number of neighbors, or a statistic, such as the number of SPF computations in the last minute. Statistic keys end with the stat type and the duration in seconds, for example:
kvstore.received_key_vals.sum.60 represents the number of key-value updates that happened in the past minute
decision.spf_runs.count.3600 represents the number of SPF runs in the last hour
decision.spf.multipath_ms.avg.0 represents the average execution time across all SPF calculations
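The naming scheme above can be taken apart mechanically. The following is a minimal sketch, not part of OpenR: `parse_counter_key`, `CounterKey`, and the `KNOWN_STATS` set are hypothetical names, and the set of stat types is assumed from the examples in this document rather than taken from an authoritative list.

```python
from typing import NamedTuple, Optional

# Stat types seen in this document; an assumed, possibly incomplete set.
KNOWN_STATS = {"sum", "rate", "count", "avg"}


class CounterKey(NamedTuple):
    name: str                # e.g. "decision.spf_runs"
    stat: Optional[str]      # e.g. "count"; None for a plain counter
    window_s: Optional[int]  # e.g. 3600; None for a plain counter


def parse_counter_key(key: str) -> CounterKey:
    """Split '<name>.<stat>.<duration>' into parts; otherwise treat as plain."""
    parts = key.rsplit(".", 2)
    if len(parts) == 3 and parts[1] in KNOWN_STATS and parts[2].isdigit():
        return CounterKey(parts[0], parts[1], int(parts[2]))
    # Keys without a recognized stat suffix are returned unchanged.
    return CounterKey(key, None, None)
```

For example, `parse_counter_key("decision.spf_runs.count.3600")` yields the name `decision.spf_runs`, the stat `count`, and a window of 3600 seconds, while a plain counter such as `kvstore.num_keys` comes back with no stat or window. Judging by the third example above, a duration of 0 appears to mean "over all recorded samples".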
Important Counters to Monitor
Here are some important counters to monitor on a production system so
that engineers can be alerted quickly if something goes wrong. Running
breeze monitor counters lists many other counters as well. Most names
are self-explanatory.
kvstore.num_keys => This counter shouldn’t exceed a certain threshold; there should be a maximum limit for a given network. If the number of keys in KvStore keeps increasing, the system will eventually run out of memory
kvstore.peers => Usually every node in a network must have at least one peer
kvstore.pending_full_sync => Pending full-sync requests to neighbors; this counter should be 0 most of the time
spark.num_tracked_interfaces => Indicates the number of interfaces learned by OpenR
spark.num_adjacent_neighbors must match spark.num_tracked_neighbors, as adjacencies must eventually form between all connected nodes in the domain
decision.adj_db_update.count.60 shouldn’t exceed a certain threshold. A higher number indicates an unstable network
decision.prefix_db_update.count.60 shouldn’t exceed a certain threshold. A higher number indicates a lot of churn in route advertisements
decision.spf_runs.count.60 a higher number indicates a lot of network churn (corresponds to adj_db_update)
fib.convergence_time_ms.avg.60 indicates the average convergence time for all events in the last minute
fib.num_routes should correspond to the number of unique advertised prefixes across all nodes
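The checks described above can be sketched as a small function over a snapshot of counters. This is an illustration only: `check_counters` and both limits are hypothetical, the snapshot is assumed to be a plain dictionary of counter names to integers, and how that snapshot is obtained from Monitor is outside the scope of this sketch.

```python
from typing import Dict, List

# Hypothetical limits; suitable values depend on the size of the network.
MAX_KVSTORE_KEYS = 10_000
MAX_ADJ_DB_UPDATES_PER_MIN = 50


def check_counters(counters: Dict[str, int]) -> List[str]:
    """Return alert messages for one snapshot of counter key-value pairs."""
    alerts: List[str] = []
    if counters.get("kvstore.num_keys", 0) > MAX_KVSTORE_KEYS:
        alerts.append("kvstore.num_keys is above the configured limit")
    if counters.get("kvstore.peers", 0) < 1:
        alerts.append("node has no KvStore peers")
    if counters.get("kvstore.pending_full_sync", 0) != 0:
        alerts.append("full sync with a neighbor is still pending")
    adjacent = counters.get("spark.num_adjacent_neighbors", 0)
    tracked = counters.get("spark.num_tracked_neighbors", 0)
    if adjacent != tracked:
        alerts.append(f"only {adjacent} of {tracked} tracked neighbors are adjacent")
    if counters.get("decision.adj_db_update.count.60", 0) > MAX_ADJ_DB_UPDATES_PER_MIN:
        alerts.append("adjacency database is churning")
    return alerts
```

Some of these conditions are expected to hold only eventually (a pending full sync or a forming adjacency is normal for a short time), so a real alerting rule would likely require the condition to persist over several consecutive snapshots before paging anyone.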
Along with counters, OpenR also publishes certain log events. Each log event is a JSON sample described as a dictionary of
<value-type> => map<key, value>
Each sample has the following keys, along with additional event-specific information:
time indicating the timestamp in seconds since the epoch
domain indicating the network name
name indicating the name of the node
event indicating the event name
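To make the sample layout concrete, here is a minimal sketch of a consumer that merges the per-type maps into one flat dictionary and checks for the common keys listed above. The helper `flatten_sample`, the value-type names `int` and `normal`, and the sample contents (including the event name) are all assumptions made up for illustration; they are not taken from OpenR.

```python
import json
from typing import Any, Dict

# Keys that every sample is described as carrying.
REQUIRED_KEYS = ("time", "domain", "name", "event")


def flatten_sample(raw: str) -> Dict[str, Any]:
    """Merge the <value-type> => map<key, value> layers into one dictionary."""
    sample = json.loads(raw)
    flat: Dict[str, Any] = {}
    for values in sample.values():
        flat.update(values)
    missing = [key for key in REQUIRED_KEYS if key not in flat]
    if missing:
        raise ValueError(f"log sample is missing keys: {missing}")
    return flat


# Made-up sample; "int" and "normal" are assumed value-type names.
EXAMPLE_SAMPLE = json.dumps({
    "int": {"time": 1700000000},
    "normal": {"domain": "lab", "name": "node-1", "event": "EXAMPLE_EVENT"},
})
```

Calling `flatten_sample(EXAMPLE_SAMPLE)` returns a single dictionary in which `time`, `domain`, `name`, and `event` can be looked up directly, which is usually a convenient shape for filtering events by name.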
Some important event names are: