Monitoring

Each module internally generates its own list of counters using the ThreadData library (which supports stats like SUM/RATE/COUNT over various durations). These counters are periodically submitted to the Monitor store (a singleton) via a replicate queue called LogSampleQueue. A monitoring agent, running on the node or remotely, can connect to Monitor and query the counters to build monitoring or interactive dashboards on top of them.

Further, each module logs important events like IFACE_UP or NEIGHBOR_DOWN in a structured fashion via Monitor. These events can be forwarded to data stores like Druid for real-time monitoring of log events across the fleet.

Understanding Counters


Counters are exported as key-value pairs where the key is a string and the value is a 64-bit integer. A key represents either a plain counter, like the number of neighbors, or a statistic, like the number of SPF computations in the last minute. Stat keys are encoded in the following format:

<module-name>.<counter-name>.<stat-type>.<seconds>

For example:

  • kvstore.received_key_vals.sum.60 represents the number of key-value updates received in the past minute

  • decision.spf_runs.count.3600 represents the number of SPF runs in the last hour

  • decision.spf.multipath_ms.avg.0 represents the average execution time, in milliseconds, across all SPF calculations
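
The key format above can be split back into its parts. The following is a minimal sketch, not part of OpenR itself: the CounterKey type, the parse_counter_key helper, and the STAT_TYPES set are hypothetical names, and the set only contains the stat types mentioned in this document, so it may be incomplete. Because a counter name can itself contain dots (as in decision.spf.multipath_ms.avg.0), the key is parsed from the right.

```python
from typing import NamedTuple, Optional

# Stat types mentioned in this document; assumed, and possibly incomplete.
STAT_TYPES = {"sum", "rate", "count", "avg"}


class CounterKey(NamedTuple):
    module: str
    counter: str
    stat_type: Optional[str]  # None for a plain counter
    seconds: Optional[int]    # None for a plain counter


def parse_counter_key(key: str) -> CounterKey:
    """Split '<module-name>.<counter-name>.<stat-type>.<seconds>' into parts."""
    parts = key.split(".")
    # A stat key ends with a known stat type followed by a duration in seconds.
    if len(parts) >= 4 and parts[-1].isdigit() and parts[-2] in STAT_TYPES:
        counter = ".".join(parts[1:-2])  # the counter name may contain dots
        return CounterKey(parts[0], counter, parts[-2], int(parts[-1]))
    # Otherwise treat the key as a plain counter such as 'kvstore.num_keys'.
    module, _, counter = key.partition(".")
    return CounterKey(module, counter, None, None)


print(parse_counter_key("kvstore.received_key_vals.sum.60"))
print(parse_counter_key("decision.spf.multipath_ms.avg.0"))
print(parse_counter_key("kvstore.num_keys"))
```

Parsing from the right keeps the helper independent of how many dot-separated components a module chooses for its counter names.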

Important Counters to Monitor


Here are some important counters to monitor in a production system so that engineers can be alerted quickly if something goes wrong. The breeze monitor counters command lists many other counters as well. Most names are self-explanatory.

KvStore Counters

  • kvstore.num_keys => Should stay below a maximum chosen for the given network. If the number of keys in KvStore keeps increasing, the system will eventually run out of memory

  • kvstore.peers => Every node in a network should normally have at least one peer

  • kvstore.pending_full_sync => Number of pending full-sync requests to neighbors. This counter should be 0 most of the time

Spark Counters

  • spark.num_tracked_interfaces => Indicates the number of interfaces learned by OpenR

  • spark.num_adjacent_neighbors => Must eventually match spark.num_tracked_neighbors, because an adjacency must eventually form between all connected nodes in the domain

Decision Counters

  • decision.adj_db_update.count.60 => Shouldn’t exceed a certain threshold. A higher number indicates an unstable network

  • decision.prefix_db_update.count.60 => Shouldn’t exceed a certain threshold. A higher number indicates a lot of churn in route advertisements

  • decision.spf_runs.count.60 => A higher number indicates a lot of network churn (corresponds to adj_db_update)

Fib Counters

  • fib.convergence_time_ms.avg.60 => Indicates the average convergence time, in milliseconds, for all events in the last minute

  • fib.num_routes => Should correspond to the number of unique advertised prefixes across all nodes
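
Several of the checks described above can be expressed as a simple function over a snapshot of counters. This is only a sketch under stated assumptions: the check_counters helper and both threshold constants are hypothetical, the limits are placeholders to be tuned per network, and the snapshot is assumed to be a plain dictionary of counter name to integer value, however it was fetched from Monitor.

```python
from typing import Dict, List

# Hypothetical limits; choose values that fit your own network.
MAX_KVSTORE_KEYS = 10_000
MAX_ADJ_DB_UPDATES_PER_MIN = 30


def check_counters(counters: Dict[str, int]) -> List[str]:
    """Return warnings for one snapshot of counters.

    Some conditions (pending full sync, adjacency mismatch) are only a
    problem if they persist, so a real alert should require several
    consecutive failing snapshots rather than a single one.
    """
    warnings = []
    if counters.get("kvstore.num_keys", 0) > MAX_KVSTORE_KEYS:
        warnings.append("kvstore.num_keys exceeds the configured maximum")
    if counters.get("kvstore.peers", 0) < 1:
        warnings.append("kvstore.peers is below 1")
    if counters.get("kvstore.pending_full_sync", 0) != 0:
        warnings.append("kvstore.pending_full_sync is non-zero")
    adjacent = counters.get("spark.num_adjacent_neighbors")
    tracked = counters.get("spark.num_tracked_neighbors")
    if adjacent != tracked:
        warnings.append("spark adjacent and tracked neighbor counts differ")
    if counters.get("decision.adj_db_update.count.60", 0) > MAX_ADJ_DB_UPDATES_PER_MIN:
        warnings.append("decision.adj_db_update.count.60 indicates churn")
    return warnings


snapshot = {
    "kvstore.num_keys": 120,
    "kvstore.peers": 2,
    "kvstore.pending_full_sync": 0,
    "spark.num_adjacent_neighbors": 2,
    "spark.num_tracked_neighbors": 3,
    "decision.adj_db_update.count.60": 1,
}
for warning in check_counters(snapshot):
    print(warning)
```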

Log Events


Along with counters, OpenR also publishes certain log events. Each log event is a JSON sample described as a dictionary of

<value-type> => map<key, value>

Each sample has the following keys, along with additional event-specific information:

  • int key named time indicating the timestamp in seconds since the epoch

  • string key named domain indicating network name

  • string key named name indicating name of the node

  • string key named event indicating event name
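
To make the nested layout concrete, the sketch below flattens one sample into a single dictionary. It is illustrative only: flatten_sample is a hypothetical helper, the value-type labels "int" and "normal" in the example are assumptions rather than a confirmed list, and the domain, node name, and timestamp are made-up values.

```python
import json
from typing import Any, Dict


def flatten_sample(raw: str) -> Dict[str, Any]:
    """Merge every '<value-type> => map<key, value>' entry into one dict."""
    sample = json.loads(raw)
    flat: Dict[str, Any] = {}
    for _value_type, values in sample.items():
        # Each value is expected to be a map of key to value for that type.
        flat.update(values)
    return flat


# Made-up sample; the type labels "int" and "normal" are assumptions.
raw = json.dumps({
    "int": {"time": 1700000000},
    "normal": {"domain": "example-net", "name": "node-1", "event": "NB_DOWN"},
})

event = flatten_sample(raw)
print(event["time"], event["domain"], event["name"], event["event"])
```

Flattening assumes key names are unique across value types, which holds for the four common keys listed above.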

Some important event names are:

  • ROUTE_CALC

  • ROUTE_UPDATE

  • NB_RTT_CHANGE

  • KEY_EXPIRE

  • IFACE_UPDATE

  • IFACE_DOWN

  • IFACE_UP

  • DECISION_DEBOUNCE

  • ROUTE_CONVERGENCE

  • NB_UP

  • NB_DOWN

  • NB_RESTART

  • ADD_PEER

  • DEL_PEER