# Monitoring Each module internally generates its own list of counters using the `ThreadData` library (which has supports for stats like SUM/RATE/COUNT over various durations). These counters are periodically submitted to the `Monitor` store (singleton) via replicate queue called `LogSampleQueue`. A Monitoring agent on the node or remotely can connect to `Monitor` and query the counters to perform monitoring on the top of these counters or build interactive dashboards. Further, each module logs important events like `IFACE_UP` or `NEIGHBOR_DOWN` in a structured fashion via Monitor which can be logged to data stores like `Druid` to do real-time monitoring of log events across the fleet. ## Understanding Counters --- Counters are exported as key-value pairs where the key is string and value is a 64-bit integer. Keys represent either a plain counter like a number of neighbors or statistic like the number of SPF computations in last one minute. Stats are encoded with the following method: `...` For example: - `kvstore.received_key_vals.sum.60` represents number of key-value updates that happened in past one minute - `decision.spf_runs.count.3600` represents number of SPF runs in last one hour - `decision.spf.multipath_ms.avg.0` represents the average execution time across all SPF calculations ### Important Counters to Monitor --- Here are some important counters to monitor for a production system and alert engineers quickly if something goes wrong. `breeze monitor counters` will list a lot of other counters as well. Most names are self-explanatory. #### KvStore Counters - `kvstore.num_keys` => This counter shouldn't exceed a certain threshold and must have some max limit for a given network. If the number of keys keeps increasing in KvStore that means the system will soon run out of memory - `kvstore.peers` => Usually every node in a network must have at least one peer - `kvstore.pending_full_sync` => Pending full sync request to a neighbor, this counter should be 0 most of time #### Spark Counters - `spark.num_tracked_interfaces` => Indicates the number of interfaces learned by OpenR - `spark.num_adjacent_neighbors` must match with `spark.num_tracked_neighbors` as eventually adjacency must form between all connected nodes in the domain #### Decision Counters - `decision.adj_db_update.count.60` shouldn't exceed a certain threshold. A higher number indicates an instable network - `decision.prefix_db_update.count.60` shouldn't exceed a certain threshold. Higher number indicates a lot of churn in route advertisement - `decision.spf_runs.count.60` a higher number indicates a lot of network churn (corresponds to adj_db_update). #### Fib Counters - `fib.convergence_time_ms.avg.60` indicates average convergece time for all events in last one minute. - `fib.num_routes` should correspond to number of unique advertised prefixes across all nodes #### Link Monitor Counters - `link_monitor.advertise_adjacencies.sum.60` => higher number indicates a lot of adjacency flapping - `link_monitor.advertise_links.sum.60` => higher number indicates a lot of link flapping on system ## Log Events --- Along with counters OpenR also publishes certain log events. Each log event is a json sample described as dictionary of ` => map` Each sample has following keys along with more information - `int` key named `time` indicating timestamp since epoch in number of seconds. - `string` key named `domain` indicating network name - `string` key named `name` indicating name of the node - `string` key named `event` indicating event name Some important evens names are: - `ROUTE_CALC` - `ROUTE_UPDATE` - `NB_RTT_CHANGE` - `KEY_EXPIRE` - `IFACE_UPDATE` - `IFACE_DOWN` - `IFACE_UP` - `DECISION_DEBOUNCE` - `ROUTE_CONVERGENCE` - `NB_UP` - `NB_DOWN` - `NB_RESTART` - `ADD_PEER` - `DEL_PEER`