We run hundreds to thousands of agents and want to be able to passively review their state.
As the team responsible for managing thousands of agents, we have no shortage of active monitoring that alerts on anomalies (too few agents, spikes in volume, et cetera). It won’t be perfect, though, and will need continuous adjustment.
A view of the overall health of agents is likely the best fit here, more so than sorting, though sorting would help if it is simpler to implement. Questions we would want such a view to answer:
- Do we have any agents behind on versions? Why? How many?
- Are we running any unexpected OS versions? Did our OS bump miss any hosts?
- Any hosts not taking jobs for some reason?
- Any anomalies in status that our monitoring didn’t pick up, or that we need to add to a monitor?
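
To make the questions above concrete, here is a rough sketch of the kind of summary we have in mind. It is only an illustration: the `Agent` record, its field names, and the `summarize` helper are all hypothetical and not taken from any real agent API. It also treats any version other than the expected one as a mismatch rather than trying to compare versions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Agent:
    # Hypothetical record shape; real field names depend on the agent API.
    hostname: str
    version: str
    os_version: str
    last_job_at: Optional[datetime]  # None if the agent has never run a job


def summarize(
    agents: list[Agent],
    expected_version: str,
    expected_os: str,
    idle_after: timedelta,
    now: datetime,
) -> dict:
    """Build a passive health summary for a fleet of agents."""
    # Agents whose version differs from the one we expect to be rolled out.
    version_mismatch = [a for a in agents if a.version != expected_version]
    # Agents that an OS bump may have missed.
    unexpected_os = [a for a in agents if a.os_version != expected_os]
    # Agents that have never taken a job, or not within the idle window.
    not_taking_jobs = [
        a for a in agents
        if a.last_job_at is None or now - a.last_job_at > idle_after
    ]
    return {
        "total": len(agents),
        "versions": dict(Counter(a.version for a in agents)),
        "os_versions": dict(Counter(a.os_version for a in agents)),
        "version_mismatch": sorted(a.hostname for a in version_mismatch),
        "unexpected_os": sorted(a.hostname for a in unexpected_os),
        "not_taking_jobs": sorted(a.hostname for a in not_taking_jobs),
    }
```

The per-version and per-OS counts cover the "how many?" questions, and the hostname lists give a starting point for the "why?". Anything that keeps showing up in these lists is a candidate for a new active monitor.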
An argument could be made for exporting all of this to an observability product and building the view there instead.