How To Improve Incident Routing With Cloudaware CMDB and PagerDuty

Challenge

Cloudaware
Nov 6, 2019

Getting the right people involved during an incident is probably the most important factor in how fast the incident gets resolved. However, as cloud environments sprawl across multiple providers, accounts, and subscriptions, and as cloud providers add more services, tapping into the right resources can become a challenge.

Events from monitoring tools such as New Relic and Datadog often enter PagerDuty with little or no structured data, which keeps us from fully exploiting PagerDuty features such as:

  • Filtering
  • Routing
  • Grouping
  • Escalating
  • Response Plays

Solution

Instead of sending event signals into PagerDuty directly, consider passing them through a CMDB, such as Cloudaware. A CMDB holds a trove of information about the resource involved in the incident, from business contacts and mission criticality to components and layers.

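PagerDuty normalizes incoming events into its own Common Event Format (PD-CEF), with fields such as summary, source, severity, component, group, and class.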

Cloudaware CMDB takes full advantage of this event format and enriches event data with details specific to AWS, Azure and Google Cloud.
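
A minimal sketch of this enrichment step in Python is shown here. The PD-CEF field names (`summary`, `source`, `severity`, `component`, `group`, `custom_details`) and the `routing_key` / `event_action` envelope come from PagerDuty's Events API v2. Everything else is hypothetical and only stands in for whatever Cloudaware actually stores and emits: the `CMDB` dictionary, its attribute names, the sample ARN, the `enrich_event` helper, and the choice to map the layer onto `group`.

```python
from datetime import datetime, timezone

# Hypothetical CMDB records keyed by resource identifier (assumption:
# a real CMDB would be queried over an API, not held in a dict).
CMDB = {
    "arn:aws:dynamodb:us-east-1:123456789012:table/orders": {
        "account_id": "123456789012",
        "component": "AWS DynamoDB",
        "layer": "database",
        "criticality": "high",
        "business_contact": "orders-team@example.com",
    }
}


def enrich_event(raw_event, routing_key, cmdb=CMDB):
    """Wrap a raw monitoring event in a PD-CEF payload enriched from the CMDB."""
    ci = cmdb.get(raw_event["resource"], {})
    payload = {
        "summary": raw_event["message"],
        "source": raw_event["resource"],
        # PD-CEF severities are: critical, error, warning, info.
        "severity": raw_event.get("severity", "error"),
        "timestamp": raw_event.get("timestamp")
        or datetime.now(timezone.utc).isoformat(),
    }
    if ci:
        payload["component"] = ci["component"]
        payload["group"] = ci["layer"]  # assumption: layer mapped to "group"
        payload["custom_details"] = {
            "aws_account_id": ci["account_id"],
            "layer": ci["layer"],
            "criticality": ci["criticality"],
            "business_contact": ci["business_contact"],
        }
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": payload,
    }
```

The returned dictionary has the shape of an Events API v2 request body. An event for a resource the CMDB does not know passes through with only the basic fields, so nothing is dropped when enrichment is unavailable.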

For example, when an enriched event enters PagerDuty, PagerDuty can make much better routing decisions based on details such as the AWS Account ID and the fact that the impacted component is AWS DynamoDB. We can also use the Layer attribute to decide which on-call schedule should be invoked.
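
The routing decision described above can be illustrated with a small first-match rule table. This is only a sketch of the logic: in practice the rules would be configured inside PagerDuty itself, and the rule table, schedule names, attribute keys, and `pick_schedule` function here are all hypothetical.

```python
# Hypothetical routing rules, most specific first. Each rule pairs the
# attributes an event must carry with the on-call schedule to invoke.
ROUTES = [
    ({"layer": "database", "aws_account_id": "123456789012"}, "dba-prod-oncall"),
    ({"layer": "database"}, "dba-oncall"),
    ({"layer": "network"}, "network-oncall"),
]
DEFAULT_SCHEDULE = "ops-catchall"


def pick_schedule(details, routes=ROUTES, default=DEFAULT_SCHEDULE):
    """Return the schedule of the first rule whose conditions all match."""
    for conditions, schedule in routes:
        if all(details.get(key) == value for key, value in conditions.items()):
            return schedule
    return default
```

An event carrying no CMDB attributes falls through to the catch-all schedule, which is the situation the enrichment step is meant to avoid.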

Another benefit of passing events through a CMDB, such as Cloudaware, is that we create a CI-centric view of the incidents. This is necessary to answer questions such as “Which load balancer or EC2 instance has had the most issues in the last 30 days?”

CMDB list view showing number of PDuty Incidents per CMDB CI

There is tremendous value in understanding the history of incidents for each asset in the CMDB. Such a view helps to identify chronic issues, perform root cause analysis, and avoid repeating mistakes.
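
Once every incident is linked to a CI, the 30-day question above reduces to a simple count. The sketch below assumes each incident is a dictionary with a `ci` identifier and a timezone-aware `created_at` datetime. Both field names and the `incidents_per_ci` helper are hypothetical and not part of Cloudaware or PagerDuty.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def incidents_per_ci(incidents, now, days=30):
    """Count incidents per CI created within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return Counter(
        incident["ci"]
        for incident in incidents
        if incident["created_at"] >= cutoff
    )


# Illustrative data only: identifiers and dates are made up.
now = datetime(2019, 11, 6, tzinfo=timezone.utc)
incidents = [
    {"ci": "i-0abc", "created_at": now - timedelta(days=2)},
    {"ci": "i-0abc", "created_at": now - timedelta(days=10)},
    {"ci": "elb-web", "created_at": now - timedelta(days=5)},
    {"ci": "elb-web", "created_at": now - timedelta(days=45)},  # too old
]
counts = incidents_per_ci(incidents, now)
noisiest_ci, incident_count = counts.most_common(1)[0]
```

Sorting the counts in descending order gives the same ranking that a per-CI incident column in a CMDB list view would show.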
