Get to the Root of Cloud Infrastructure Issues with Logs, Alerts and Events in Rovius 1.3.1

 In Accelerite Blog

Rovius 1.3.1 brings a number of new features. Let’s take a closer look at what’s new in the Log Analysis and Alerts & Events sections.

The Logs, Alerts & Events sections in Rovius Ops Manager provide a single pane of glass for what’s happening across different resources within the cloud infrastructure. They let you view user activities, issues and resource failures, which makes them very helpful for root cause analysis. To get a complete picture, you need to correlate logs with alerts and events from the same time window. To enable this, both sections offer multiple options for filtering data by different fields and across various time durations.
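To make the idea of correlating by time window concrete, here is a minimal sketch in Python. It is not how Rovius Ops Manager is implemented; the record layout, the sample timestamps and the `logs_around` helper are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical alert and log records; field names and values are illustrative only.
ALERT = {"time": datetime(2018, 5, 1, 10, 2), "message": "VM creation failed"}
LOG_ENTRIES = [
    {"time": datetime(2018, 5, 1, 9, 30), "message": "template sync complete"},
    {"time": datetime(2018, 5, 1, 10, 0), "message": "storage out of capacity"},
    {"time": datetime(2018, 5, 1, 10, 4), "message": "job failed"},
]


def logs_around(alert, logs, window=timedelta(minutes=5)):
    """Return the log entries recorded within `window` of the alert time."""
    return [entry for entry in logs if abs(entry["time"] - alert["time"]) <= window]


nearby = logs_around(ALERT, LOG_ENTRIES)
for entry in nearby:
    print(entry["time"], entry["message"])
```

Widening or narrowing `window` plays the same role as choosing a longer or shorter time duration in the filters.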

Let’s go through some scenarios and see how different filters help in troubleshooting workflows. Typical scenarios to troubleshoot include “VM creation failed”, “storage copy operation not working” and “system VM boot failed”. A single symptom can have many causes: VM creation, for example, can fail because of “storage out of capacity”, “template not found”, “server unreachable” and so on. Here’s how you can use Rovius Ops Manager to get to the root cause.

Logs Section

  • As a first step, search for errors across all components within the same time duration. You can start with fixed time durations such as today, yesterday, last 7 days or last month. These quick selections help you find errors faster. You can also apply a custom range to search between specific start and end dates, and then drill down to a particular day or hour to narrow the search scope.
  • You can focus on specific types of logs using the “Log Type” filter. This option lets you search cloud component logs across API, Management Server, System VMs, and XenServer/VMware components. Narrowing the search to specific cloud components helps you find relevant logs faster. Depending on the “Log Type” selection, you can drill down further to the logs of a specific component entity, such as a particular console proxy or SSVM. To troubleshoot network connectivity issues in a given cluster, you can focus on the logs of a specific virtual router id. Similarly, for storage problems, you can narrow down issues using the logs of a specific SSVM.
  • The search query box offers several more options:
    • Exceptions like “java.lang.OutOfMemoryError” can be looked up directly in management server logs. Future releases will have alerts for errors depending on the number of occurrences in a given duration.
    • All cloud activities related to a host “xyz” can be tracked using a query like “Host:xyz”. You can add a time duration and a VM id to drill down further.
    • To troubleshoot a specific operation, users can search using the job id. For example, “job-1076 AND log_level=error” brings up the relevant error logs, such as insufficient server capacity.
    • Details for any log entry can be viewed at any time by enabling contextual fields.
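The combination of a time range and an “AND” query can be pictured with a small Python sketch. This only illustrates the filtering logic, not the Rovius query engine: the record fields, sample data and the `matches` and `search` helpers are hypothetical, and the parser handles only the simple “term AND field=value” form shown above.

```python
from datetime import datetime

# Hypothetical log records; field names and values are assumptions for illustration.
LOGS = [
    {"time": datetime(2018, 5, 1, 10, 0), "log_level": "error",
     "message": "job-1076 failed: insufficient server capacity"},
    {"time": datetime(2018, 5, 1, 10, 5), "log_level": "info",
     "message": "job-1076 scheduled"},
    {"time": datetime(2018, 5, 2, 9, 0), "log_level": "error",
     "message": "job-1080 failed: template not found"},
]


def matches(record, query):
    """True if the record satisfies every AND-joined term of the query."""
    for term in query.split(" AND "):
        term = term.strip()
        if "=" in term:
            # field=value terms are compared against the named field
            field, value = term.split("=", 1)
            if str(record.get(field.strip(), "")).lower() != value.strip().lower():
                return False
        elif term.lower() not in record["message"].lower():
            # bare terms are treated as a substring of the message
            return False
    return True


def search(logs, query, start, end):
    """Apply the time range first, then the query terms."""
    return [r for r in logs if start <= r["time"] <= end and matches(r, query)]


hits = search(LOGS, "job-1076 AND log_level=error",
              datetime(2018, 5, 1), datetime(2018, 5, 2))
for hit in hits:
    print(hit["time"], hit["message"])
```

Of the three sample records, only the error entry for job-1076 passes both the time range and the query.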

Alerts and Events Section

Once you select a cloud, all alerts and events are available for search from a single place. You can slice and dice the alert and event data along the cloud infrastructure hierarchy (Zone, Pod, Cluster, Host) and by other components such as system VM ids, volume number and template ids. This way you can quickly see which Clusters or Hosts have the most alerts and focus your troubleshooting effort there.

You can also query specific alerts such as “SSVM stopped unexpectedly” and see how many times they occurred in a selected time duration across different zones. Users can narrow the results by selecting specific zones, and refine them further by using the ‘AND’ keyword with a specific SSVM id.
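As a rough illustration of this kind of breakdown, the Python sketch below counts matching alerts per zone and then narrows the count to one SSVM. The alert records, ids and the `count_alerts` helper are hypothetical and do not reflect the Rovius data model; the time-duration filter is left out to keep the example short.

```python
from collections import Counter

# Hypothetical alert records; zone names and SSVM ids are made up for illustration.
ALERTS = [
    {"zone": "zone-1", "ssvm_id": "s-12", "message": "SSVM stopped unexpectedly"},
    {"zone": "zone-1", "ssvm_id": "s-14", "message": "SSVM stopped unexpectedly"},
    {"zone": "zone-2", "ssvm_id": "s-20", "message": "SSVM stopped unexpectedly"},
    {"zone": "zone-2", "ssvm_id": "s-20", "message": "Host disconnected"},
]


def count_alerts(alerts, text, group_by, **filters):
    """Count alerts whose message contains `text`, grouped by one field.

    Extra keyword arguments act like additional AND conditions.
    """
    counts = Counter()
    for alert in alerts:
        if text.lower() not in alert["message"].lower():
            continue
        if any(alert.get(key) != value for key, value in filters.items()):
            continue
        counts[alert[group_by]] += 1
    return counts


per_zone = count_alerts(ALERTS, "SSVM stopped unexpectedly", "zone")
one_ssvm = count_alerts(ALERTS, "SSVM stopped unexpectedly", "zone", ssvm_id="s-12")
print(per_zone)
print(one_ssvm)
```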

Similarly, events like IAMPOLICY.REVOKE can be looked up to find out whether any “ACL policy revoke” operations took place over a specific time period. All kinds of events, such as VM.CREATE and SSVM.DESTROY, can be traced across the cloud, and their distribution across components can be seen by adding cloud components to the filter criteria.

Rovius Cloud builds a fully managed enterprise hybrid cloud from existing IT infrastructure that is ready to consume within minutes. Rovius Cloud is easy to install, upgrades with zero downtime and enables users to consume public cloud resources with unlimited capacity.
