Diagnosing performance issues in production environments – Part I of II

Performance issues in large-scale, highly distributed systems can have many causes. The following are some of the most common.

  • Lack of understanding of the language features.
  • Lapse of concentration during development.
  • Ineffective requirements modelling techniques resulting in vague and imprecise requirements.
  • Absence of unit tests and failure to carry out boundary value analysis during the testing phase.
  • Unrealistic schedules resulting in burnout.

The traditional way of investigating a performance issue predictably starts with log analysis, where we read through a bundle of log files to find the cause. Log analysis tools such as Splunk make this easier, but the result still depends on the quality of the logs and requires domain knowledge of the whole application to begin with. Another option is to debug the application to locate the flaw. This is not possible unless we can transfer the production data to the development environment, or simulate it there, to recreate the behavior, and that is about as time and resource consuming as it gets.

When there are several developers in a team, it is normally difficult for any one person to pinpoint the critical parts of the application, depending on the size of the team and the number of components concerned. Performance issues can turn into a nightmare if the developers who contributed to certain vital components are no longer around.

Considering these problems, what is the most effective way to approach the issue? Performance metrics are the one element we have been missing.

Incorporating performance metrics into the code gives you an idea of how your code performs and behaves. Abnormal behavior can be spotted with ease, provided that the right set of metrics has been placed at the right points in your code. It is important to understand that monitoring and diagnosing have different motives. Unlike in monitoring, when it comes to diagnosing we need all the metrics we can possibly have. The following are some of the metrics that can be integrated into your code to better understand its behavior.

  • Timers:
    Provide the execution time of a particular code block.
  • Counters:
    Simple counters that can be incremented and decremented, for example to keep track of the sizes of the data structures being used.
  • Gauges:
    The simplest metric of them all. A gauge returns the current value of something, such as the size of a cache, and can be read at frequent intervals to keep an eye on its growth.
  • Health Checks:
    Report, in one place, whether the application's external dependencies, such as a database connection or a communication channel connection, are healthy.
  • Meters:
    Give the rate of events over a given interval. This could be the number of messages published on a channel in that interval.
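
To make the list above concrete, here is a minimal sketch of each metric type in Python. The class and function names (Counter, Gauge, Timer, Meter, run_health_checks) are made up for this illustration and are not taken from any particular metrics library. A real library would also add thread safety and richer statistics such as percentiles.

```python
import time
from contextlib import contextmanager


class Counter:
    """A value that can be incremented and decremented."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Gauge:
    """Reads an instantaneous value by calling a supplied function."""

    def __init__(self, read_fn):
        self._read_fn = read_fn

    def value(self):
        return self._read_fn()


class Timer:
    """Records how long each measured block took, in seconds."""

    def __init__(self):
        self.durations = []

    @contextmanager
    def measure(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the block raised an exception.
            self.durations.append(time.perf_counter() - start)


class Meter:
    """Counts events and reports their mean rate per second since creation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


def run_health_checks(checks):
    """Runs each named check; a check passes if it returns a truthy value without raising."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results
```

A gauge over a cache would then be created as `Gauge(lambda: len(cache))`, and a code block would be timed by wrapping it in `with timer.measure():`.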

Now that we are familiar with the basic set of metrics, how do we use them in a moment of crisis, and how do they help us identify the cause of a bottleneck? We need a reporting mechanism comprehensive enough to let us spot abnormal behavior in the metrics we have included. The best option is to feed the metrics data into a graphing platform such as Graphite or Ganglia, so that we can look at a graph and point to the misbehavior and its time of occurrence. We could also write the data to the log file we are already using, to keep everything in one place.

The next time something kicks the bucket, it is then only a matter of checking the metrics graphs for a pattern, and reading the log entries around the times of the abnormal behavior, together with the values retrieved from the metrics, to see where the code has fallen short.

Not only can we use metrics to diagnose performance issues, we can also use them to benchmark our application's performance. This provides an expected standard for the testers as well as the clients.
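
As a rough sketch of feeding metrics into Graphite: its Carbon daemon accepts a plaintext protocol with one "path value timestamp" line per data point, listening on TCP port 2003 by default. The function names and the metric path used below are made up for this illustration, and the host and port are assumptions about a local setup.

```python
import socket
import time


def format_graphite_line(path, value, timestamp=None):
    """Builds one data point in Graphite's plaintext format: path, value, Unix timestamp."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines, host="localhost", port=2003, timeout=5.0):
    """Sends already formatted lines to a Carbon plaintext listener (host and port assumed)."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

Calling `send_to_graphite([format_graphite_line("app.orders.queue_size", 42)])` at a regular interval would then produce a series that can be graphed and compared against the log entries from the same period.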

It's amazing to think of the world of possibilities that metrics provide us with. But how do we implement all this? If the plan is to implement it all on our own, that would turn into a separate side project needing additional resources and effort. Apart from that, there would be concerns about running the metrics code in the production environment. Do we need the ability to switch the metrics code on only as and when required? Even if that's the plan, wouldn't integrating the metrics code into the business logic reduce readability and cause confusion? But then again, most of the problems in the world of programming have already been solved. It's only a matter of finding the right solution and adapting it to our own needs! We will discuss the list of available solutions in the next post!

