Diagnosing performance issues in production environments – Part II of II

It has been a hectic few days, and at last I have found some time to discuss the potential solutions. In the previous post we discussed the importance of metrics and how they can help us diagnose performance issues in a production environment. We were left with the question of whether we should implement the metrics logic on our own. The simple answer is no. The developer community has already come up with effective libraries that provide us with a variety of features. In this post, let’s look at the potential solutions we have for C#, Java and JavaScript.

The project in which I had to integrate performance metrics was written in Java, so I’ll start with the Java solutions. Once we chose to go ahead with metrics, we found that numerous metrics implementations are available for Java. To choose a library rationally, we prepared a DAR report. DAR (Decision Analysis and Resolution) is a formal process in which the identified alternatives are evaluated against established criteria before a decision is made. Out of the available Java libraries, we shortlisted a few that clearly had the edge over the others in popularity, features and support.

  • Java Metrics
  • Perf4J
  • JAMon
  • Java Simon

The following criteria (in order of the weights assigned) were drawn up to choose the best solution from the alternatives listed above.

  • Ability to create custom counters and timers for monitoring and measuring performance of code blocks.
  • Ability to visually evaluate the performance results which would help in finding trends and patterns.
  • Free and open source software license considering the cost and ability to do modifications as and when required.
  • Ability to enable or disable the features on demand.
  • Active support provided through mailing list or by any other means.
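The scoring behind a DAR like this one boils down to a weighted sum: each criterion carries a weight, each candidate gets a rating per criterion, and the products are added up. Here is a minimal sketch of that arithmetic. The weights and ratings in it are made-up placeholders, not the figures from our report, and `DarScore` is just an illustrative name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a DAR-style weighted score. All weights and ratings below are
// hypothetical placeholders, not the figures from the actual report.
final class DarScore {

    // Sum of (criterion weight x rating the candidate received for that criterion).
    static double weightedScore(Map<String, Double> weights, Map<String, Double> ratings) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            // A criterion the candidate was not rated on contributes nothing.
            total += e.getValue() * ratings.getOrDefault(e.getKey(), 0.0);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical weights, highest for the most important criterion.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("custom counters and timers", 5.0);
        weights.put("visualization", 4.0);
        weights.put("license", 3.0);
        weights.put("enable/disable on demand", 2.0);
        weights.put("support", 1.0);

        // Hypothetical ratings for one imaginary candidate library.
        Map<String, Double> candidate = new LinkedHashMap<>();
        candidate.put("custom counters and timers", 3.0);
        candidate.put("visualization", 2.0);
        candidate.put("license", 3.0);
        candidate.put("support", 1.0);

        System.out.println("score = " + weightedScore(weights, candidate));
    }
}
```

The candidate with the highest total wins, and keeping the weights explicit makes it easy to see which criterion tipped the decision.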

The Java Metrics library (http://metrics.dropwizard.io/3.2.0/) easily outplayed the rest with a total of 47 points based on the criteria weights assigned. It satisfies the performance counter creation requirement with meters, gauges, counters, timers and health checks. The visualization requirement is satisfied through its reporting capabilities with Graphite and Ganglia. The licensing requirement is fulfilled by the Apache License 2.0. Active support is available through a mailing list and dedicated Stack Overflow tags. Enabling or disabling the features on demand is not provided, so we chose to wrap the library and provide that ourselves! Perf4J was the second best option with 34 points.
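To give an idea of what wrapping a library to get an on/off switch can look like, here is a minimal sketch. It is not our actual wrapper: `ToggleableTimer` is a hypothetical name, and the sketch accumulates nanoseconds itself where a real wrapper would delegate to the library’s timer.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical wrapper: ToggleableTimer is an illustrative name, not a class
// from any metrics library. A real wrapper would delegate to the library's
// timer where this sketch keeps its own totals.
final class ToggleableTimer {
    private final AtomicBoolean enabled = new AtomicBoolean(true);
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    // Switch measurement on or off at runtime.
    void setEnabled(boolean on) {
        enabled.set(on);
    }

    // Always runs the work; measures it only while metrics are enabled.
    <T> T time(Supplier<T> work) {
        if (!enabled.get()) {
            return work.get(); // disabled: no measurement, no bookkeeping
        }
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
            calls.incrementAndGet();
        }
    }

    long calls() {
        return calls.get();
    }

    long totalNanos() {
        return totalNanos.get();
    }
}
```

Calling code would use something like `timer.time(() -> service.handle(request))` and never needs to know whether measurement is currently on, which is the point of putting the switch inside the wrapper.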

We also had a C# component, which called for a DAR report on the available C# libraries. Out of all the available libraries, the following three were shortlisted for the DAR analysis.

  • Metrics.NET (etishor)
  • Statsd
  • Serilog Metrics

The same criteria were considered, and Metrics.NET was chosen ahead of the rest with 50.5 points. It satisfies the performance counter creation requirement by allowing us to create gauges, counters, meters, timers and histograms. The visualization requirement is satisfied through Graphite reporting, along with other reporting targets such as HTTP endpoints, InfluxDB and Elasticsearch. The licensing requirement is fulfilled by the Apache License 2.0. Unlike the Java solution, Metrics.NET provides the enabling and disabling feature through a configuration file. This means that metrics reporting can be completely disabled when it is not needed. Also unlike the Java solution, support is available only through the GitHub page. The second best option was Statsd with 41.5 points. Metrics.NET can be regarded as an equivalent implementation of the Java Metrics library.

Now let us have a look at the possible JavaScript solutions that are out there.

  • metrics
  • measured
  • node-monitor
  • appmetrics

The same set of criteria was considered, and the “metrics” library was easily the best option, with 47.5 points. It satisfies the performance counter creation requirement by providing meters, gauges, counters and timers. The visualization requirement is satisfied through Graphite reporting; CSV and console reporting are among the other options available. The licensing requirement is satisfied by the MIT License, though configuration to enable or disable the features is not provided. Support is available only through the GitHub page. This can also be regarded as an equivalent implementation of the Java Metrics library.

Anyone reading the previous post may have wondered, “What do we do when we don’t need the metrics logic to be executed?” Reading through the solutions in this post, you will already have the answer. Some of the libraries provide an enable/disable feature through configuration, and where it is not present, we can still get it by wrapping the library and implementing the switch ourselves. Another common question is “What about the overhead?” We carried out a simple test based on our application’s normal flow, with and without the metrics library integrated, and noticed no visible performance difference. Besides, the library websites and GitHub pages state that the overhead is very small. Okay, but still, “Won’t having this code integrated into the business logic cause confusion and reduce readability?” Well, to some extent that is inevitable. But you can isolate the metrics code through the use of design patterns. The Observer pattern is one option, where the business logic acts as the Subject and the metrics logic acts as the Observer.
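Here is a rough sketch of the Observer idea. Every name in it (`OrderListener`, `OrderService`, `OrderMetrics`) is hypothetical and only there for illustration. The business class announces what happened, and the metrics class is the only place that records anything.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

// All names here (OrderListener, OrderService, OrderMetrics) are hypothetical.

// Observer interface: the only thing the business logic knows about.
interface OrderListener {
    void orderProcessed(String orderId, long elapsedNanos);
}

// Subject: business logic that reports events and contains no metrics calls.
final class OrderService {
    private final List<OrderListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(OrderListener listener) {
        listeners.add(listener);
    }

    void process(String orderId) {
        long start = System.nanoTime();
        // ... the actual business logic would run here ...
        long elapsed = System.nanoTime() - start;
        for (OrderListener listener : listeners) {
            listener.orderProcessed(orderId, elapsed);
        }
    }
}

// Observer: the single place where metrics are recorded.
final class OrderMetrics implements OrderListener {
    private final AtomicLong processed = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    @Override
    public void orderProcessed(String orderId, long elapsedNanos) {
        processed.incrementAndGet();
        totalNanos.addAndGet(elapsedNanos);
    }

    long processed() {
        return processed.get();
    }

    long totalNanos() {
        return totalNanos.get();
    }
}
```

With this shape, switching metrics off is just a matter of not registering the observer, and swapping the metrics library only touches `OrderMetrics`.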

So the next time you are starting a new project, consider the world of possibilities that capturing performance metrics could present you with. No more dependence on a particular developer for finding bottlenecks, no need to replicate production data to debug and find the issue and, above all, no need to rack your brain over the critical areas where it could’ve gone wrong!

Diagnosing performance issues in production environments – Part I of II

Several reasons could be put forward for the presence of performance issues in large-scale, highly distributed systems. The following factors could be identified as some of the most common causes.

  • Lack of understanding of the language features.
  • Lapse of concentration during development.
  • Ineffective requirements modelling techniques resulting in vague and imprecise requirements.
  • Absence of unit tests and failure to carry out boundary value analysis during the testing phase.
  • Unrealistic schedules resulting in burnout.

The traditional way of investigating a performance issue would predictably start with log analysis, where we would have to read through a bundle of log files to find the cause of the issue. Of course, this becomes easier with log analysis tools such as Splunk. But then again, that depends on the quality of the logging and requires domain knowledge of the whole application to begin with. Another option is debugging the application to locate the flaw. This is not possible unless we can transfer the production data, or simulate it, in the development environment to recreate the behavior, which is as time and resource consuming as it gets. When there are several developers in a team, it is normally difficult for any one of them to pinpoint the critical parts of the application, depending on the size of the team and the number of components concerned. Performance issues can turn into a nightmare if the developers who contributed to certain vital components are no longer around.

Considering the aforementioned problems, what is the most effective way to approach this issue? Performance metrics would be that one magical element that we had missed.

Incorporating performance metrics into the code gives you an idea of how your code performs and behaves. Abnormal behavior can be spotted with ease, provided that the right set of metrics has been used at the right places in your code. It is important to understand that monitoring and diagnosing have different goals. Unlike in monitoring, when it comes to diagnosing we need all the metrics we can possibly have. The following are some of the metrics that could be integrated into your code to better understand its behavior.

  • Timers :
    Provide the execution time of a particular code block.
  • Counters :
    Simple counters that can be incremented and decremented, for example to keep track of the sizes of the data structures in use.
  • Gauges :
    The simplest metric of them all. A gauge returns the current value of something, such as the size of a cache, so that you can keep an eye on its growth.
  • Health Checks :
    Centralize checks on the health of your application’s external dependencies, such as a database connection or a communication channel.
  • Meters :
    Measure the rate of events over time. This could be the number of messages published on a channel in a given interval.
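To make these metric types a little more concrete, here is a minimal sketch in plain Java. The classes are stand-ins with hypothetical names (`SimpleCounter`, `SimpleMeter`, `MetricsSketch`), not the API of any library, and the health check is a placeholder that always passes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;
import java.util.function.IntSupplier;

// Stand-in types for illustration only (hypothetical names, not a library API).

// Counter: goes up and down, e.g. to track items currently being handled.
final class SimpleCounter {
    private final AtomicLong value = new AtomicLong();

    void inc() { value.incrementAndGet(); }
    void dec() { value.decrementAndGet(); }
    long get() { return value.get(); }
}

// Meter: counts events so that a rate can be derived over an interval.
final class SimpleMeter {
    private final AtomicLong events = new AtomicLong();

    void mark() { events.incrementAndGet(); }
    long count() { return events.get(); }

    // Mean number of events per second over the given elapsed time.
    double ratePerSecond(long elapsedNanos) {
        return elapsedNanos <= 0 ? 0.0 : events.get() / (elapsedNanos / 1_000_000_000.0);
    }
}

final class MetricsSketch {
    public static void main(String[] args) {
        List<String> cache = new ArrayList<>();
        SimpleCounter inFlight = new SimpleCounter();
        SimpleMeter published = new SimpleMeter();

        // Gauge: reads the current value on demand, here the cache size.
        IntSupplier cacheSize = cache::size;

        // Health check: placeholder that always passes; a real one would
        // test a database or channel connection.
        BooleanSupplier dependencyHealthy = () -> true;

        // Timer: elapsed time of one code block.
        long start = System.nanoTime();
        inFlight.inc();
        cache.add("entry");
        published.mark();
        inFlight.dec();
        long elapsedNanos = System.nanoTime() - start;

        System.out.println("cache size   = " + cacheSize.getAsInt());
        System.out.println("in flight    = " + inFlight.get());
        System.out.println("events       = " + published.count());
        System.out.println("healthy      = " + dependencyHealthy.getAsBoolean());
        System.out.println("elapsed (ns) = " + elapsedNanos);
    }
}
```

A real library adds much more on top of this, but the basic shapes are the same: a value you change, a value you read, a duration you record and a rate you derive.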

Now that we are familiar with the basic set of metrics, how do we use them at moments of crisis? How would we identify the causes of bottlenecks with them? We need a proper reporting mechanism that is comprehensive enough to help us spot abnormal behavior in the metrics we have included. The best option would be to feed the metrics data into a graphing platform such as Graphite or Ganglia, so that we can just visit the graph and point to the misbehavior and its time of occurrence. We could also feed the data into the log file we are using, to keep everything in one place. So the next time something kicks the bucket, it will only be a matter of checking the metrics graph and finding a pattern. Then we can read the log entries around the times of the abnormal behavior, along with the values retrieved from the metrics, to see where the code has fallen short. Not only can we use metrics to diagnose performance issues, we can also use them to benchmark our application’s performance. This would provide an expected standard level for the testers as well as the clients.
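As a small illustration of feeding data into Graphite, its plaintext protocol accepts one line per data point in the form "path value timestamp", with the timestamp in epoch seconds. The sketch below only shows that wire format. `GraphiteLines` and the metric path are hypothetical names, and the host and port are assumptions about where your Carbon listener runs (2003 is its usual plaintext port). In practice a metrics library’s own Graphite reporter would do this for you.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sketch of Graphite's plaintext format. GraphiteLines is a hypothetical
// helper, not part of any library.
final class GraphiteLines {

    // One data point: "<metric path> <value> <timestamp in epoch seconds>\n".
    static String formatLine(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    // Sends the lines over TCP. Host and port are assumptions about the
    // local setup; this method needs a reachable Carbon listener.
    static void send(String host, int port, String... lines) throws IOException {
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
            }
            out.flush();
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000L;
        // Hypothetical metric path and value, printed instead of sent.
        System.out.print(formatLine("app.orders.processed.count", 12.0, now));
    }
}
```

The same formatted line could just as well be appended to the application log, which is the “keep everything in one place” option mentioned above.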

It’s amazing to think of the world of possibilities that metrics provide us with. But how do we implement all this? If the plan is to implement all this on our own, that would result in a separate side project with the need for additional resources and effort. Apart from that, there would be concerns about running the metrics code in the production environment. Would we need the ability to switch the metrics code on only as and when required? Even if that’s the plan, wouldn’t integrating the metrics code into your business logic reduce readability and cause confusion? But then again, most of the problems in the world of programming have already been solved. It’s only a matter of finding the right solution and adapting it to our own needs! We will discuss the available solutions in the next post!