Optimizing applications and why sampling profilers for java fail
If you pay attention to the performance of your site, you can commonly find points in your code that can be optimized to reduce run time of your application. At Tweakers.net we keep an eye on the site's performance. We (try to) find and optimize slow pages and the overall (perceived) speed of the site. And as you probably know, we did that way before Google started to use page speed as a component of their ranking
Like many players in this world, we also found parts of the code that couldn't easily be optimized much further in either the database, the php-code or the apache-webserver.
Depending on the situation we took different approaches to optimize that: We started using memcached to reduce the number of queries and processing involved with fetching common data. We introduced a message queue in our set-up, allowing asynchronous execution of certain parts of the code. We installed reverse caching proxies, completely eliminating most repeated php-processing involved with semi-dynamic images. And we started using our "pricewatch engine", a internal "service" written in Java to exploit its better performance, more advanced memory management and ability to keep lots of data efficiently in memory.
All those ideas came from some form of profiling. A lot of the optimizations involved reduce or replace queries to our mysql-databases. Choosing which to optimize is mostly a matter of digging through logs, common sense and earlier optimization experiences. That reasonably works well, but only gets you so far. The tools, specifically designed for this, I've used so far for identifying which queries require optimization didn't seem to help much. It is very common for us to see a query being fast in most cases, but with some parameters being relatively slow. Apart from that, we see occasional, but major, slowdowns in queries that are normally very fast. Both situations aren't handled too well by such generic tools.
For our php code it is the same story. Its mostly common sense and very raw profiling using timers. A profiler like xdebug simply introduces way to much overhead (i.e. the "observer effect") to be used in production. Some other tools have evolved nowadays, but they're normally quite costly and I'm not very convinced they help us much in identifying actual bottlenecks. That's mainly because of the simple notion that most of the php-time is spent waiting on resources. In the cases where php was actually a hot spot, it was normally relatively easy to find. The ones we didn't find are probably not too important, as we've no real complaints about our website's performance.
We've also rewritten parts of our code in Java. A Java server application normally has much better performance and the ability to cache large amounts of data with very low latency. I.e. serializing and unserializing data to and from memcached into php is very costly compared to what we do in Java. That is one of the reasons our category pages in the pricewatch are still much faster than before the switch, even though we added several new features and the amount of products and specifications has grown over time.
Finding hot spots in Java code is much harder than in php. Due to compilation and the required restarting of services the overhead of introducing a timer somewhere is very high compared to a properly set up php test site. And to make matters worse, by restarting and/or introducing timing code you may impact the performance of your code much more than with php or sql; the JVM's hot spot optimization in Java is not very easy to predict for normal developers.
Luckily, for Java there are a bunch of profilers available that can be connected to (live) programs to introduce light weight collection of timing data. There are basically two types; sampling profilers and instrumenting profilers. Sampling profilers simply wait a bit, take a snapshot of the call trees and do the same thing again untill stopped. Instrumenting profilers use Java's dynamic code abilities to alter methods of selected classes to add some tiny bit of data collection code at the start and end of a method body.
The instrumentation variant has a major drawback; it introduces a huge overhead that severly impacts the way the JVM will do its work. Basically you're simply profiling a different application. To prevent this overhead, most profilers allow you to select which classes and/or methods should get instrumentation code and which not. This has the advantage of having much less overhead, but may prevent you from detecting a actual hot spot.
The sampling profiler does not have this disadvantage. The only disadvantage most people are aware of is that it may not sample often enough to actually capture very short method calls. But given enough time and a sufficiently small sampling interval, basic statistics suggests that shouldn't be much of a problem. As I can simply re-play the access logs (40M entries currently readily available for me) from our Tomcat application servers, its easy to pound a test environment with hours worth of requests.
Still, the sampling profilers I used didn't necessarily produce hot spot lists I agreed with. And I didn't really understand why. After many thousands of requests to Tomcat, you'd expect it to be able to collect numerous samples and thus easily pinpoint actual hot spots. But it doesn't always work that way.
According to this study it has to do with the fact that profilers don't really sample at the given intervals and at the same time should more randomly distribute the intervals. The intervals are more of a guide for when to take a sample, the actual samples seem to be taken at the next "yield point" (i.e. a point in the executable code at which the JVM allows preemptive switching to another thread). Ironically, those yield points are optimized by the JVM for better performance, potentially keeping them out of the hot spots in your code... The randomization issue also has to do with the yield points. When you switch at an exact interval, you are in sync with any other regular interval in the JVM or OS including thread schedulers. And, annoyingly, even the sampling profilers introduce a bit of an observer effect to the application (they do have to collect the data and such), albeit a much smaller one than those for the instrumentation profilers.
Given all that, you can't really trust the current Java profilers (I know of) too well. This makes it actually quite hard to optimize details in your code as you may not really know which methods are really consuming your cpu time. The general parts of your code (higher up the call-tree) are probably identified mostly correctly, so you will still get a sense of where to look.
I don't really have a solution for any of these issues, but I'd appreciate anyone's opinion or hint for proper tooling on any of the platforms discussed here. For now, its mostly still common sense and a bit of an art to optimize a application
I've seen now that optimizing an application is just all about using the right data structures and use as less concurrency or threading as possible to avoid locking and other evil stuff.
Moving more towards scala (contributing to akka framework) and using rabbitmq amqp messaging (built in erlang) I became a big fan of the actor programming model. It doesn't necessarily solve your performance problems of course, but a least it avoids loads of them.
For MySQL its mostly been the Query Analyzer from the enterprise license which is just only useful for a broad idea of what queries you run... Apart from that I've had looks at just logging queries and analyzing the log-file (with mysql's cli tool and a custom one). But its all a bit fiddly and cumbersome to actually find the queries that need attention.
For PHP I've actually only used xdebug combined with wincachegrind, but xdebug introduces a lot of overhead if its not profiling... Luckily just calling microtime in strategic parts helps a lot and is relatively easy to place. And if it is, its even worse. So that's not really comparable to the normal situation from the start.
My optimizations on Java have indeed mostly included simplifying datastructures (like using Trove's TIntHashMap rather than a normal HashMap with Integers). But those kind of optimizations actually paled by simply reducing the amount of computation required to output "php serialized content" (which was much faster in my case than decoding json into actual objects on the php-side). By switching to a custom built "string buffer" like class that outputs bytes with as little size-checks and array copies as I could manage and outputting igbinary format, I was able to reduce the "output fase" from 90% to a more reasonable 30-50% of the time
Well, that is to say, if i can trust the output of the profilers of course
[Comment edited on Thursday 11 November 2010 23:12]
I'm a big fan of async stuff. I agree with momania..
I've used several profilers in the past, and it really depends on what kind of profiling you're doing.
How I normally profile stuff is:
1) log request times
2) log database time per request
3) log time per query
4) this leaves "real" code / cpu / mem profiling here
Number 4 is actually almost always the most difficult to identify. Many tools are pretty bad at reporting hotpots. I often walk the hotspot tree to get an idea of in which methods the most time is spent, and look for call trees to that method.
After that, I add some logging hete and there to make sure.
If you guys need help identifying some performance problems, let me know.. I'd be happy to take a look!
Keep all appends() in an array, until you really need to concatenate.
That will prevent a lot of resizes, and copies.
Also, I'd like to see some results of larger use of multithreading. Let me elaborate.
Let's say you have a request where you need 5 queries and something from the cache.
r1 = query1();
r2 = query1();
r3 = query1(r2);
r4 = query1();
r5 = query1(r1);
r6 = fetch();
Now... you could have all of this be executed in threads / fibers, and block / join at points where you need to have results (at r2, and r5). I have some prototype implementations of this somewhere.. It would allow you no only to take advantage of multiple threads, but decrease the responsetime of requests.
Ofcourse, there are some problems in this scenario related to multithreading.
Ah.. found it..
I really need to move the old posts sometime
[Comment edited on Thursday 11 November 2010 23:49]
For most queries though we just do it synchronous as that's much easier to implement. And normally, we need the results of earlier queries to make some decisions and/or enhance some other query. Besides, PHP makes it a bit hard to implement it right, apart from the pecl memcached (the new one with a d) library, I don't know of any library that does that automatically in a useful manner.
So it would be interesting to asynchronously execute queries, fetch results from memcached and do the http-requests for the services. But I'm not sure we can really implement it well enough to get much out of it. I did create one explicit asynchronous service-request lately, the recommended products are independent of other queries for the pricewatch' index page and take quite a bit of time to process. So the request is issued before most other elements are processed and than the result is requested when its needed to append the display of the products to the html.
Maybe I'm missing the point (and this isn't specific to the whole profiling issue) but have you tried JRebel?
No, you're not missing the point And although I've seen JRebel earlier, I've never really tried it. Have you? Luckily my redeploy is normally less than a minute with the integrated tomcat in eclipse.michiel_ wrote on Friday 12 November 2010 @ 09:45:
Maybe I'm missing the point (and this isn't specific to the whole profiling issue) but have you tried JRebel?
[Comment edited on Friday 12 November 2010 09:58]
It also has some other reloading functionality (spring etc)
Here are some articles on sampling I have wrote.
Now compare this with our strategy based metering.
I'm mostly trying to figure out which part of my code is taking time. It can indeed be usefull to measure in detail in production as its the only perfectly realistic environment you'll have.
But I do this in my development environment since its just too complicated to set up the required monitoring on (currently) four different servers that are not directly available from my work locations.
An advantage of that is, that it doesn't really matter if there is any overhead and how high it is. The only thing that really matters is that the overhead should not change how "actionable" the results are. I.e. if you're presented with method X being a hot-spot, optimizing that method should indeed result in less time consumed for your application.
From your blog-pages (and your other responses like these to other blogs) I don't really see how usefull your tool (opencore) actually is for that kind of usage. Only that you show it has very little overhead. I don't see any gui (well, JXInsight has a gui, but that's a bit expensive for an application that is already "fast enough"), only xml-dumps and log-lines. And for a quick assessment (my application has close to 2000 methods) of which methods are hot-spots or not, that's not particularly usable.
Sorry but unless you have HUGE low hanging fruit overhead does indeed matter whether its is production or on a workstation. You will end up running down the wrong rabbit hole. The actionable approach does not actually scale outside of desktop type development and especially in a runtime with non-deterministic behavior. Actionable does still require a focus on overhead reduction.
Not sure I follow the logic in the last paragraph. I was not selling a product. Its not about "fast enough". Our customers purchase the product because they can't afford to be blind in production especially when something goes wrong. Most of our customers would already be deploying pretty low latency systems into production (b/c they use our tooling in dev & test as well) so why do you think they still need OpenCore/JXInsight in production? The want access to production activity monitoring in real-time, they want to minimize risk (turn-aound time) when a problem arises, they want to manage capacity in a dynamic environment, they want to deploy resource management supervisory routines to protect their systems, they want to fairly and accurately assign charge backs, they want to detect changes in behavior across releases,......... "fast enough" is not a reason not to perform production monitoring unless its a desktop app and workload is know and restricted to a single user.
By the way it is pretty easy to write XML transformations. In addition we already ship reporting like plugins such as top for metering & metrics.
My response was in relation to your disparaging of instrumentation/event based profilers. Purely technical. Not sure why you felt the need to bring cost into this.
Our environment is a bit too small to need any of the reasons you describe for OpenCore. So although I do want to know which parts of the code are responsible for the largest bits of the runtime, the need to continuously monitor the performance and act on that doesn't really exist. Simply because such problems haven't existed yet with the java-sections of our environment over the last few years its in use. And since the java applications are relatively independent of other systems and change relatively little, I don't really expect them to occur as well, while at the same time their performance doesn't change either.
Having said that, monitoring won't hurt either, but that just wasn't the type of profiling I was looking for and/or described in this blog.
The reason to bring cost into the discussion is that my applications are appearantly what you dismiss as more of a desktop class of applications. And yes, if I allocate 2-4GB of memory to my JVM, it can indeed run on my desktop. So I'd like to see what happens in the applications, make some changes, retest etc. That is best helped with a tool that offers a quick look at the performance. And tools like Yourkit or JProfiler offer such quick looks.
A "top"-tool for OpenCore's metering & metrics might offer that as well, but from the linked resources I couldn't find such a tool. So I assumed that'd require the complete package, JXInsight. Given the difference in pricing (and pricing model), for that specific need, JXInsight is simply too expensive for me/my company. Hence, bringing cost to it.
You have demonstrated in your blog that instrumentation can have little overhead. But I'm not really sure if you only instrument the longer methods, or also the more trivial ones. For instance, if using a HashMap happens to be a major contributor to the runtime, it may be usefull to see that 'put' (or even 'transfer' inside put) requires a lot of time, rather than the methods that - amongst others - issues the put's.
Still, by definition, instrumenting/event based profilers adds overhead to every instrumented method. Skipping certain methods may yield usable results, but may also mask the real culprit and only indicate roughly where to look
Actionable => Strategy Based => Our costing runtime is cost aware itself
More importantly OpenCore (JXInsight) scales in terms of the problem's complexity, domain and level of abstraction.
This is not a sales pitch. But I am very proud of what we have achieved. I don't believe there is anything more advanced or more beautifully designed available today.
Is it for everyone? No. In fact many times we have instructed companies to run with other more primitive approaches or solutions when the team itself is not experienced enough or has a narrow remit in terms of the data collected usage.
Again my intention was to rebut your original statement on the high overhead of instrumentation/event based profilers in comparison with sampling based ones which I would argue in this day & age of hundreds of threads and very deep call stacks and on-exception reporting are inadequate to the task at hand.
Comments are closed