Let's say you want to build an SSL server in C++. Transmitting data is going to
be a major part of your application, so you need it to be fast and efficiently
use system resources, especially processor cores.
You may have heard of Asio (possibly better known as Boost.Asio). Asio is a
"cross-platform C++ library for network and low-level I/O programming that
provides developers with a consistent asynchronous model". It's widely used and
mature, and may in the future become part of the C++17 standard library. Note
that this post is not a tutorial or an introduction to Asio, but rather a study
on how scalable it is in our use case, why it scales poorly and how to improve
it.
Benchmarking
Asio includes SSL support using the OpenSSL library. I've created an example
"naive" benchmark that sets up a server with a given number of threads,
creates a given number of connections and measures the time it takes each of
them to send M messages of size N. The code uses Asio 1.10.6, which is the
current stable version. Here's how I compile it on OS X:
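(A typical invocation; the OpenSSL paths below are assumptions - adjust them to wherever Asio and OpenSSL live on your system, e.g. a Homebrew install under /usr/local/opt/openssl:)
$ clang++ -std=c++11 -O3 benchmark_naive.cpp -o benchmark_naive \
    -Iasio-1.10.6/include/ \
    -I/usr/local/opt/openssl/include/ -L/usr/local/opt/openssl/lib/ \
    -lssl -lcrypto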
And here are example benchmark results on my machine (that has 4 cores with
Hyper-threading, adding up to 8 virtual cores):
$ # 1 thread, 1 connection, 1000 messages * 10 MB
$ ./benchmark_naive 1 1 1000 $((10*1024*1024))
10000 megabytes sent and received in 22.392 seconds. (446.588 MB/s)
$ # 8 threads, 4 connections, 250 messages * 10 MB
$ ./benchmark_naive 8 4 250 $((10*1024*1024))
10000 megabytes sent and received in 52.821 seconds. (189.319 MB/s)
Let's put all of my results into a chart:
As you can see, the chart shows an increase in bandwidth for two threads, which
we would expect even for a single connection (as one endpoint has to encrypt
and the other decrypt the data). After that point, the bandwidth rapidly falls off
with the number of threads, reaching about 200 MB/s when used with multiple
connections. Connections can potentially run in parallel, so the ideal chart
would show a linear speedup with the number of threads, capping at the ratio of
2 threads/connection.
Summarizing the experiment, Asio with OpenSSL not only fails to scale with the
number of threads, it actually slows down considerably when the number of
concurrent operations is high enough. This suggests heavy lock contention.
Let's check our theory in a profiler:
Most of our application's time is spent waiting for locks; most of the locking
occurs in the OpenSSL function ERR_get_state, and it can most often be found
in the call tree of asio::ssl::detail::engine::perform. A look at the
engine::perform implementation in engine.ipp quickly shows us that...
everything is fine. Sure, you could maybe make a few small optimizations, but
in the bigger picture OpenSSL is used correctly¹ ². There's no error there
that would result in lock contention. We conclude that the scalability
bottleneck lies in OpenSSL itself, more specifically in its error-handling
functions. So what now?
Trying BoringSSL
After Heartbleed, a few OpenSSL forks such as LibreSSL and BoringSSL
appeared with a vision to trim down, modernize and secure OpenSSL's code while
remaining mostly source-compatible with the original. A quick look at
LibreSSL's err.c file shows it's largely unchanged from OpenSSL's
version. But BoringSSL's err.c has obviously been reworked
and now uses thread-local storage instead of locking a global mutex for data
access!
As I mentioned, BoringSSL is mostly source-compatible with OpenSSL.
Unfortunately, there are a few changes to make before we
can use it with Asio. There are open issues in the Asio
GitHub repository to integrate these changes upstream, but even then
compatibility with BoringSSL is currently a moving target as the code is still
being cleaned up.
Let's compile the example with BoringSSL and run our benchmarks:
$ clang++ -std=c++11 -O3 benchmark_naive.cpp -o benchmark_naive \
-Iasio-1.10.6/include/ -Iboringssl/include/ \
-Lboringssl/build/ssl/ -Lboringssl/build/crypto/ \
-lssl -lcrypto
$ # 8 threads, 4 connections, 250 messages * 10 MB
$ ./benchmark_naive 8 4 250 $((10*1024*1024))
10000 megabytes sent and received in 11.591 seconds. (862.738 MB/s)
Well, that's much better. You may stop reading here, knowing that as of this
writing OpenSSL's error handling causes it to scale poorly with the number of
threads, while BoringSSL scales much better indeed. But there's still a
fall-off in bandwidth as the number of threads increases, so let's try to
address that as well.
Cores, threads and io_services
If you want to find a bottleneck, profiling is usually the best answer:
Looks like Asio's internal thread synchronization is the bottleneck this time.
There are two main approaches to get scalability in Asio: thread-per-core and
io_service-per-core. In the "naive" example I'm using the thread-per-core
approach as it always seemed more natural to me - we're basically creating a
threadpool that - if needed - can dedicate one or more threads to a single
connection, as opposed to io_service-per-core where each connection would be
served by at most one thread. But in light of our profiling results, I've
modified the example to use the second approach.
I've added a new class, IoServices, objects of which hold multiple
io_services. When we need an io_service - e.g. for a new client- or
server-side socket - we call ioServices.get() which will return one of the
stored io_service objects on a round-robin basis.
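As a minimal sketch of that class (the name IoServices and the get() method come from the description above; the per-core worker threads, the work objects and the round-robin counter are assumed implementation details, not necessarily the benchmark's actual code), it could look roughly like this:

#include <asio.hpp>

#include <atomic>
#include <memory>
#include <thread>
#include <vector>

// Holds one io_service (each run by its own worker thread) and hands
// them out in round-robin order.
class IoServices
{
public:
    explicit IoServices(std::size_t count)
        : services_(count), next_(0)
    {
        for (auto &service : services_) {
            service.reset(new PerCore);
            // The work object keeps io_service::run() from returning
            // while there is nothing to do yet.
            service->work.reset(new asio::io_service::work(service->io));
            asio::io_service *io = &service->io;
            service->thread = std::thread([io] { io->run(); });
        }
    }

    ~IoServices()
    {
        for (auto &service : services_) {
            service->work.reset();  // allow run() to return
            service->io.stop();
            service->thread.join();
        }
    }

    // Returns one of the stored io_service objects on a round-robin basis.
    asio::io_service &get()
    {
        return services_[next_++ % services_.size()]->io;
    }

private:
    struct PerCore
    {
        asio::io_service io;
        std::unique_ptr<asio::io_service::work> work;
        std::thread thread;
    };

    std::vector<std::unique_ptr<PerCore>> services_;
    std::atomic<std::size_t> next_;
};

A server would then construct something like IoServices ioServices(std::thread::hardware_concurrency()) and call ioServices.get() for every new socket.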
Let's put the results into a final chart. This time we'll just focus on 8
threads, 20 connections, and directly compare our different benchmark
applications. Note that I'm cheating here, if just a little bit:
io_service-per-core will perform better when there are multiple connections
per io_service, as it will result in a more balanced CPU load. Since we're
considering a server scenario, though, 20 connections is still a very low
number.
This still isn't the ideal chart (of course, due to threading overhead no
implementation could produce the ideal results), but it actually scales up
with the number of threads used.
That's it for the post. Now that you know how to make an Asio-based SSL server
scale, you can go build faster and safer applications.
Thanks for reading!
UPDATE 2015-08-17: While it's true that Asio uses OpenSSL API
correctly, it's possible to write multi-threaded code that uses OpenSSL
and is not constrained by the error-handling bottleneck. Asio calls
ERR_clear_error() before each call to SSL_*() functions, as OpenSSL
documentation states "The current thread's error queue must be empty
before the TLS/SSL I/O operation is attempted". To avoid the bottleneck,
instead of clearing the error queue before each operation, you have to
make sure to clear the queue after an error has occurred. This is
something that can be done in Asio code. ↩
UPDATE 2015-08-20: The update above turned out to be incorrect, as
OpenSSL will still call locking functions internally. While the solution I
described above will decrease lock contention, it doesn't completely
remove the bottleneck. ↩
This is a status update for my Google Summer of Code 2013 project -
implementing advanced statistics importers for Amarok. Please read the
first post if you would like to know more about the project.
So the end of Google Summer of Code 2013 has come. It's the best moment to talk
about what was planned, what was done, and - traditionally - what the
future holds. Oh, and let's get some stats and pictures (because that's what
we're here for)!
Plans
rewrite the following statistics importers based on the existing StatSyncing
framework:
Amarok 1.4 (FastForward)
Apple iTunes
create new statistics importers for the following media players:
Amarok 2.x
Rhythmbox
soft goal: two-way synchronization for created importers (I guess I
shouldn't call them "importers" anymore, but something like "synchronization
targets" just sounds too unwieldy)
softer goal: write tests for created importers
side goal: make new importers easy to write.
Reality
We all know that thing about plans and reality in software development, right?
So let's recap what I've done:
I created a framework for importers that takes care of managing them and
loading and saving their configuration
as a consequence, a user can create, remove and reconfigure importers
on a whim
a programmer can also add as many configuration options as they want and store
as much state in the importer as needed
I created the following statistics importer types, loadable as plugins:
Amarok 1.4
Amarok 2.4
Apple iTunes
Banshee
Clementine
Rhythmbox
...and I implemented two-way synchronization for each of them
I wrote tests for every importer and all parts of the framework (except for
one helper, as it's destined to be removed)
I made new importers very easy to create - and they automatically gain all
goodies from the framework.
Please take a look at earlier posts if you're interested in pictures - I didn't
want to overload this one with duplicates.
Statistics
Let's get the total number of changes reported by git:
The difference between these numbers tells us a story: at least about ⅖ of the
code I wrote was removed afterwards through heavy refactoring. (Alright,
maybe this wasn't much of a story.) In other words, I not only overshot my
initial plans, I was also making sure I did it with style. ;) Of course
the numbers take into account only the changes that were finally committed -
there were a lot more in between.
Let's make some code size comparisons between old and new importers.
As you can see, the numbers are similar. Rhythmbox and iTunes importers are made
bigger by XML-processing code (oh how I hate it), and Amarok 2.x and FastForward
importers by custom, rich configuration widgets. The simplest importers,
Clementine and Banshee, are small and pretty.
Oh, but that's not the whole story here, is it? All of the new importers also
contain write capabilities - they can sync the statistics back to the foreign
media player. Without it a new importer can easily fit inside 100 lines, as
demonstrated in one of my previous posts. Mission accomplished.
As an aside, I find it interesting that the number of lines translates very
well into the number of characters. The average number of characters per line
for the measured code is 38.17, with a standard deviation of 2.60 characters
between files.
Oh, and I have proof that I was actually doing something during GSoC. Do take
a look at this video that I made, if only for the amazing soundtrack (720p
recommended). For details, please see the video's description.
The future
So, it's the end of Google Summer of Code, but it's not the end of the project
nor my contribution to Amarok. Arguably the most important event in the
project's lifetime - code review - still lies ahead. Not only that, but I
already have some further refactorings in mind.
Other than the project, there's always a lot to do for Amarok, and the great
community around it makes it hard to leave - so I'm not. There's just too much
fun to be had. ;)
Well, that's it for the post. I'm going to take a few days off, and then the
academic term starts and I go back to my daytime job - I'll need some time to
adjust my schedule. Thanks for sticking around. It's been - and continues to be
- a pleasure!
This is a status update for my Google Summer of Code 2013 project -
implementing advanced statistics importers for Amarok. Please read the
first post if you would like to know more about the project.
Last week I asked my mentor if I could skip that week's report and make the next
one (i.e. this one) a double one instead. The reason for this was that I was
working almost exclusively on tests, and tests more often than not make for a
dull post.
Well, after two weeks I have even more tests. I'd go as far as to say that
things are satisfyingly tested.
I made a base test suite that relies on convention; importers are expected to
have tracks with certain metadata in their databases, and then tests make sure
that this data is imported correctly. Tracks with the right metadata are pre-
created and stored in the source tree, so creating a test database is a matter
of adding them to the media player's collection.
There isn't much more to tests that's interesting, so let's skip to the next
topic. GSoC 2013 is in the homestretch. September 16 is "soft pencils-down"
and September 23 a "hard pencils-down" date. We're expected to have all the code
done on September 16, and to spend the following week on documentation. Having
that in mind, this is my much-more-detailed-than-usual plan for this week:
weekend: more tests! Supplement existing test suites.
Since I've been documenting as I went, the week between September 16 and 23 will
be devoted to small bugfixes, design tweaks, typo fixes. Also that's the week
where I remove old media-player importing capabilities and take a minute or two
to celebrate.
As always, you can check out my progress on my public Amarok clone. The branch
is named gsoc-importers.
Thanks for reading!
The iTunes database will then have to be imported back into iTunes.
But hey, it's synchronization. Kind of. ↩
This is a status update for my Google Summer of Code 2013 project -
implementing advanced statistics importers for Amarok. Please read the
first post if you would like to know more about the project.
Fixing
Last week I fixed the Amarok 2.x embedded database importer. There were quite a few
problems with handling an external database process:
if the QProcess object received commands (start(), kill(),
terminate()) from a thread other than the one that created it, it
resulted in incorrect behavior, often in the form of a crash (it doesn't get
much more wrong than that)
the QProcess object does not encapsulate the process; it's only an
interface. When a QProcess object was destroyed while the process was still
running, it issued a warning and killed the process with SIGKILL. Normally
you'd want to stop an ongoing process with SIGTERM, so manual lifetime
management is needed (see the sketch after this list)
if the server had stopped (which it does after a period of inactivity),
calling QSqlDatabase::removeDatabase() resulted in a SIGPIPE signal from
inside the MySQL driver
related to that, an old QSqlDatabase connection silently failed to work if
the server had been restarted; there would be no warnings on
QSqlDatabase::open().
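As a minimal sketch of the graceful-shutdown idea (this is not Amarok's actual code; the function name and timeout are assumptions), stopping the server before the QProcess object goes away could look roughly like this, called from the thread that created the QProcess:

#include <QProcess>

// Ask the external server to quit with SIGTERM (QProcess::terminate()),
// wait a while, and only fall back to SIGKILL (QProcess::kill()) if it
// refuses to exit. Doing this before the QProcess object is destroyed
// avoids the "destroyed while process is still running" warning.
static void shutdownServerProcess(QProcess &process, int timeoutMs = 30000)
{
    if (process.state() == QProcess::NotRunning)
        return;

    process.terminate();                 // SIGTERM on Unix
    if (!process.waitForFinished(timeoutMs))
    {
        process.kill();                  // last resort: SIGKILL
        process.waitForFinished();
    }
}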
There was also the question of waiting for the mysqld process to be ready to
serve. After some research I decided to adopt the approach that MySQL startup
scripts (including the "official" script, mysql.server) use, which is to wait
for the server's PID file to be modified. Overall, I'm quite satisfied with
the results; it's reliable and fast enough.
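A rough sketch of that waiting logic (again, not the actual Amarok code; the function name, polling interval and timeout are assumptions) might look like this:

#include <QDateTime>
#include <QFileInfo>
#include <QString>

#include <chrono>
#include <thread>

// Wait until mysqld creates or touches its PID file, which the MySQL startup
// scripts treat as the sign that the server is ready to serve.
// modifiedBeforeStart is the PID file's timestamp recorded before mysqld was
// started (an invalid QDateTime if the file did not exist yet).
// Returns false on timeout.
static bool waitForPidFile(const QString &pidFilePath,
                           const QDateTime &modifiedBeforeStart,
                           int timeoutMs = 30000)
{
    QFileInfo pidFile(pidFilePath);
    for (int waited = 0; waited < timeoutMs; waited += 100)
    {
        pidFile.refresh();  // QFileInfo caches file attributes
        if (pidFile.exists() && pidFile.lastModified() != modifiedBeforeStart)
            return true;    // the file was (re)created or touched since startup
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return false;
}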
Creating
By the way, there's a new statistics synchronization target: Clementine.
This brings the total number of importers to six, and marks the end of
implementing new importers. From now on, I intend to focus on existing code
including, but not limited to, read-write capabilities and - of course -
testing.
Simplifying
You may have noticed one additional target on the screenshot above: "Example."
Another thing I focused on this week was code deduplication and simplifying
creation of new importers. To show off, I prepared a basic "Example" importer.
Below is the full C++ code. Bear in mind that aside from the code, importers
need a plugin's *.desktop file and a CMakeLists.txt file. Also bear in mind that
with this code the importer is already fully reconfigurable and instances of it
can be created and removed at the user's leisure.
This is a status update for my Google Summer of Code 2013 project -
implementing advanced statistics importers for Amarok. Please read the
first post if you would like to know more about the project.
Last week was free of surprises. I applied some more polish to both the
Importers framework and the concrete importers themselves, and deduplicated
some code - just general maintenance. I had a very technical to-do list
containing - mostly - very minor entries, so there's nothing to write about in
this post. The bottom line is: the overall quality of my project is improving.
The list keeps gaining new entries, so hopefully I'll still be having plenty to
do.
One thing that continues to give me a bit of trouble is the Amarok 2.x embedded
database importer. There are surprisingly many problems with managing a single
server process, shared between method calls, in a multithreaded environment.
The QProcess class has its own set of quirks on top of that which need to be
taken into account, especially when it comes to destroying the object in a
multithreaded environment. I've got ideas, but that's something for the next
post.
In other news: Banshee importer!
Alright, that's me done for the post. It's been a short one, but as I said - no
major problems, no major changes. A steady march to high quality.
As always, you can check out my progress on my public Amarok clone. The branch
is named gsoc-importers.