Let's say you want to build an SSL server in C++. Transmitting data is going to
be a major part of your application, so you need it to be fast and efficiently
use system resources, especially processor cores.
You may have heard of Asio (possibly better known as Boost.Asio). Asio is a
"cross-platform C++ library for network and low-level I/O programming that
provides developers with a consistent asynchronous model". It's widely used and
mature, and may in the future become part of the C++17 standard library. Note
that this post is not a tutorial or an introduction to Asio, but rather a study
on how scalable it is in our use case, why it scales poorly and how to improve
it.
Benchmarking
Asio includes SSL support using the OpenSSL library. I've created an example
"naive" benchmark that sets up a server with a given number of threads,
creates a given number of connections and measures the time it takes each of
them to send M messages of size N. The code uses Asio 1.10.6, which is the
current stable version. Here's how I compile it on OS X:
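(A typical invocation; the OpenSSL paths below are assumptions - adjust them to wherever Asio and OpenSSL live on your system, e.g. a Homebrew install under /usr/local/opt/openssl:)
$ clang++ -std=c++11 -O3 benchmark_naive.cpp -o benchmark_naive \
    -Iasio-1.10.6/include/ \
    -I/usr/local/opt/openssl/include/ -L/usr/local/opt/openssl/lib/ \
    -lssl -lcrypto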
And here are example benchmark results on my machine (that has 4 cores with
Hyper-threading, adding up to 8 virtual cores):
$ # 1 thread, 1 connection, 1000 messages * 10 MB
$ ./benchmark_naive 1 1 1000 $((10*1024*1024))
10000 megabytes sent and received in 22.392 seconds. (446.588 MB/s)
$ # 8 threads, 4 connections, 250 messages * 10 MB
$ ./benchmark_naive 8 4 250 $((10*1024*1024))
10000 megabytes sent and received in 52.821 seconds. (189.319 MB/s)
Let's put all of my results into a chart:
As you can see, the chart shows an increase in bandwidth for two threads, which
we would expect even for a single connection (as one endpoint has to encrypt
and the other decrypt the data). After that point, the bandwidth rapidly falls off
with the number of threads, reaching about 200 MB/s when used with multiple
connections. Connections can potentially run in parallel, so the ideal chart
would show a linear speedup with the number of threads, capping at the ratio of
2 threads/connection.
Summarizing the experiment, Asio with OpenSSL not only fails to scale with the
number of threads, it actually slows down considerably when the number of
concurrent operations is high enough. This suggests heavy lock contention.
Let's check our theory in a profiler:
Most of our application's time is spent waiting for locks; most of the locking
occurs in the OpenSSL function ERR_get_state, and it can most often be found
in the call tree of asio::ssl::detail::engine::perform. A look at the
engine::perform implementation in engine.ipp quickly shows us that...
everything is fine. Sure, you could maybe make a few small optimizations, but
in the bigger picture OpenSSL is used correctly¹ ². There's no error there
that would result in lock contention. We conclude that the scalability
bottleneck lies in OpenSSL itself, more specifically in its error-handling
functions. So what now?
Trying BoringSSL
After Heartbleed, a few OpenSSL forks such as LibreSSL and BoringSSL
appeared with a vision to trim down, modernize and secure OpenSSL's code while
remaining mostly source-compatible with the original. A quick look at
LibreSSL's err.c file shows it's largely unchanged from OpenSSL's
version. But BoringSSL's err.c has obviously been reworked
and now uses thread-local storage instead of locking a global mutex for data
access!
As I mentioned, BoringSSL is mostly source-compatible with OpenSSL.
Unfortunately, there are a few changes to make before we
can use it with Asio. There are open issues in the Asio
GitHub repository to integrate these changes upstream, but even then
compatibility with BoringSSL is currently a moving target as the code is still
being cleaned up.
Let's compile the example with BoringSSL and run our benchmarks:
$ clang++ -std=c++11 -O3 benchmark_naive.cpp -o benchmark_naive \
-Iasio-1.10.6/include/ -Iboringssl/include/ \
-Lboringssl/build/ssl/ -Lboringssl/build/crypto/ \
-lssl -lcrypto
$ # 8 threads, 4 connections, 250 messages * 10 MB
$ ./benchmark_naive 8 4 250 $((10*1024*1024))
10000 megabytes sent and received in 11.591 seconds. (862.738 MB/s)
Well, that's much better. You may stop reading here, knowing that as of this
writing OpenSSL's error handling causes it to scale poorly with the number of
threads, while BoringSSL scales much better indeed. But there's still a
fall-off in bandwidth as the number of threads increases, so let's try to
address that as well.
Cores, threads and io_services
If you want to find a bottleneck, profiling is usually the best answer:
Looks like Asio's internal thread synchronization is the bottleneck this time.
There are two main approaches to get scalability in Asio: thread-per-core and
io_service-per-core. In the "naive" example I'm using the thread-per-core
approach as it always seemed more natural to me - we're basically creating a
threadpool that - if needed - can dedicate one or more threads to a single
connection, as opposed to io_service-per-core where each connection would be
served by at most one thread. But in light of our profiling results, I've
modified the example to use the second approach.
I've added a new class, IoServices, objects of which hold multiple
io_services. When we need an io_service - e.g. for a new client- or
server-side socket - we call ioServices.get() which will return one of the
stored io_service objects on a round-robin basis.
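As a minimal sketch of that class (the name IoServices and the get() method come from the description above; the per-core worker threads, the work objects and the round-robin counter are assumed implementation details, not necessarily the benchmark's actual code), it could look roughly like this:

#include <asio.hpp>

#include <atomic>
#include <memory>
#include <thread>
#include <vector>

// Holds one io_service (each run by its own worker thread) and hands
// them out in round-robin order.
class IoServices
{
public:
    explicit IoServices(std::size_t count)
        : services_(count), next_(0)
    {
        for (auto &service : services_) {
            service.reset(new PerCore);
            // The work object keeps io_service::run() from returning
            // while there is nothing to do yet.
            service->work.reset(new asio::io_service::work(service->io));
            asio::io_service *io = &service->io;
            service->thread = std::thread([io] { io->run(); });
        }
    }

    ~IoServices()
    {
        for (auto &service : services_) {
            service->work.reset();  // allow run() to return
            service->io.stop();
            service->thread.join();
        }
    }

    // Returns one of the stored io_service objects on a round-robin basis.
    asio::io_service &get()
    {
        return services_[next_++ % services_.size()]->io;
    }

private:
    struct PerCore
    {
        asio::io_service io;
        std::unique_ptr<asio::io_service::work> work;
        std::thread thread;
    };

    std::vector<std::unique_ptr<PerCore>> services_;
    std::atomic<std::size_t> next_;
};

A server would then construct something like IoServices ioServices(std::thread::hardware_concurrency()) and call ioServices.get() for every new socket.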
Let's put the results into a final chart. This time we'll just focus on 8
threads, 20 connections, and directly compare our different benchmark
applications. Note that I'm cheating here, if just a little bit:
io_service-per-core will perform better when there are multiple connections
per io_service, as it will result in a more balanced CPU load. Since we're
considering a server scenario, though, 20 connections is still a very low
number.
This still isn't the ideal chart (of course, due to threading overhead no
implementation could produce the ideal results), but it actually scales up
with the number of threads used.
That's it for the post. Now that you know how to make an Asio-based SSL server
scale, you can go build faster and safer applications.
Thanks for reading!
UPDATE 2015-08-17: While it's true that Asio uses OpenSSL API
correctly, it's possible to write multi-threaded code that uses OpenSSL
and is not constrained by the error-handling bottleneck. Asio calls
ERR_clear_error() before each call to SSL_*() functions, as OpenSSL
documentation states "The current thread's error queue must be empty
before the TLS/SSL I/O operation is attempted". To avoid the bottleneck,
instead of clearing the error queue before each operation, you have to
make sure to clear the queue after an error has occurred. This is
something that can be done in Asio code. ↩
UPDATE 2015-08-20: The update above turned out to be incorrect, as
OpenSSL will still call locking functions internally. While the solution I
described above will decrease lock contention, it doesn't completely
remove the bottleneck. ↩
This is a status update for my Google Summer of Code 2013 project -
implementing advanced statistics importers for Amarok. Please read the
first post if you would like to know more about the project.
So the end of Google Summer of Code 2013 has come. It's the best moment to talk
about what was planned, what was done, and - traditionally - what the
future holds. Oh, and let's get some stats and pictures (because that's what
we're here for)!
Plans
rewrite the following statistics importers based on the existing StatSyncing
framework:
Amarok 1.4 (FastForward)
Apple iTunes
create new statistics importers for the following media players:
Amarok 2.x
Rhythmbox
soft goal: two-way synchronization for created importers (I guess I
shouldn't call them "importers" anymore, but something like "synchronization
targets" just sounds too unwieldy)
softer goal: write tests for created importers
side goal: make new importers easy to write.
Reality
We all know that thing about plans and reality in software development, right?
So let's recap what I've done:
I created a framework for importers that takes care of managing them and
loading and saving their configuration
as a consequence, a user can create, remove and reconfigure importers
on a whim
a programmer can also add as many configuration options as they want and store
as much state in the importer as needed
I created the following statistics importer types, loadable as plugins:
Amarok 1.4
Amarok 2.4
Apple iTunes
Banshee
Clementine
Rhythmbox
...and I implemented two-way synchronization for each of them
I wrote tests for every importer and all parts of the framework (except for
one helper, as it's destined to be removed)
I made new importers very easy to create - and they automatically gain all
goodies from the framework.
Please take a look at earlier posts if you're interested in pictures - I didn't
want to overload this one with duplicates.
Statistics
Let's get the total number of changes reported by git:
The difference between these numbers tells us a story: at least about ⅖ of the
code I wrote was removed afterwards through heavy refactoring. (Alright,
maybe this wasn't much of a story.) In other words, I not only overshot my
initial plans, I was also making sure I did it with style. ;) Of course
the numbers take into account only the changes that were finally committed -
there were a lot more in between.
Let's make some code size comparisons between old and new importers.
As you can see, the numbers are similar. Rhythmbox and iTunes importers are made
bigger by XML-processing code (oh how I hate it), and Amarok 2.x and FastForward
importers by custom, rich configuration widgets. The simplest importers,
Clementine and Banshee, are small and pretty.
Oh, but that's not the whole story here, is it? All of the new importers also
contain write capabilities - they can sync the statistics back to the foreign
media player. Without it a new importer can easily fit inside 100 lines, as
demonstrated in one of my previous posts. Mission accomplished.
As an aside, I find it interesting that the number of lines translates very
well into the number of characters. The average number of characters per line
for the measured code is 38.17, with a standard deviation of 2.60 characters
between files.
Oh, and I have proof that I was actually doing something during GSoC. Do take
a look at this video that I made, if only for the amazing soundtrack (720p
recommended). For details, please see the video's description.
The future
So, it's the end of Google Summer of Code, but it's not the end of the project
nor my contribution to Amarok. Arguably the most important event in the
project's lifetime - code review - still lies ahead. Not only that, but I
already have some further refactorings in mind.
Other than the project, there's always a lot to do for Amarok, and the great
community around it makes it hard to leave - so I'm not. There's just too much
fun to be had. ;)
Well, that's it for the post. I'm going to take a few days off, and then the
academic term starts and I go back to my daytime job - I'll need some time to
adjust my schedule. Thanks for sticking around. It's been - and continues to be
- a pleasure!
This is a status update for my Google Summer of Code 2013 project -
implementing advanced statistics importers for Amarok. Please read the
first post if you would like to know more about the project.
Last week I asked my mentor if I could skip that week's report and make the next
one (i.e. this one) a double one instead. The reason for this was that I was
working almost exclusively on tests, and tests more often than not make for a
dull post.
Well, after two weeks I have even more tests. I'd go as far as to say that
things are satisfyingly tested.
I made a base test suite that relies on convention; importers are expected to
have tracks with certain metadata in their databases, and then tests make sure
that this data is imported correctly. Tracks with the right metadata are pre-
created and stored in the source tree, so creating a test database is a matter
of adding them to the media player's collection.
There isn't much more to tests that's interesting, so let's skip to the next
topic. GSoC 2013 is in the homestretch. September 16 is "soft pencils-down"
and September 23 a "hard pencils-down" date. We're expected to have all the code
done on September 16, and to spend the following week on documentation. Having
that in mind, this is my much-more-detailed-than-usual plan for this week:
weekend: more tests! Supplement existing test suites.
Since I've been documenting as I went, the week between September 16 and 23 will
be devoted to small bugfixes, design tweaks, typo fixes. Also that's the week
where I remove old media-player importing capabilities and take a minute or two
to celebrate.
As always, you can check out my progress on my public Amarok clone. The branch
is named gsoc-importers.
Thanks for reading!
The iTunes database will then have to be imported back into iTunes.
But hey, it's synchronization. Kind of. ↩
This is a status update for my Google Summer of Code 2013 project -
implementing advanced statistics importers for Amarok. Please read the
first post if you would like to know more about the project.
Fixing
Last week I fixed the Amarok 2.x embedded database importer. There were quite a few
problems with handling an external database process:
if the QProcess object received commands (start(), kill(),
terminate()) from a thread other than the one that created it, it
resulted in incorrect behavior, often in the form of a crash (it doesn't get
much more wrong than that)
the QProcess object does not encapsulate the process; it's only an
interface. When a QProcess object was destroyed while the process was still
running, it issued a warning and killed the process with SIGKILL. Normally
you'd want to stop an ongoing process with SIGTERM, so manual lifetime
management is needed (see the sketch after this list)
if the server had stopped (which it does after a period of inactivity),
calling QSqlDatabase::removeDatabase() resulted in a SIGPIPE signal from
inside the MySQL driver
related to that, an old QSqlDatabase connection silently failed to work if
the server had been restarted; there would be no warnings on
QSqlDatabase::open().
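As a minimal sketch of the graceful-shutdown idea (this is not Amarok's actual code; the function name and timeout are assumptions), stopping the server before the QProcess object goes away could look roughly like this, called from the thread that created the QProcess:

#include <QProcess>

// Ask the external server to quit with SIGTERM (QProcess::terminate()),
// wait a while, and only fall back to SIGKILL (QProcess::kill()) if it
// refuses to exit. Doing this before the QProcess object is destroyed
// avoids the "destroyed while process is still running" warning.
static void shutdownServerProcess(QProcess &process, int timeoutMs = 30000)
{
    if (process.state() == QProcess::NotRunning)
        return;

    process.terminate();                 // SIGTERM on Unix
    if (!process.waitForFinished(timeoutMs))
    {
        process.kill();                  // last resort: SIGKILL
        process.waitForFinished();
    }
}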
There was also the question of waiting for the mysqld process to be ready to
serve. After some research I decided to adopt the approach that MySQL startup
scripts (including the "official" script, mysql.server) use, which is to wait
for the server's PID file to be modified. Overall, I'm quite satisfied with
the results; it's reliable and fast enough.
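A rough sketch of that waiting logic (again, not the actual Amarok code; the function name, polling interval and timeout are assumptions) might look like this:

#include <QDateTime>
#include <QFileInfo>
#include <QString>

#include <chrono>
#include <thread>

// Wait until mysqld creates or touches its PID file, which the MySQL startup
// scripts treat as the sign that the server is ready to serve.
// modifiedBeforeStart is the PID file's timestamp recorded before mysqld was
// started (an invalid QDateTime if the file did not exist yet).
// Returns false on timeout.
static bool waitForPidFile(const QString &pidFilePath,
                           const QDateTime &modifiedBeforeStart,
                           int timeoutMs = 30000)
{
    QFileInfo pidFile(pidFilePath);
    for (int waited = 0; waited < timeoutMs; waited += 100)
    {
        pidFile.refresh();  // QFileInfo caches file attributes
        if (pidFile.exists() && pidFile.lastModified() != modifiedBeforeStart)
            return true;    // the file was (re)created or touched since startup
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    return false;
}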
Creating
By the way, there's a new statistics synchronization target: Clementine.
This brings the total number of importers to six, and marks the end of
implementing new importers. From now on, I intend to focus on existing code
including, but not limited to, read-write capabilities and - of course -
testing.
Simplifying
You may have noticed one additional target on the screenshot above: "Example."
Another thing I focused on this week was code deduplication and simplifying
creation of new importers. To show off, I prepared a basic "Example" importer.
Below is the full C++ code. Bear in mind that aside from the code, importers
need a plugin's *.desktop file and a CMakeLists.txt file. Also bear in mind that
with this code the importer is already fully reconfigurable and instances of it
can be created and removed at the user's leisure.
This is a status update for my Google Summer of Code 2013 project -
implementing advanced statistics importers for Amarok. Please read the
first post if you would like to know more about the project.
Last week was free of surprises. I applied some more polish to both the
Importers framework and the concrete importers themselves, and deduplicated
some code - just general maintenance. I had a very technical to-do list
containing - mostly - very minor entries, so there's nothing to write about in
this post. The bottom line is: the overall quality of my project is improving.
The list keeps gaining new entries, so hopefully I'll still be having plenty to
do.
One thing that continues to give me a bit of trouble is the Amarok 2.x embedded
database importer. There are surprisingly many problems with managing a single
server process, shared between method calls, in a multithreaded environment.
The QProcess class has its own set of quirks on top of that which need to be
taken into account, especially when it comes to destroying the object in a
multithreaded environment. I've got ideas, but that's something for the next
post.
In other news: Banshee importer!
Alright, that's me done for the post. It's been a short one, but as I said - no
major problems, no major changes. A steady march to high quality.
As always, you can check out my progress on my public Amarok clone. The branch
is named gsoc-importers.