Archive for June 2005

How not to load-balance connections with accept()

It is a fairly well know that multiple threads in multiple processes can wait for TCP/IP connections on the same listening socket. What is not so well known are the nuances of implementation which affect the load balancing of such connections across processes and threads.
Here I’m going to share what I learned whilst working on improving the scalability of a popular 3rd party software product (which shall remain nameless).

Let’s not worry too much about how multiple processes come to be sharing the same listening socket (this could easily be achieved by inheritance across
fork(2)
or using
ioctl(2)
with I_SENDFD (see
streamio(7I))). Instead, let’s consider how connections are allocated to threads in such a situation.

The Relatively Easy Way

Threads blocking in
accept(3SOCKET)
join socket-specific queue in “prioritised FIFO” order. This is almost exactly what we want for round-robin allocation of connections!

However, it does pose a challenge when a thread “just happens” to have a low priority when it arrives at the queue. If having a low priority is a relatively rare occurrence, this could mean that threads thus afflicted suffer starvation (i.e. if there is always one or more higher priority threads queued whenever a connection arrived).

In the olden days this was a real issue for threads in the TS scheduling class. Of course, one solution would be to use a fixed priority scheduling class (such as FX).
However, what was needed was for threads sleeping in “prioritised FIFO” queues to have their priority and queue position recalculated on a regular basis (in the same manner as runnable threads waiting for a CPU).

The bug 4468181 low priority TS threads on a sleep queue can be victimised was first fixed in Solaris 9. The fix has since been made available for Solaris 8, Solaris 7, and Solaris 2.6 via patches, but in these cases it has to be enabled by adding the following to /etc/system:

set TS:ts_sleep_promote=1

The fix is enabled by default in Solaris 9 onwards, and once it is in place connections to a specific listening socket will be allocated to threads on a near round-robin basis.

The Extremely Hard Way

For whatever reason (“it seems to work ok on AIX” is not a good one!) some folk like to put their listening sockets into non-blocking mode, and then to wait for connections in
select(3C)
or
poll(2)
– it is important to note that in Solaris, select() is implemented using poll() in any case.

This is all fine and dandy, except that Solaris uses LIFO for the poll sleep queues. Whilst this suits thread worker pools very well (because it helps keep caches warm) it is precisely not what one needs for the round-robin load-balancing of connections between threads.

Of course, the “prioritised FIFO” queueing of accept() and the LIFO queueing of poll() is either not specified or it is classified as “implementation specific” by the standards. Solaris does what it does because it seems to please most of the people most of the time. However, if you assume FIFO you can be in for a nasty surprise!

Actually, having multiple threads waiting in poll() for connections on the same listening socket is a pretty dumb idea. The reason is that a thundering herd will occurr each time a connection comes in.

Some implementations actually result in every thread coming right out of poll() and diving into accept() where only one will succeed. Solaris is a little smarter in that once one thread has picked up a connection, other threads still in poll() will go back to sleep.

Desparate Situations, Desperate Measures

If you have an application which is doing things the extremely hard way, then the best solution is a complete rewrite! However, if this is not an option open to you “The Harman House Of Hacks” is proud to present the following:

#define _REENTRANT
#include <sys/types.h>
#include <sys/socket.h>
#include <dlfcn.h>
#include <stdlib.h>
static int (*orig_accept)();
#pragma init(iaccept_init)
#pragma weak accept=_accept
static void
iaccept_init()
{
orig_accept = (int(*)())dlsym(RTLD_NEXT, "_accept");
}
static int
_accept(int s, struct sockaddr *addr, socklen_t *addrlen)
{
int res;
(void) poll(0, 0, drand48() * 21.0);
res = orig_accept(s, addr, addrlen);
return res;
}

This is intended to be compiled and used something like this:

$ cc -K pic -G -o iaccept.so iaccept.c
$ LD_PRELOAD=./iaccept.so brokenapp

This works by interposing a new implementation of accept() which does a short random sleep before diving into the standard implementation of accept(). This means that the thread at the head of the list socket’s sleep queue is nolonger likely to be the first thread into accept().

Conclusion

Of course, this is a rather blunt instrument which should not be applied universally — or lightly. However, although more threads in the thundering herd are likely to come all the way out of poll() this hack does a very nice job of scambling the sleep queue order. For the application for which the above hack was written, the latency added to accept() was not an issue. But more importantly, connections were very evenly balanced across processes.

Technorati Tags:
,

data != information != knowledge != wisdom != truth

Whilst reading this fun posting on the subject of Intelligent Design (surely, something we in the OpenSolaris universe know a lot about!) I was taken with the blog’s heading:

  • DATA IS NOT INFORMATION
  • INFORMATION IS NOT KNOWLEDGE
  • KNOWLEDGE IS NOT WISDOM
  • WISDOM IS NOT TRUTH

I like it because it speaks to little things (such as the blogosphere), and big things (such as
the answer to the ultimate question of life, the universe, everything).

Another posting
explains the caption’s origins. As a blog disclaimer, it is rather more useful than this:

THE INDIVIDUALS WHO POST HERE WORK AT SUN
MICROSYSTEMS. THE OPINIONS EXPRESSED HERE
ARE THEIR OWN, ARE NOT NECESSARILY REVIEWED
IN ADVANCE BY ANYONE BUT THE INDIVIDUAL
AUTHORS, AND NEITHER SUN NOR ANY OTHER PARTY
NECESSARILY AGREES WITH THEM.

A Blogging Roller-Coaster Ride!

Why do I spend so much time fighting Roller? Why can’t I just get on and do my blogging thang? My patience is wearing thin. Shouldn’t the default environment just work? Why do I have to become and HTML/CSS expert? I didn’t sign up for this!

And judging by the quality of the OpenSolaris blogs, I’m not alone in my frustation. The content is great, but the presentation is often poor.

So, in the absence of better advice I have decided on the following:

  • I will use the “Basic” theme with minimal hacks (see below)
  • I will hijack <h4> … </h4> for my own subheading use
  • I will mark all my paragraphs with <p> … </p> containers
  • I will never use <p> as a separator
  • I will spurn the use of <br> separators
  • I will employ <pre class=”code”> … </pre> for code quotes

I’ve modified the “Basic” theme’s default _css file to ensure that “h4″ headings have margins which match standard “p” paragraphs, and that preformatted code has a readable font size …

#set( $theme = "basic" )
#parse("/WEB-INF/classes/themes/css.vm")
<style type="text/css" media="print">
.rWeblogCategoryChooser{ visibility: hidden; }
.entries{ width: 100%; float: none; }
.rightbar{ visibility: hidden; }
</style>
<style type="text/css">
h4 { margin:10px; }
pre.code { font-size: 10pt; }
</style>

I didn’t like the way the “Basic” theme’s default _day file adds <p> … </p> containers around each day entry because this forces nesting of the “p” tags within the final HTML. Nor did I like the way the entry title was just marked with the “b” tag. So this is what I now use …

<div class="box">
<div class="entry">
#showDayPermalink( $day )
#showEntryDate( $day )
</div>
#foreach( $entry in $entries )
<a name="$utilities.encode($entry.anchor)" id="$utilities.encode($entry.anchor)"></a>
<h3>$entry.title</h3>
#showEntryText($entry)
<span class="dateStamp">(#showTimestamp($entry.pubTime))</span>
#showEntryPermalink( $entry )
#showCommentsPageLink( $entry )
#end
#showLinkbacks( $day )
</div>

For such an advanced blogging environment I find it absolutely incredible that I have to edit both the bookmarks database and the main weblog template jsut to get rid of the default (and useless) “News” links! Here’s my revised “Basic” theme weblog template …

<div class="entries">
<h1>#showWebsiteTitle()</h1>
<h2>#showWebsiteDescription()</h2>
#showWeblogCategoryChooser()
#showNextPreviousLinks()
#showWeblogEntries("_day" 15)
</div>
<div class="rightbar">
<h2>Calendar</h2>
<div class="sidebar">
#showWeblogCalendar()
</div>
<h2>RSS Feeds</h2>
<div class="sidebar">
#showRSSBadge()<br>
#showRSSLinks()
</div>
<h2>Search</h2>
<div class="sidebar">
#showSearchForm()
</div>
<h2>Links</h2>
<div class="sidebar">
#showBookmarks("My Sites" true false)
#showBookmarks("Blogroll" true false)
</div>
<h2>Navigation</h2>
<div class="sidebar">
#showCssNavBar()
</div>
</div>

Well, there you are. That’s what I did. I just wish it was a simple matter for others to use the above, but unless you are also using the “Basic” theme it almost certainly won’t work for you! However, the principles should be applicable to other themes.

I have also taken the precaution of downloading the current Roller distro just incase someone fiddles with the “Basic” theme in a later release. Sigh!

Caring For The Environment – Making getenv() Scale

Although a relatively minor contributor to href="http://opensolaris.org">OpenSolaris I still have the
satisfaction of knowing that every href="http://www.sun.com/software/solaris/10/">Solaris 10 process
is using my code. But who in their right mind needs href="http://docs.sun.com/app/docs/doc/816-5168/6mbb3hr8i?a=view">getenv(3C)
to scale? Of course if you don’t care about thread safety (as is
currently the case with href="http://www.gnu.org/software/libc/libc.html">glibc version
2.3.5 — and hence with Linux) your implementation might scale very
nicely thankyou!

Sun on the other hand does
care about thread safety (and we’ve been doing so for a href="http://blogs.sun.com/roller/page/pgdh/20050511#initial_thread1">long
time). However we had rather assumed that no one in their right
mind would need getenv() to
scale, so our implemetation was
made thread safe by putting a dirty great mutex lock around every piece
of code which manipulates environ.
After all, as our very own Roger Faulkner is so fond of saying: style="font-style: italic;">“Correctness is a contraint; performance
is a goal”. And who cares about getenv()
performance anyway?

But Who Really Cares?

Well it turns out that there are some significant applications which
depend on getenv()
scalability (and which scale wonderfully on Linux … where thread
safety is often ignored … they are just very lucky
that no one seems to be
updating environ whilst anyone else is reading it). So href="http://blogs.sun.com/barts">Bart Smaalders filed bug href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4991763">4991763
getenv doesn’t scale and
said he thought it was an excellent opportunity for my first putback.
Thanks Bart!

For href="http://blogs.sun.com/roller/page/pgdh/20050524#if_linux_is_faster_it">some
time I’ve been saying: “If Linux
is faster, it’s a Solaris bug!”

but somehow 4991763 didn’t
quite fit the bill. Firstly, I think an application which depends on style="font-weight: bold;">getenv() scalability is broken.
Secondly, Linux is just feeling
lucky, punk
. However, I do firmly believe
that we should do all we can to ensure that Solaris runs applications
well — even those which really need some tuning of their own. I had
also
been itching for an chance to explore opportunities for lockless
optimisations in libc, so all in all 4991763
was an opportunity not to be missed!

A Complete Rewrite

The existing implementation of getenv(),
putenv(), style="font-weight: bold;">setenv(), and style="font-weight: bold;">unsetenv() was split across three
files (getenv.c, putenv.c and nlspath_checks.c) with the global mutex style="font-weight: bold;">_libc_environ_lock being defined
elsewhere. Things had become fairly messy and inefficient so I decided
on a complete rewrite.

NLSPATH security checks had
introduced yet another global variable and a rather inefficient dance
involving a mutex on every {get,put,set,unset}env()
call just to ensure that clean_env()
was called the first time in. In this instance it
was an easy matter to remove the mutex from the fast path by doing a
lazy check thus:

static mutex_t                  update_lock = DEFAULTMUTEX;
static int                      initenv_done = 0;
char *getenv(const char *name)
{
char                    *value;
if (!initenv_done)
if (findenv(environ, name, 1, &value) != NULL)
return (value);
return (NULL);
}

The test was then repeated under the protection of the mutex in style="font-weight: bold;">initenv() thus:

extern void                     clean_env();
static void
initenv()
{
lmutex_lock(&update_lock);
if (!initenv_done || ... ) {
/* Call the NLSPATH janitor in. */
clean_env();
.
.
.
initenv_done = 1;
}
lmutex_unlock(&update_lock);
}

By rolling putenv.c into href="http://cvs.opensolaris.org/source/xref/usr/src/lib/libc/port/gen/getenv.c">getenv.c
I was able to eliminate the use
of globals altogether, which in turn allowed the compiler to produce
better optimised code.

Look, No Locks!

But the biggest part of the rewrite was to make the fast path of style="font-weight: bold;">getenv() entirely lockless. What is
not apparent above is that findenv()
is entirely lockless.

Various standards define the global environment list pointer:

extern char **environ;

This has to be kept as a consistent NULL terminated array of
pointers to NULL terminated strings. However, the standards say nothing
about how updates are to be synchronised. More recent standards forbid
direct updates to environ
itself if getenv() and friends
are being used.

Yet the requirement that environ
is kept consistent is precisely
what we need
to implement a lockless findenv().
The big issue is that whenever the environ list is updated, anyone else
in the process of scanning it must not see an old value which has been
removed, or miss a value which has not been removed.

The traditional approach is to allocate an array of pointers, with
environ pointing to the first element. When someone needs to a new
value to the environment list we simply add it to the end of the list.
But how do we go about deleting values? And what if we need to add a
new value when the allocated array is already full? If you care about
threads, it’s not long before you need to introduce some locking!

The
new implementation contains two “smarts” which meets these challenges
without introducing locks into the getenv()
path …

Double It And Drop The Old One On The Floor

When the new implementation needs a bigger environ list, it simply
allocates a new one which is twice as large and copies the old list
into it. The old list is never reused — it is left intact for the
benefit of anyone else who might happen to be still traversing it.

This may sound wasteful, but the environment list rarely needs to be
resized. The wastage is also bounded — it is quite easy to prove
mathematically that
this strategy never consumes more than 3x the space required by an
array of optimal size.

However, one teeny weeny issue with the style="font-style: italic;">“drop it on the floor” approach is
that leak detectors can get a tad upset if they find allocated memory
which is not referenced by anyone. With a view to keeping me on the
straight and narrow — but mostly to avert high customer support call
volumes — Chris
Gerhard
recommended that I keep a linked list of all internally
dropped memory (just to keep those goody-goody leak detectors happy).

I first met Chris in 1989. He was on my interview panel when I
joined Sun. I do hope he feels he did a good job that day!

Overwrite With The First Entry And Increment

I was bouncing some other getenv()
ideas around with Chris when he also gave me
just what I needed for deletes. The old code just grabbed the global
lock, found the element to be deleted, shifted the rest of the list
down one slot (overwriting the victim), and then released the lock.

Chris had the bright idea of copying the first element in the list
over
the victim, and then incrementing the environ
pointer itself. The worst case would be that the same element might be
seen
twice by another thread, but this is not a correctness issue.

This led to two further changes:

  1. New values are now added at the bottom of the environment list
    (with
    the environ pointer being
    decremented once the new value is in place).
  2. When a new doube-sized environment list needs to allocated, the
    old one is copied into the top
    of the new one (instead of the bottom) so that the list can then be
    grown downwards.

OK, Not Entirely Lockless

Obviously mutex lock protection is still needed to serialise all
updates to the environment list. The new implementation has a lock for
all seasons: update_lock (e.g.
for updating initenv_done and
for protecting environ
itself). However the new getenv()
is entirely lockless (i.e. once clean_env()
has been called once).

Another important issue is that it is considered bad practice for
system libraries to hold a lock while calling style="font-weight: bold;">malloc(). For this reason the first
two thirds of addtoenv() are
inside a for(;;) loop. If it
is necessary to allocate a larger environment array style="font-weight: bold;">addtoenv() needs to drop style="font-weight: bold;">update_lock temporarily. However
this opens a window for another thread to modify environ in such a way
that means we must retry. This loop is controlled by a simple
generation number environ_gen
(also protected by update_lock).

Guaranteeing Consistency

That’s almost all there is to it. However in multiprocessor systems we
still have to make sure that memory writes on one CPU happen in such a
way that they don’t appear out of sequence on another CPU. Of course
this is taken care of automatically when we use mutex locks.

Consider the following code fragment to insert a new item:

environ[-1] = string;
environ--;

It is vitally important the two memory writes implied by this are seen
in the same order to every CPU in the system. On SPARC today this
doesn’t matter since all ABI compliant binaries run in Total Store
Order mode (i.e. stores are guarantted to be visible in the order in
which they are submitted). But is it possible that future systems will
used a more relaxed memory model.

However, this is not just a hardware issue, it is also a compiler
issue. Without extra care the compiler might reorder the two stores,
since the C language cares nothing for threads. I had quite a long
discussion with a number of folk concerning “sequence points” and the
use of volatile in C.

The eventual solution was this:

environ[-1] = string;
membar_producer();
environ--;

First, the function membar_producer()
serves as a sequence point, guaranteeing that the C compiler will
preserve the order of the preceding and following statements. Secondly,
it provides any store barriers needed by the underlying guarantee the
same effect as Total Store Order for the preceding and following
instructions.

A Spirit Of Generousity

My new implementation was integrated into style="font-weight: bold;">s10_67 yet despite my own
extensive testing it caused a number of test suites to fail in later
testing. This was
tracked down to common test suite library code which updated style="font-weight: bold;">environ directly. Yuck! Although
this kind
of thing is very much frowned on by the more recent standards it was
felt that if our own test cases did it there was a good chance that
some real applications did it too. So with some reluctance I filed style="font-weight: bold;"> href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6183277">6183277
getenv should be more gracious to
broken apps.

If someone else if going to mess with environ
under your nose there’s not a lot you can do about it. However it is
fairly easy to detect the majority of cases (except for a few really
nasty data races) by keeping a private copy style="font-weight: bold;">my_environ which can be compared
with the actual value of environ.
If these two values are ever found to differ we just go back to square
one and try again.

So the above fragment for adding an item now looks more like this:

if (my_environ != environ || ...)
initenv();
.
.
.
my_environ[-1] = string;
membar_producer();
my_environ--;
environ = my_environ;

Conclusion

My second putback integrated into s10_71.
Following this I had to fight off one other challenge from href="http://blogs.sun.com/rie">Rod Evans who filed href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6178667">6178667
ldd list unexpected (file not
found) in x86 environment. However this turned out to be not my fault, but a latent
buffer overrun in ldd(1) exposed by different heap usage patterns. Of course, the engineer who introduced this heinous bug (back in January 2001) will remain
anonymous (sort of). Still, he did buy me a beer!

Of course, the serious point is that when you change something which is used by everyone, it is possible that you expose problems elsewhere. It is to be expected that the thing last changed will get the blame. Such changes are not for the faint-hearted, but they can be a whole lot of fun!

My first experience of modifying Solaris actually resulted in two
putbacks into the Solaris gate, but I learnt a great deal along the
way. Dizzy with my success, I am now actively seeking other
opportunities for lockless optimisations in libc!

Technorati Tags:
,