Almost an author

My name has made it into a number of books, but has yet to make it onto the front cover. In addition to
directing the photo shoot for the covers of the second edition of Solaris Internals (that T2000 prototype
was never the same again), and of Solaris Performance and Tools (DTrace is child’s play, even Jon Haslam
can do it), I also merited a special mention for contributions to the chapter on the Solaris process model.

To my shame, I have, at times, referred some of my more truculent customers to the “acknowledgments”
sections of these, and of Cockcroft’s Sun Performance Tuning (second edition), with a cursory “I taught
them all they know, so shut up, and do as I say!” This always has the desired effect (some have even
asked me to sign their copy, and if you can find one I haven’t signed, it is worth a fortune)!

The book I don’t tend to mention is boohoo – a dot.com story from concept to catastrophe where I am
erroneously credited with the sale of an E10K. My actual advice was “fix your code, because as it is, it
won’t scale on an E10K” (but even that wasn’t enough to save them from disaster).

“Ambassador, you’re really spoiling us!”

I was an OS Ambassador from before Solaris 2.0 shipped, and I still have the golden edition signed by Bill
Joy to prove it! In my biased opinion, OS Ambassadors has been the most successful of the ambassador
programmes, bringing tangible value to the field and engineering alike. Our conferences became a forum
for change, and sometimes served as a watering hole for different engineering groups, working on similar
projects in total isolation (we were excellent match-makers).

As my experience and confidence grew, I became more vocal and more of a driver. When folk like
Richard McDougall (who can forget his VxVM vs SDS coin?) moved on to other things and the original
Ambassador Group Boards were formed, I joined the leadership team. I too moved on when I joined PAE,
but I maintained “honorary ambassador” status until returning to the field.

Back in the UK, Chris Gerhard and I started “uk-solaris” as a forum for all with a technical interest
in Solaris from various field and engineering roles. At our first meeting we “treated” everyone to
Ferrero Rocher “chocolates”, which provoked the famous line from one of the cheesiest TV adverts ever
(preserved for posterity here).

Putting something back

My first two putbacks into Solaris were a huge learning experience, and my respect for those who do this
kind of engineering day in, day out grew immensely. I’d recommend the experience to anyone who needs
a better understanding of the process …

  • 4991763 getenv doesn’t scale
  • 5105528 fix for 4915617 breaks simple multiprocess rwlock test case
  • 5105683 fix for 4915617 should be kinder with uncontended shared rwlocks
  • 6209711 thread error detection false positives possible with shared mutexes

libMicro: we scare because we care

In some ways, libMicro was a reaction to LMbench (which Bart Smaalders and I considered unscientific
and a pain in the neck), but we really wanted to write a useful tool which could produce compelling data
to drive improvements in Solaris. The result has exceeded our expectations dramatically. Not only has
libMicro produced data for many “Linux is faster than Solaris at xxx” bugs, but it also kick-started Sun’s
interest in the AMD Opteron processor (as well as helping the adoption of SPARC64).

libMicro also has the distinction of being one of the first open source projects hosted under Mercurial
on the opensolaris.org collaboration website. It is still used extensively within Sun, and the code has also
proven to be a useful reference for those wanting to write multithreaded applications. Today libMicro can
be found alive and well here, and even our competitors are using it!

PRISM and the patent

Before Solaris could have large page support for program text and data, we needed a business case.
PRISM stands for Process Relocation in Intimate Shared Memory, and was my first big innovation whilst
in PAE. The idea is simple: stop the process, copy a region of small pages somewhere, unmap the source
region, remap the source region with large pages, copy the data back, and then allow the process to
continue. At the time ISM was the only source of large pages.

My first solution used the LD_PRELOAD shared library interposition technique, but quickly moved on to
LD_AUDIT interposition because this provides more fine-grained control. Operating at process startup
(with the inclusion of an optional dummy malloc() and free() to preallocate the heap before the relocation
took place), PRISM generated plenty of useful data to fuel the MPSS and Large Pages OOB projects. It
also highlighted the usefulness of local copies of readonly text and data for large scale NUMA machines.

The PRISM library helped some of our published CPU benchmark numbers, and so had to be shipped
with some versions of our compilers. This triggered the patent filing process, with my patent finally being
awarded a year or so later.

About five years later, with MPSS and Large Pages OOB in place, I revisited the PRISM idea with Shatter,
a tool to break up large pages into smaller ones. This contributed part of Nicolai Kosche’s dataspace
profiling initiative (DProfile), which was trying to understand the effect of page colouring on performance.

A brief history of threads

Before joining PAE (Performance and Availability Engineering), I worked with a major European database
vendor on their kernel scalability (on behalf of a mutual customer, a leading media company). We were
fighting limitations in an aging implementation of Sun’s pioneering two-level thread model (something
which became known as “old and broken libthread”). During one of my OS Ambassador trips, I visited
Bryan Cantrill and Roger Faulkner, and discovered that Bryan had sketched and Roger had prototyped a
new implementation based on a one-level model. I then used the customer as the business case for
introducing the one-level “alternate” implementation in Solaris 8 (under /usr/lib/lwp).

By the time I joined PAE, the new implementation had gained quite a reputation for fixing scalability and
stability problems with many multithreaded applications. PAE had many fans of the two-level concept,
so I found myself immediately in conflict with some of my new colleagues. But I stuck to my guns and
was able to win most of them over to the one-level model. I then worked with Roger, Bart Smaalders and
others to make the one-level model the only implementation in Solaris 9. Part of my contribution to
this effort was to write the technical whitepaper Multithreading in the Solaris Operating Environment:

  • The original version on www.sun.com [pdf]
  • The revised version as presented at SUPerG [pdf]

This paper has become a widely quoted document of how we do multithreading, and is still relevant
today. Of course, the new thread implementation paved the way for Roger’s 1600 file putback to unify the
Solaris process model, making threads first class citizens in Solaris – something Linux may actually never
achieve!

Education

I don’t consider myself an academic, but I did manage a “Desmond” honours degree in Microelectronics
and Computing from the University of Wales, Aberystwyth. It was there that I first fell in love with BSD
UNIX (on a VAX 11/750), and there that I saw my first Sun workstation (a 2/120, although I was never
allowed to use it).

I have since maintained an active interest in the education market, because I feel it is a natural recruiting
ground for future Sun employees and customers. Indeed, I have recruited at least three people into Sun
from Aberystwyth alone. Over the years I have worked with the Universities of Aberystwyth, Bangor,
Bradford, Dundee, Durham, Leeds, Liverpool, Manchester, Oxford, St Andrews, Salford, Warwick and
York, most recently on Sysadmin day conferences in Aberystwyth and Manchester.

Job rotations

An invaluable part of my personal development at Sun has been the SE Job Rotation programme, which
was run by Barbara Hill (or “Mom” as she was affectionately known by many of us “on rotation”) …

  • Siebel scalability and load balancing (MDE, 2 weeks)
  • Oracle 7 on Solaris x86 and Windows NT (OPG, 2 weeks)
  • BaaN scalability (MDE, 9 weeks)
  • Many users project (PAE, 4 weeks)

These job rotations gave me the opportunity to learn alongside thought leaders such as Adrian Cockcroft,
Allan Packer, Bob Larson, Brian Wong, Dan Powers, Jim Mauro, Mike Briggs and Richard McDougall. They
also gave me my first real contact with Solaris engineering, and paved the way for some of the jobs I’ve
moved through over the years.

The GORB (Giga Object Request Broker)

One of the first solutions designed to make full use of the 64 CPUs and 64GB of the Enterprise 10000 …
in a single process … using threads. This was a collaborative R&D project with a large telco, which resulted
in at least one patent being filed. My role was to deliver the multithreading expertise needed to make this
fly.

Solving solitaire with a sledgehammer

I’m only including this example because it was so outrageous! A government agency was investigating
various hardware platforms for (what I guess was) HPC cryptographic applications. Obviously, they were
not sharing any of their actual code, but instead specified a number of number-crunching challenges for
large scale multiprocessors. I took up the challenge of Solitaire with a 16-way E6000 and more than 8GB
of RAM.

    o o o           * * *           o o o           5 6 7           5 6 7
    o o o           * * *           o o o           2 3 4           2 3 4
o o o o o o o   * * * * * * *   o o o o o o o       0 1         7 4 0 1 0 2 5
o o o o o o o   * * * o * * *   o o o * o o o                   6 3 1 x 1 3 6
o o o o o o o   * * * * * * *   o o o o o o o                   5 2 0 1 0 4 7
    o o o           * * *           o o o                           4 3 2
    o o o           * * *           o o o                           7 6 5
Fig.1           Fig.2           Fig.3           Fig.4           Fig.5

In Figs.1-3 an “o” represents a hole, and a “*” a marble in a hole. Thus, Fig.1 shows the empty board,
Fig.2 the starting position (32 marbles), and Fig.3 the target end position. My solution was to encode
each board position as 33 bits, with 0 for an empty hole and 1 for a marble. Threads in a worker pool took
known board positions from a work pile and found all possible new moves, which were then added to the
work pile. Exploiting rotational and reflectional symmetry, each new board position becomes up to eight
possible board positions.
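
For anyone who wants to see the shape of the work pile idea, here is a minimal sketch in Python. It is not the original code: it is single-threaded rather than using a pool of worker threads, it numbers the 33 holes simply row by row, and it keeps seen positions in an ordinary set. All the names (HOLES, JUMPS, successors, explore) are made up for illustration.

```python
from collections import deque

# The 33 holes of the cross-shaped board, numbered row by row
# (an assumed numbering, chosen only to keep the sketch short).
HOLES = [(r, c) for r in range(7) for c in range(7) if 2 <= r <= 4 or 2 <= c <= 4]
INDEX = {cell: bit for bit, cell in enumerate(HOLES)}
CENTRE = 1 << INDEX[(3, 3)]

# Every possible jump as a (source, jumped-over, destination) triple of bit masks.
JUMPS = []
for r, c in HOLES:
    for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        over, dest = (r + dr, c + dc), (r + 2 * dr, c + 2 * dc)
        if over in INDEX and dest in INDEX:
            JUMPS.append((1 << INDEX[(r, c)], 1 << INDEX[over], 1 << INDEX[dest]))

START = ((1 << 33) - 1) & ~CENTRE   # Fig.2: every hole filled except the centre
TARGET = CENTRE                     # Fig.3: a single marble in the centre


def successors(board):
    """Yield every board reachable from `board` in one jump."""
    for src, over, dest in JUMPS:
        if board & src and board & over and not board & dest:
            # Clear the source and jumped-over marbles, set the destination.
            yield board ^ (src | over | dest)


def explore(start, max_positions):
    """Expand positions from a work pile until it empties or the cap is reached."""
    pile = deque([start])
    seen = {start}
    while pile and len(seen) < max_positions:
        board = pile.popleft()
        for nxt in successors(board):
            if nxt not in seen:
                seen.add(nxt)
                pile.append(nxt)
    return seen
```

Calling explore(START, 1000) expands a bounded number of positions. Checking each new board against TARGET inside the loop would turn this into a (slow) solver.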

By coding the 33 bits as shown in Figs.4-5 I was able to make rotation a simple byte swap. Reflection is
harder, but only needs to be done once (since the remaining three reflections can be achieved by rotating
the first reflection). The really extravagant part was adding an 8GB char array (one byte for each of the
2^33 possible encodings) to record board positions that had already been seen (and which therefore
didn’t need to be explored again).
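
The encoding can be sketched in Python as follows. This is a reconstruction from Figs.4-5, not the original code. It assumes the arms are stored as bytes in the order top, right, bottom, left (lowest byte first), with the centre hole in bit 32, and that the reflection is a left-right mirror. The helper names are invented for the example.

```python
# Bit numbers within the top arm's byte, as (row, column) -> bit (Fig.4).
TOP_ARM = {(2, 2): 0, (2, 3): 1, (1, 2): 2, (1, 3): 3, (1, 4): 4,
           (0, 2): 5, (0, 3): 6, (0, 4): 7}


def turn(cell):
    """Rotate a (row, column) cell a quarter turn clockwise about the centre."""
    r, c = cell
    return (c, 6 - r)


# Full layout (Fig.5): each arm is the previous one turned a quarter turn.
BIT = {(3, 3): 32}
for cell, bit in TOP_ARM.items():
    for arm in range(4):            # assumed order: 0 top, 1 right, 2 bottom, 3 left
        BIT[cell] = 8 * arm + bit
        cell = turn(cell)

# Bit permutation for a left-right mirror, precomputed once.
MIRROR = {BIT[(r, c)]: BIT[(r, 6 - c)] for r, c in BIT}


def encode(cells):
    """Pack a collection of occupied (row, column) cells into a 33-bit integer."""
    return sum(1 << BIT[cell] for cell in cells)


def decode(board):
    """Return the set of occupied (row, column) cells."""
    return {cell for cell, bit in BIT.items() if board >> bit & 1}


def rotate(board):
    """Quarter turn clockwise: rotate the four arm bytes, keep the centre bit."""
    arms = board & 0xFFFFFFFF
    centre = board & (1 << 32)
    return centre | ((arms << 8) & 0xFFFFFFFF) | (arms >> 24)


def reflect(board):
    """Mirror left to right, one bit at a time (the 'harder' operation)."""
    return sum(1 << MIRROR[bit] for bit in range(33) if board >> bit & 1)


def symmetries(board):
    """Yield all eight images: four rotations of the board and of its mirror."""
    for image in (board, reflect(board)):
        for _ in range(4):
            yield image
            image = rotate(image)


def canonical(board):
    """Pick one representative (the smallest) of the eight equivalent boards."""
    return min(symmetries(board))
```

One way to use this is to record only canonical(board) in the table of seen positions, so that all eight equivalent boards share a single entry. Whether the original marked one entry or all eight is not stated above.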

The result was a solution in just 2 seconds, with all possible board positions being found within 30
seconds. Sun hardware and my expertise in multithreading have moved on a lot in the intervening years.
I’m now itching to try the exercise again on a T5220!