POSIX threading model aside for a moment... C++ will finally allow an
expert to create highly efficient, portable non-blocking algorithms that
can indeed scale up to 32 cores and beyond. [...]
However, I agree that there are very few experts who can actually create
these types of exotic algorithms. I personally don't have a problem, and
have been creating and implementing scalable synchronization techniques
for years, but that puts me in a fairly narrow minority. Oh well.
Yes and no. The answer to your other remark is the killer:
I am all for NUMA models that have a _very_ weak cache coherency mechanism;
AFAICT, it's basically the only way to scale. Luckily, for me anyway,
C/C++ can address these architectures quite nicely.
Oh, no, they can't! That's precisely the problem.
This is a _major_ cop-out, but I do indeed make heavy use of compiler- and
architecture-specific techniques/guarantees to get the job done. For
instance, I create most of my sensitive synchronization algorithms in
externally assembled libraries and link them into a C program, with
link-time optimizations turned off, of course. So, I should have really said
that assembly language, and some specific C/C++ compilers (e.g., GCC), can be
used to address NUMA models with weak CC. You can get some degree of
portability this way, but it's definitely not fully portable in any way,
shape or form.
It can be a pain to port synchronization algorithms to new architectures
because I have to rewrite all of the damn assembly language files, and then
_hope_ a C compiler that gives me the guarantees I need will be available.
Basically, if you like to juggle running chainsaws, and you have patience,
you can use C/C++ and ASM to bring great scalability, throughput and
performance characteristics to concurrent programs.
The new C++ standard
will move some way towards that - but ONLY if you use none of the C
features in C++ (including cstring and, even worse, some C++ features
that inherit their semantics from C).
Yeah. I am mostly interested in the fairly fine-grained memory barriers,
specifically the relaxed barriers and data-dependent loads, that should be
incorporated into the standard. It pleases me to know that Paul E. McKenney
is giving his advice in the development process...
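
To make the "fine-grained barrier" idea concrete, here is a minimal sketch of
a one-shot handoff that uses release/acquire ordering instead of a lock or a
full sequentially consistent fence. It assumes the atomics interface that was
later standardized as C++11's <atomic>; the names g_slot, g_ready, publish,
consume_value and run_handoff are hypothetical and only serve the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Message { int payload; };

static Message g_slot;                    // plain, non-atomic shared data
static std::atomic<bool> g_ready{false};  // publication flag (hypothetical name)

// Producer: write the data first, then publish it with a release store.
// The release ordering keeps the payload write from moving after the flag.
void publish(int value) {
    g_slot.payload = value;
    g_ready.store(true, std::memory_order_release);
}

// Consumer: spin on acquire loads; once the flag is seen, the payload
// written before the matching release store is guaranteed to be visible.
int consume_value() {
    while (!g_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return g_slot.payload;
}

// Runs one producer and one consumer and returns what the consumer saw.
int run_handoff(int value) {
    // Relaxed is enough here: no other thread exists yet.
    g_ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread consumer([&seen] { seen = consume_value(); });
    std::thread producer([value] { publish(value); });
    producer.join();
    consumer.join();
    return seen;
}
```

On a weakly ordered machine the release/acquire pair only has to order the
accesses around this one flag, which is the kind of cost control that a
seq_cst-only model would not offer.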
The point is that there is no consistent consistency model for either
C or POSIX,
POSIX guarantees absolutely nothing if you don't use locks to guard any
access to shared data. So, if you follow the standard, it can be extremely
difficult to scale. There are some things you can do, but they have their
limitations:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a23aeb712e8dbdf9
This seems to scale better than most native POSIX rw-locks; however, the
overhead of write access is increased. Or:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/776f6842784072f2
This allows for concurrent mutations; however, it has limitations wrt
traversals:
http://groups.google.com/group/comp.programming.threads/msg/356576741aed8f06
http://groups.google.com/group/comp.programming.threads/msg/3cd613e8b72e5ace
I could easily use RCU to manage the traversal, but then I lose all sense of
portability. Therefore, I conclude that it's very difficult, if not
impossible, to scale using 100% pure PThreads.
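
For reference, the strictly portable baseline described above, where every
access to shared data goes through a lock, might look like the following
sketch. It uses the standard POSIX rw-lock calls; the names g_lock, g_shared,
read_shared and write_shared are illustrative and are not taken from the
linked threads, and error checking is omitted for brevity.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical shared state guarded by a POSIX read-write lock.
static pthread_rwlock_t g_lock = PTHREAD_RWLOCK_INITIALIZER;
static int g_shared = 0;

// Readers take the lock in shared mode. Even so, every reader still
// performs a locked operation on g_lock, which is where scaling suffers.
int read_shared() {
    pthread_rwlock_rdlock(&g_lock);
    int value = g_shared;
    pthread_rwlock_unlock(&g_lock);
    return value;
}

// Writers take the lock exclusively and block all readers meanwhile.
void write_shared(int value) {
    pthread_rwlock_wrlock(&g_lock);
    g_shared = value;
    pthread_rwlock_unlock(&g_lock);
}
```

This is fully within the POSIX guarantees, but all readers contend on the
lock word itself, so read-mostly workloads tend to stop scaling well before
the data does.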
and the C++ addresses only the pure C++ aspects.
Yes; you're right.
Unless
it has been vastly extended since I tried to get that aspect addressed.
I haven't been following the cpp-threads group lately. I have made some
comments on that list. Sadly, it does seem like there are a few people on
there who don't see a need for fine-grained membars; they seem to think that
sequential consistency is all that is needed because the only programmers
who would ever use fine-grained barriers are hard-core thread monkeys, who
are few and far between. It's good that Paul E. McKenney seems to have
successfully convinced them that data-dependent load barriers are an
essential tool.
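
As an illustration of a data-dependent load, here is a small sketch of
pointer publication, assuming the memory_order_consume ordering that ended up
in C++11's <atomic>. The names Config, g_config, install and read_threshold
are hypothetical. Note that compilers commonly implement consume as the
stronger acquire, so treat this as a statement of intent rather than a
guaranteed cheaper barrier.

```cpp
#include <atomic>
#include <cassert>

struct Config { int threshold; };

// Hypothetical published pointer; nullptr means "nothing installed yet".
static std::atomic<Config*> g_config{nullptr};

// Writer: fully initialize *fresh before calling, then publish the pointer.
void install(Config* fresh) {
    g_config.store(fresh, std::memory_order_release);
}

// Reader: the dereference depends on the loaded pointer value, so only
// dependency ordering is required, not a full acquire barrier.
int read_threshold(int fallback) {
    Config* current = g_config.load(std::memory_order_consume);
    return current ? current->threshold : fallback;
}
```

This is the same access pattern an RCU-style reader relies on: the reader
issues no explicit barrier on most architectures and simply follows the
pointer it loaded.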