What is lock striping in programming?

Lock-free multithreading is for real threading experts

I read through an answer Jon Skeet gave to a question and in it he mentioned the following:

For me, lock-free multithreading is for real threading experts, and I'm not one of them.

It's not the first time I've heard this, but I find very few people actually talk about how you do it when you want to learn to write lock-free multithreaded code.

So my question is: beyond learning about threading in general, how do you learn to write lock-free multithreaded code specifically, and what good resources are there?



Current "lock-free" implementations follow the same pattern most of the time:

  • * Read a state and make a copy of it **
  • * Change copy **
  • Perform an interlocked operation
  • If this fails, retry the process

(* optional: depending on the data structure / algorithm)
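The pattern above can be sketched in a few lines of Java (a minimal illustration, not taken from any answer here; `AtomicReference.compareAndSet` plays the role of the interlocked operation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class CasRetry {
    // Shared state, replaced wholesale on every update.
    private static final AtomicReference<List<Integer>> state =
            new AtomicReference<>(new ArrayList<>());

    public static void add(int value) {
        while (true) {
            List<Integer> current = state.get();           // 1. read the state
            List<Integer> copy = new ArrayList<>(current); // 2. make a copy
            copy.add(value);                               // 3. change the copy
            if (state.compareAndSet(current, copy)) {      // 4. interlocked op
                return;                                    // success
            }
            // 5. CAS failed: another thread got in first, so retry
        }
    }

    public static List<Integer> snapshot() {
        return state.get();
    }

    public static void main(String[] args) {
        add(1);
        add(2);
        System.out.println(snapshot()); // prints [1, 2]
    }
}
```

The retry loop in step 5 is exactly the "basic spinlock" the answer below points out.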

The last bit is eerily similar to a spinlock. In fact, it's a basic spinlock. :)
I agree with @nobugz on this: the cost of the interlocked operations used in lock-free multithreading is dominated by the cache and memory-coherency work they must perform.

However, what you gain with a "lock-free" data structure is that your "locks" are very fine-grained. This reduces the likelihood that two concurrent threads access the same "lock" (memory location).

Most of the time, the trick is that you don't have dedicated locks. Instead, you treat, for example, all the elements in an array or all the nodes in a linked list as a "spin-lock". You read, modify, and try to update if there has been no update since your last read. If there has, you retry.
This makes your "locking" (oh, sorry, non-locking :) very fine-grained without introducing additional memory or resource requirements.
Making it finer-grained lowers the probability of waiting. Making it as fine-grained as possible without introducing additional resource requirements sounds great, doesn't it?
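As a hedged sketch of that idea in Java (illustrative only): each slot of an `AtomicLongArray` acts as its own "lock" location, so two threads only contend when they touch the same index.

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class PerSlotUpdate {
    private final AtomicLongArray slots;

    public PerSlotUpdate(int size) {
        slots = new AtomicLongArray(size);
    }

    /** Adds delta to one slot; contention only happens on that slot. */
    public void add(int index, long delta) {
        while (true) {
            long seen = slots.get(index);     // read
            long updated = seen + delta;      // modify (the "copy")
            if (slots.compareAndSet(index, seen, updated)) {
                return;                       // no update since our read
            }
            // another thread updated this slot in the meantime: retry
        }
    }

    public long get(int index) {
        return slots.get(index);
    }
}
```

No lock objects are allocated at all, which is the "no additional memory or resource requirements" point above.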

Most of the fun, however, comes from ensuring that loads and stores are ordered correctly.
Contrary to one's intuition, CPUs are free to reorder memory reads and writes - they are very smart, by the way: you will have a hard time observing this from a single thread. You will, however, run into problems when you start multithreading on multiple cores. Your intuitions will break down: just because an instruction comes earlier in your code does not mean it will actually happen earlier. CPUs can process instructions out of order, and they do this especially for instructions with memory accesses, to hide main-memory latency and make better use of their caches.

Now, it is certain - against all intuition - that a sequence of code does not flow "from top to bottom"; instead it runs as if there were no sequence at all, and it may rightly be called "the devil's playground". I believe it is infeasible to give an exact answer as to which load/store reorderings will happen. Instead, one always speaks in terms of mays and mights and cans, and prepares for the worst. "Oh, the CPU might reorder this read to come before that write, so it's best to put a memory barrier right here."

Matters are complicated by the fact that even these mays and mights differ across CPU architectures. It may be the case, for example, that something that is guaranteed not to happen on one architecture might happen on another.
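To make the ordering problem concrete, here is a hedged Java sketch (my own illustration) of the classic publication idiom: without the `volatile`, the writes to `data` and `ready` could be reordered so that a reader on another core sees `ready == true` but stale `data`; the `volatile` flag inserts the barriers that forbid that.

```java
public class Publication {
    private int data;               // plain field: the payload
    private volatile boolean ready; // volatile: acts as the memory barrier

    public void publish(int value) {
        data = value; // 1. write the payload
        ready = true; // 2. volatile write: 'data' cannot be reordered after it
    }

    /** Returns the payload once published, or null before that. */
    public Integer tryRead() {
        // volatile read: if we see ready == true, we are guaranteed
        // to also see every write made before 'ready = true'
        if (ready) {
            return data;
        }
        return null;
    }
}
```

On a single thread you would never notice the difference, which is exactly why these bugs are so hard to observe.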

To do "lock-free" multithreading properly, you need to understand memory models.
Getting the memory model and its guarantees right is no trivial matter, however, as demonstrated by this story, where Intel and AMD made corrections to their documentation that caused a stir among JVM developers. As it turned out, the documentation developers had relied on from the beginning was not that precise in the first place.

Locks in .NET create an implicit memory barrier, so you are safe using them (most of the time, that is... see for example Joe Duffy - Brad Abrams - Vance Morrison's piece on lazy initialization, locking, volatiles, and memory barriers. :) (Be sure to follow the links on that page.)

As an added bonus, you will get introduced to the .NET memory model on a side quest. :)

There's also an "Oldie but Goldie" from Vance Morrison: What every developer needs to know about multithreaded apps.

... and of course, as @Eric mentioned, Joe Duffy is a definite read on the subject.

A good STM can get as close to fine-grained locking as it gets, and will likely provide performance close to or on par with a hand-crafted implementation. One of them is STM.NET from Microsoft's DevLabs projects.

If you're not a .NET-only zealot, Doug Lea did some great work in JSR-166.
Cliff Click has an interesting take on hash tables that does not rely on lock striping - as the concurrent hash tables in Java and .NET do - and appears to scale well up to 750 CPUs.
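Since the title asks what lock striping actually is, here is a minimal sketch (my own illustration, not Java's actual `ConcurrentHashMap` internals): instead of one lock guarding the whole table, the data is split into shards, each guarded by its own lock, and the shard is chosen by hashing the key. Threads touching different keys usually take different locks.

```java
import java.util.HashMap;
import java.util.Map;

public class StripedCounter {
    private static final int STRIPES = 16;
    private final Object[] locks = new Object[STRIPES];
    private final Map<String, Long>[] shards;

    @SuppressWarnings("unchecked")
    public StripedCounter() {
        shards = new Map[STRIPES];
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new Object();
            shards[i] = new HashMap<>();
        }
    }

    /** Picks a stripe from the key's hash; same key -> same stripe. */
    private int stripeOf(String key) {
        return (key.hashCode() & 0x7fffffff) % STRIPES;
    }

    public void increment(String key) {
        int s = stripeOf(key);
        synchronized (locks[s]) {          // lock only one of 16 stripes
            shards[s].merge(key, 1L, Long::sum);
        }
    }

    public long get(String key) {
        int s = stripeOf(key);
        synchronized (locks[s]) {
            return shards[s].getOrDefault(key, 0L);
        }
    }
}
```

With 16 stripes, up to 16 threads can update different keys concurrently; a single global lock would serialize them all.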

If you're not afraid to venture into Linux territory, the following article provides more insight into the internals of current memory architectures and how cache-line sharing can wreck performance: What Every Programmer Should Know About Memory.

@Ben made many comments about MPI: I sincerely agree that MPI can shine in some areas. An MPI-based solution can be easier to reason about, easier to implement, and less error-prone than a half-baked locking implementation that tries to be clever. (Subjectively, though, this is also true of an STM-based solution.) I'd also bet that it's light-years easier to write a decent distributed application correctly in, say, Erlang, as many successful examples suggest.

MPI, however, has its own costs and problems when it runs on a single, multi-core system. In Erlang, for example, there are issues to be solved around the synchronization of process scheduling and message queues.
Also, at their core, MPI systems usually implement a kind of cooperative N:M scheduling for "lightweight processes". This means, for example, that context switches between lightweight processes are inevitable. It is true that this is not a "classic context switch" but mostly a user-space operation, and it can be made fast. However, I sincerely doubt it can be brought under the 20-200 cycles an interlocked operation takes. User-mode context switching is certainly slower, even in the Intel McRT library. N:M scheduling with lightweight processes is not new. LWPs were there in Solaris for a long time. They were abandoned. There were fibers in NT. They are mostly a relic now. There were "activations" in NetBSD. They were abandoned. Linux had its own take on N:M threading. It seems to be somewhat dead by now.
From time to time, new contenders arrive: for example, McRT from Intel, or most recently User-Mode Scheduling together with ConcRT from Microsoft.
At the lowest level, they do what an N:M MPI scheduler does. Erlang - or any MPI system - might benefit greatly on SMP systems from exploiting the new UMS.

I guess the OP's question is not about the merits of, or subjective arguments for/against, any particular solution, but if I had to answer it, it would depend on the task: for building low-level, high-performance basic data structures that run on a single system with many cores, either low-lock/"lock-free" techniques or an STM will yield the best results in terms of performance, and would probably beat an MPI solution any time performance-wise, even if the above wrinkles get ironed out, e.g. in Erlang.
For building anything moderately more complex that runs on a single system, I'd perhaps choose classic coarse-grained locking or, if performance is of great concern, an STM.
For building a distributed system, an MPI system would probably make a natural choice.
Note that there are also MPI implementations for .NET (although they don't seem to be that active).

Joe Duffy's book, Concurrent Programming on Windows, covers this ground in depth.

He also writes a blog on these topics.

The trick to getting low-lock programs right is to understand, at a deep level, exactly what the rules of the memory model are for your particular combination of hardware, operating system, and runtime.

Personally, I am nowhere near smart enough to do low-lock programming correctly beyond InterlockedIncrement, but if you are, great, go for it. Just make sure you leave lots of documentation in the code so that people who are not as smart as you don't accidentally break one of your memory-model invariants and introduce an impossible-to-find bug.
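For reference, the Java analogue of .NET's `Interlocked.Increment` (my own mapping; the answer above names only the .NET call) is `AtomicInteger.incrementAndGet`: a single atomic read-modify-write with full ordering guarantees, and about the safest place to start with low-lock code.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class Counter {
    private final AtomicInteger hits = new AtomicInteger();

    /**
     * Atomic increment: safe to call from any number of threads
     * without a lock, just like Interlocked.Increment in .NET.
     */
    public int recordHit() {
        return hits.incrementAndGet();
    }

    public int total() {
        return hits.get();
    }
}
```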

There is no such thing as "lock-free threading" these days. It was an interesting playground for academics and the like at the end of the last century, when computer hardware was slow and expensive. Dekker's algorithm was always my favorite; modern hardware has put it out to pasture. It doesn't work anymore.

Two developments have put an end to this: the growing disparity between RAM and CPU speeds, and the ability of chip manufacturers to put more than one CPU core on a chip.

Because of the RAM speed problem, chip designers had to put a buffer on the CPU chip. The buffer stores code and data that the CPU core can access quickly; reading from and writing to RAM is much slower. This buffer is known as the CPU cache, and most CPUs have at least two of them. The first-level cache is small and fast, the second is large and slower. As long as the CPU can read data and instructions from the first-level cache, it runs fast. A cache miss is very expensive: it stalls the CPU for up to 10 cycles if the data is not in the first-level cache, and for as many as 200 cycles if it is not in the second-level cache either and must be read from RAM.

Each CPU core has its own cache; it stores its own "view" of RAM. When the CPU writes data, it writes to the cache, which is then, slowly, flushed to RAM. Inevitably, each core will at some point have a different view of the contents of RAM. In other words, one CPU does not know what another CPU has written until that RAM write cycle has completed and the CPU refreshes its own view.

That is dramatically incompatible with threading. It always matters very much to you what the state of another thread is when you must read data that was written by another thread. To guarantee this, you need to explicitly program a so-called memory barrier. It is a low-level CPU primitive that ensures that all CPU caches are in a consistent state and have an up-to-date view of memory: all pending writes must be flushed to memory, and the caches then need to be refreshed.

This is available in .NET: the Thread.MemoryBarrier() method implements one. Given that this is 90% of the job the lock statement does (and 95+% of its execution time), you are simply not ahead of the game by avoiding the tools .NET gives you and trying to implement your own.

When it comes to multithreading, you have to know exactly what you are doing. I mean: explore all the possible scenarios and cases that can occur when working in a multithreaded environment. Lock-free multithreading is not a library or a class that we incorporate; it is knowledge and experience that we gain along our journey with threads.

While lock-free threading can be difficult in .NET, you can often make significant improvements when using a lock by studying carefully what actually needs to be locked and minimizing the locked section. This is also known as minimizing the granularity of the lock.

As an example, say you need to make a collection thread-safe. Don't just blindly throw a lock around a method that iterates over the collection and performs a CPU-intensive task on each item. You might only need the lock to take a shallow copy of the collection. Iterating over the copy can then work without a lock. Of course, this depends heavily on the specifics of your code, but I was able to fix a lock-convoy problem with this approach.
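The copy-then-iterate idea can be sketched like this (an illustrative Java version of the approach described above, with string lengths standing in for the CPU-intensive per-item work):

```java
import java.util.ArrayList;
import java.util.List;

public class SnapshotIteration {
    private final Object lock = new Object();
    private final List<String> items = new ArrayList<>();

    public void add(String item) {
        synchronized (lock) {
            items.add(item);
        }
    }

    /** Hold the lock only long enough to take a shallow copy. */
    public int processAll() {
        List<String> snapshot;
        synchronized (lock) {
            snapshot = new ArrayList<>(items); // brief critical section
        }
        // The expensive per-item work happens OUTSIDE the lock,
        // so writers are never blocked behind it.
        int processed = 0;
        for (String s : snapshot) {
            processed += s.length(); // stand-in for CPU-heavy work
        }
        return processed;
    }
}
```

The trade-off is that the iteration sees a snapshot, not live updates, which is exactly why this only works when the specifics of your code allow it.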
