Executing both branches in advance ?

Skybuck Flying · Dec 19, 2006

Hello,

It seems CPU's nowadays have prediction logic, to try and predict which
branch will be taken, and prepare/execute those instructions (pipelining
etc), if it was miss-predicted the performance suffers.

Why not execute both branches in two seperate logic units etc... and once
the real outcome is know continue with one of the two prepared branches ?

Bye,
Skybuck.

Joel Kolstad · Dec 19, 2006

Skybuck Flying said:
Why not execute both branches in two seperate logic units etc... and once
the real outcome is know continue with one of the two prepared branches ?

This is not uncommon; it's termed "speculuative execution" and has appeared in
the occasional desktop PC (the old DEC/Compaq Alpha CPU probably being the
most common -- I don't think Pentium do it).

It's not ubiquitous because -- with branch prediction already as good as it
is -- you're adding significant extra hardware to execute both paths
simultaneously and then having to cancel the path that turns out to be
"wrong." Additionally, with pipelines as deep as they are now, it probably
wouldn't be uncommon to hit another branch before your even knew the correct
result of the *first* branch, and now you're stuck with need either 4 parallel
execution units or stalling anyway. So... most people seem to think
additional hardware resources are better spent on avoiding stalls through,
e.g., scheduling multiple *threads* simultaneously (e.g., Intel's
Hyperthreading).

The comp.arch guys could give you much better answers than I can; I'm just
recalling what I learned back in school from Patteron/Hennessy's book.
(You'll definitely want to check out a copy some time... it's a pretty simple
book -- anyone with a high-school education can understand it --- but they do
a very good job of explaining how you'd go about building your own RISC CPU,
as well as why a lot of ideas that sound good on paper end up not being worth
the transistors in actuality.)

@#@#$% -- I'm noticing you already posted this to comp.arch... well, ignore
me -- as I say, they know much more about it than I do!

---Joel

Mark Brehob · Dec 19, 2006

The idea of executing down both paths (called, imaginatively enough
"dual-path execution") is actually pretty complex. Firstly, you'd only
want to do this for branches that you suspect you will get wrong. So
you need a confidence predictor (not hard to build mind you, and some
branch predictors effectively provide one). Then comes the hard part.
You need to be able to start both "threads" and kill the wrong one
later. This gets tricky to do well. Killing a somewhat arbitrary set
of

I just had a student group do a class project where they implemented a
very simple version of this and they really struggled with it. The
group had built a functional out-of-order processor (in synthesizable
Verilog) but that part was a huge amount of work and didn't really help
performance enough to offset the clock-period effects. I'm not saying
it can't be done (I suspect it can be done with a performance gain) but
this was a pretty good group of folks and they found it hard to get
right.

So the idea's been around. AFAIK, no commercial processor has done
this. And with power as the leading constraint I'm not sure anyone
will in the near-term.

Mark Brehob,
Lecturer in EECS at the University of Michigan.

Tim Wescott · Dec 19, 2006

Mark Brehob wrote:
(top posting fixed)

The idea of executing down both paths (called, imaginatively enough
"dual-path execution") is actually pretty complex. Firstly, you'd only
want to do this for branches that you suspect you will get wrong. So
you need a confidence predictor (not hard to build mind you, and some
branch predictors effectively provide one). Then comes the hard part.
You need to be able to start both "threads" and kill the wrong one
later. This gets tricky to do well. Killing a somewhat arbitrary set
of

I just had a student group do a class project where they implemented a
very simple version of this and they really struggled with it. The
group had built a functional out-of-order processor (in synthesizable
Verilog) but that part was a huge amount of work and didn't really help
performance enough to offset the clock-period effects. I'm not saying
it can't be done (I suspect it can be done with a performance gain) but
this was a pretty good group of folks and they found it hard to get
right.

So the idea's been around. AFAIK, no commercial processor has done
this. And with power as the leading constraint I'm not sure anyone
will in the near-term.

Mark Brehob,
Lecturer in EECS at the University of Michigan.

The decision should always be based on the answer to the question "now
that I've used x transistors and y watts to gain a performance gain of z
-- could I have done better spending power and real estate somewhere else?"

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Posting from Google? See http://cfaj.freeshell.org/google/

"Applied Control Theory for Embedded Systems" came out in April.
See details at http://www.wescottdesign.com/actfes/actfes.html

Ancient_Hacker · Dec 19, 2006

Skybuck Flying wrote:

Why not execute both branches in two seperate logic units etc... and once
the real outcome is know continue with one of the two prepared branches ?

That could be doable to a certain extent, but it may only make sense to
do this if there absolutely is no other work that can be done for other
threads or processes. On modern OS's there's usually something else
that can be done with the extra CPU resources. You realize executing
other threads is 100% use of the other resources, executing code at
other speculative jumps is at best 50% utilization.

Anton Ertl · Dec 19, 2006

Joel Kolstad said:
This is not uncommon; it's termed "speculuative execution"

No, speculative execution is predicting the branch (or something
else), and following only that prediction. Following both sides of
each branch has been called "eager execution".

and has appeared in
the occasional desktop PC (the old DEC/Compaq Alpha CPU probably being the
most common -- I don't think Pentium do it).

No Alpha has ever performed eager execution. However, IIRC the first
Power implementation did at least fetch both sides of a conditional
branch.

Followups to comp.arch.

- anton

Dennis · Dec 20, 2006

Mark said:
The idea of executing down both paths (called, imaginatively enough
"dual-path execution") is actually pretty complex. Firstly, you'd only
want to do this for branches that you suspect you will get wrong. So
you need a confidence predictor (not hard to build mind you, and some
branch predictors effectively provide one). Then comes the hard part.
You need to be able to start both "threads" and kill the wrong one
later. This gets tricky to do well. Killing a somewhat arbitrary set
of

I just had a student group do a class project where they implemented a
very simple version of this and they really struggled with it. The
group had built a functional out-of-order processor (in synthesizable
Verilog) but that part was a huge amount of work and didn't really help
performance enough to offset the clock-period effects. I'm not saying
it can't be done (I suspect it can be done with a performance gain) but
this was a pretty good group of folks and they found it hard to get
right.

So the idea's been around. AFAIK, no commercial processor has done
this. And with power as the leading constraint I'm not sure anyone
will in the near-term.

IBM System 360/85 in 1969. When it came to a branch it would prefetch
both paths up to 16 bytes (or the next branch). When the branch was
determined it would use the result from the "right" path. It took 1
cycle to switch if the branch went to the non current prefetch buffer.
Also one of the first machines to have cache memory.

Eli · Dec 20, 2006

Skybuck said:
Hello,

It seems CPU's nowadays have prediction logic, to try and predict which
branch will be taken, and prepare/execute those instructions (pipelining
etc), if it was miss-predicted the performance suffers.

Why not execute both branches in two seperate logic units etc... and once
the real outcome is know continue with one of the two prepared branches ?

Bye,
Skybuck.

Hi all,

While many of the arguments here are correct (power, complexity, etc.)
the idea is also flawed at the theoretic level. It "may" give some
performance, but for a non-trivial reason.

All the people that proposes it make the mistake to compare to different
machines for the cases with and without eager execution (that is what we
called it in the past.) In the post above, we have the term "separate"
logic unit. If you are willing to add a logic unit, why not to use it
for the standard code as well? (I will come back to this later, so don't
kill me yet)

So, the merits of this idea must be compared when the machine back end
is given. Make it twice as big (of what?) if you want, but now use it
for both cases.
The two cases, for low confidence branches, are:
1) I predict the branch and run the predicted flow alone. It runs at the
full CPU speed.
2) I run both flows. Each have half the resources so it runs at half
speed (yes, I know, just wait)

Now, if my probability teacher was right, I expect case one to run at
full speed x branch prediction probability. If the confidence predictor
tells me a low confidence prediction is ~70% (this is usually the case,
I was told), then the performance is ~70% of full speed.
For the second case, I always run at 50% of full speed.
Hence, eager execution is a statistic loss for branches with more than
50% confidence.

Can we use it only for branches with less than 50% confidence? Well, if
you think the confidence is less than 50%, just flip the prediction.

Ok, but I made to assumptions: 1) I can make a bigger machine (double)
and get twice the speed for one flow. 2) Each flow can use 100% of what
its given, so when I run two flows, I get 50%.
In fact, these two assumptions mean that the amount of ILP in the CPU,
in the two cases, are the same. But the two flows are by definition
independent, so if I use eager execution, I rise the instruction level
parallelism. This is in fact the principle behind SMP (see Intel
Hyperthreading.) If, for example, the two flows run at 60% of full
speed, then the aggregated performance is 120%.
But this is a weak argument, since the extra parallelism doesn't give
much on real machines, and the predictors have a pretty good level of
confidence, even for weird or unknown branches.

Eli

Discalimer: I work for Intel, but this has NOTHING to do with it, and
NOTHING must be inferred from it.

Terje Mathisen · Dec 20, 2006

Eli said:
Hi all,

While many of the arguments here are correct (power, complexity, etc.)
the idea is also flawed at the theoretic level. It "may" give some
performance, but for a non-trivial reason.

All the people that proposes it make the mistake to compare to different
machines for the cases with and without eager execution (that is what we
called it in the past.) In the post above, we have the term "separate"
logic unit. If you are willing to add a logic unit, why not to use it
for the standard code as well? (I will come back to this later, so don't
kill me yet)

So, the merits of this idea must be compared when the machine back end
is given. Make it twice as big (of what?) if you want, but now use it
for both cases.
The two cases, for low confidence branches, are:
1) I predict the branch and run the predicted flow alone. It runs at the
full CPU speed.
2) I run both flows. Each have half the resources so it runs at half
speed (yes, I know, just wait)

Now, if my probability teacher was right, I expect case one to run at
full speed x branch prediction probability. If the confidence predictor
tells me a low confidence prediction is ~70% (this is usually the case,
I was told), then the performance is ~70% of full speed.
For the second case, I always run at 50% of full speed.
Hence, eager execution is a statistic loss for branches with more than
50% confidence.

This is _totally_ bogus!

You are completely disregarding the cost of restarting the pipeline each
time you've mispredicted a branch. Let's take a simple example using a
current Intel cpu, the Core 2:

A branch miss cost about 15 cycles, so if we have a branch that goes 80%
f the time in one direction, and 20% the other, those 20% will
mispredict, right?

Just to take a small exmple, the core operation when decoding
HD-TV/HD-DVD/Blu-Ray is a context-adaptive binary arithmetic decoder,
where the least probable symbol occurs something like 20-30% of the time.

The reference C code looks like this:

bit = state & 1;
if (offset < range)
{
state = transMPS[state];
}
else
{
offset -= range;
range = range_lps;

bit ^= 1;
state = transLPS[state];
}

I.e. the short path is the most likely, but the other path is only three
instructions longer.

Taking a branch miss every 5 iterations corresponds to a cost of 3-5
cycles/iteration, so if we can run both halves in parallel and merge the
results in less than this time, then it will be a clear win.

According to the docs of the VLC media player, the latest ffmpeg decoder
does use conditional move operations to remove the branch from the code
above, simply because doing so is a win on nearly all current cpus.

Having hardware available to do the same thing would be a much clearer
win, since hw could presumably handle the cancellation of the non-taken
path in not more than a cycle or two, right?

With 5-6 execution units in a Core 2 cpu, including one load unit, you
could run all the operations in the long path in a single cycle, so even
if it is four times as long, it has the same latency as the short path!

It is also important to note that this code has a lot of sequentially
dependent operations, so most of the time nearly all those execution
units are idle. :-(

Can we use it only for branches with less than 50% confidence? Well, if
you think the confidence is less than 50%, just flip the prediction.

Ok, but I made to assumptions: 1) I can make a bigger machine (double)
and get twice the speed for one flow. 2) Each flow can use 100% of what
its given, so when I run two flows, I get 50%.
In fact, these two assumptions mean that the amount of ILP in the CPU,
in the two cases, are the same. But the two flows are by definition
independent, so if I use eager execution, I rise the instruction level
parallelism. This is in fact the principle behind SMP (see Intel
Hyperthreading.) If, for example, the two flows run at 60% of full
speed, then the aggregated performance is 120%.
But this is a weak argument, since the extra parallelism doesn't give
much on real machines, and the predictors have a pretty good level of
confidence, even for weird or unknown branches.

See above, making that code branchless actually does help.

Yes, eager execution is a net loss for most branches in real code, but
when you need it, you need it bad.

Terje

Nicholas King · Dec 21, 2006

Eli said:
Ok, but I made to assumptions: 1) I can make a bigger machine (double)
and get twice the speed for one flow. 2) Each flow can use 100% of what
its given, so when I run two flows, I get 50%.
In fact, these two assumptions mean that the amount of ILP in the CPU,
in the two cases, are the same. But the two flows are by definition
independent, so if I use eager execution, I rise the instruction level
parallelism. This is in fact the principle behind SMP (see Intel
Hyperthreading.) If, for example, the two flows run at 60% of full
speed, then the aggregated performance is 120%.
But this is a weak argument, since the extra parallelism doesn't give
much on real machines, and the predictors have a pretty good level of
confidence, even for weird or unknown branches.

Those 2 assumptions just don't happen in the real world. It's a
challenge to keep current machines full. One of the reasons the industry
is going multi-core is because it's hard to keep the current number of
logic units full. Then as Terje has pointed out you've also got the case
where the branch misprediction costs more than it would've to execute
both branches.

Robert Myers · Dec 21, 2006

Terje said:
Yes, eager execution is a net loss for most branches in real code, but
when you need it, you need it bad.

Leaving issues of power and transistor count aside, it's a net loss in
time only if the branch that should have been executed takes less time
than the branch that shouldn't have been executed. Even that
disadvantage could be eliminated if execution were allowed to proceed
on both branches until it is known that one branch should be killed.

It's a many-universes style of execution: at each branch, another
universe is created until someone opens the box and finds out if the
cat is dead or alive. The parallel universes don't affect one another
in any way. What am I missing?

Robert.

joseph2k · Dec 24, 2006

Ancient_Hacker said:
Skybuck Flying wrote:

That could be doable to a certain extent, but it may only make sense to
do this if there absolutely is no other work that can be done for other
threads or processes. On modern OS's there's usually something else
that can be done with the extra CPU resources. You realize executing
other threads is 100% use of the other resources, executing code at
other speculative jumps is at best 50% utilization.

Not so. It is still 100% utilization, you just have uncommitted results
floating around. Besides there is a serious overhead in thread switching.

Start with the Tomasulo algorithm (i think you already know about pipelined
execution). Speculative execution has been implemented on several
processors including some versions of SPARC and MIPS. It does bring with
an associated cost in reworking the compiler and all the libraries to work
optimally with it.

Executing both branches in advance ?

Executing both branches in advance ?

Skybuck Flying

Joel Kolstad

Mark Brehob

Tim Wescott

Ancient_Hacker

Anton Ertl

Dennis

Eli

Terje Mathisen

Nicholas King

Robert Myers

joseph2k

Similar threads