Kids can do math better than x86 cpu's.

Michael A. Terrell · Mar 12, 2007

JimKeo said:
Terje,

Your x86 BCD reference jogs my memory back to 80's on BYTE
Magazine's BIX (Byte Information eXchange) teleconferencing system
(cosy).

Some of us processor-bigots, in the CPU or CPUs confernece, had an
informal my-cpu-is-better-than-yours speed competition involving
multiplication of 72-bit unsigned binary integers. The 9 bytes were
picked, in theory, so as not to give arbitrary advantage to cpus of
particular word sizes.

My entry for Apple /// (8-bit cpu Mostek 6502B)

The 6502 family was made by MOS Technology, not Mostek.

--
Service to my country? Been there, Done that, and I've got my DD214 to
prove it.
Member of DAV #85.

Michael A. Terrell
Central Florida

joseph2k · Mar 12, 2007

Terje said:
Richard said:

[Please do not mail me a copy of your followup]

Terje Mathisen <[email protected]> spake the secret code

....which makes it perfect for reconfigurable computing.
Not really.

The critical building block, as Nick have already mentioned, is the
NxN->2N multiplier (and/or MAC/FMAC), which means that you don't need a
lot of uncommitted gates, but instead a lot of multipliers and enough
(reconfigurable) routing between them to setup the required operations.

Click to expand...

I don't understand your objection. You can implement multipliers in
FPGAs, plenty of people have.

Click to expand...

Not really.

Using general FPGA cells to implement a multiplier will be _far_ from
efficient compared with a dedicates slice of silicon.

Bignum math is either trivial (no efficiency/speed requirements), or
critical.

In the latter case FPGAs are unsuited.

For arbitrary precision arithmetic you
might find a bit-serial approach easier to implement on an FPGA. That
means you can probably have several multipliers running concurrently.

Click to expand...

See above.

I guarantee you that this will be significantly slower than quite simple
x86 code using the basic 64x64->128 bitMUL opcode.

Terje

Please consider this as a possible alternative view:
directreadout.noaa.gov/miami04/ docs/weds/Receiver_Technology.pdf

It is quite in line with the things i have been reading for some years.

Please also look into software defined radio, and "Cell" processor
supercomputer.

joseph2k · Mar 12, 2007

which is basically how humans would do it; digit by digit

-Lasse

Try Booth's algorithm and fully parallel multipliers. They have been
combined for even higher hardware performance, at the cost of lots of
silicon real estate.

joseph2k · Mar 12, 2007

Jim said:
They're back to teaching "casting out 9's" in the schools here.
Except they gave it some different name to confuse grandparents ;-)

...Jim Thompson

Ant not one of them understand the implications of R's representations and
R's-1 representations.

joseph2k · Mar 12, 2007

Skybuck said:
Kids don't do math with multipliers.

Kids do math with lookup table for the digits (last digit is X=10)

Usually called math tables I guess:

1*1 = 1
1*2 = 2
1*3 = 3
1*4 = 4
1*5 = 5
1*6 = 6
1*7 = 7
1*8 = 8
1*9 = 9
1*X = 10

2*1 = 1
2*2 = 3
2*3 = 6
2*4 = 8
2*5 = 10
2*6 = 12
2*7 = 14
2*8 = 16
2*9 = 18
2*X = 20

3*1 = 3
3*2 = 6
3*3 = 9
3*4 = 12
3*5 = 15
3*6 = 18
3*7 = 21
3*8 = 24
3*9 = 27
3*X = 30

4*1 = 4
4*2 = 8
4*3 = 12
4*4 = 16
4*5 = 20
4*6 = 24
4*7 = 28
4*8 = 32
4*9 = 36
4*X = 40

5*1 = 5
5*2 = 10
5*3 = 15
5*4 = 20
5*5 = 25
5*6 = 30
5*7 = 35
5*8 = 40
5*9 = 45
5*X = 50

6*1 = 6
6*2 = 12
6*3 = 18
6*4 = 24
6*5 = 30
6*6 = 36
6*7 = 42
6*8 = 48
6*9 = 54
6*X = 60

7*1 = 7
7*2 = 14
7*3 = 21
7*4 = 28
7*5 = 35
7*6 = 42
7*7 = 49
7*8 = 56
7*9 = 63
7*X = 70

8*1 = 8
8*2 = 16
8*3 = 24
8*4 = 32
8*5 = 40
8*6 = 48
8*7 = 56
8*8 = 64
8*9 = 72
8*X = 80

9*1 = 9
9*2 = 18
9*3 = 27
9*4 = 36
9*5 = 45
9*6 = 54
9*7 = 63
9*8 = 72
9*9 = 81
9*X = 90

X*1 = 10
X*2 = 20
X*3 = 30
X*4 = 40
X*5 = 50
X*6 = 60
X*7 = 70
X*8 = 80
X*9 = 90
X*X = X0

That should do it

Try coming up with an algorithm that uses math tables only

That's what cpu does:

Math tables:

0 * 0 = 0
1 * 0 = 1
0 * 1 = 1
1 * 1 = 1 (and carry ?? <- digit missing should have been X ? )

Bye,
Skybuck.

How about you research the 40 year old Booth's algorithm, or the 50 year old
Wallace tree method. You just might learn something about how modern
hardware actually works. I happen to know that most modern general purpose
processors use a combination of the two, specifically later x86, SPARC, and
PPC families.

Terje Mathisen · Mar 12, 2007

JimKeo said:
Terje,

Your x86 BCD reference jogs my memory back to 80's on BYTE
Magazine's BIX (Byte Information eXchange) teleconferencing system
(cosy).

I have very fond memories of BIX, I used to be [email protected], frow well
before BIX became box.com and got an internet SMTP gateway.

Some of us processor-bigots, in the CPU or CPUs confernece, had an
informal my-cpu-is-better-than-yours speed competition involving
multiplication of 72-bit unsigned binary integers. The 9 bytes were
picked, in theory, so as not to give arbitrary advantage to cpus of
particular word sizes.

My entry for Apple /// (8-bit cpu Mostek 6502B) just used the
lowest level bit bit-test/add/shifting/carry instructions. The code
first determined the 6 10-byte results from one of the multipliers
(the one on the left {smile}) when multiplied by 2, 4, 8, 16, 32, 64
and 128. This actually meant 8 results were precalculated in that
multiplying by 0 was obvious and multiplying by 1 was already given.
The 6 values were done with simple bitshifts/carry/add. Once that was
done then the other multipler was processed bit-by-bit right-to-left
and answer adjusted/accumulated accordingly just like a school child
would do in decimal on paper. Choosing which multiplier (OK,
multiplier, multiplicand) to process based on the lower number of 1
was sometimes useful.

OK.

I've written long division routines that work the same way.

For table-driven multiplication I'm partial to x^2 tables, they make it
trivial to fit at least an 8-bit wide table into fast memory.

Terje

Ken Hagan · Mar 12, 2007

2*1 = 1
2*2 = 3

I think you've just demonstrated why no kid is likely to solve that
several hundred digit problem you posted earlier, which another kind
poster solved in a few seconds using fixed sized arithmetic hardware.

Now, how much money were you offering?

Erik Magnuson · Mar 13, 2007

Terje said:
For table-driven multiplication I'm partial to x^2 tables, they make it
trivial to fit at least an 8-bit wide table into fast memory.

Shades of "Cheaper by the Dozen" where Pa Gilbreth had his kids memorize
the squares of 1 through 50 for multiplication. What scares me is that I
read the book 40 years ago and still remember that detail...

- Erik

robertwessel2@yahoo.com · Mar 13, 2007

Kids don't do math with multipliers.

Kids do math with lookup table for the digits (last digit is X=10)

Usually called math tables I guess:

(...)>
That should do it

Try coming up with an algorithm that uses math tables only

That's what cpu does:

Math tables:

0 * 0 = 0
1 * 0 = 1
0 * 1 = 1
1 * 1 = 1 (and carry ?? <- digit missing should have been X ? )

When you do multiple precision multiplication you tend to use a base
that's convenient to the primitives that the CPU offers. For a CPU
providing a 32x32->64 multiply, working in base 2**32 is usually
convenient. The values of a "digit" range from 4,294,967,295. The
only purpose of the multiply is then to provide the result of
multiplying two digits. For doing the base 10 multiplications you're
describing, kids do indeed usually learn the multiplication table by
rote. the 32x32 multiplier can be implemented by any scheme the CPU
designer chooses, but a lookup table for that size inputs would be
prohibitively large.

On the flip side, multiplication of two single binary digits is
exactly equivalent to an "and" gate. In fact you basic hardware
multiplier is just a bunch of and gates to compute each partial
product), and then a bunch of adders to add them all together.

In general you want to use as large a digit as practical, since most
of the routines will scale with the number of digits, linearly, or
worse.

Doing a multiplier bit-bit bit in software is perfectly simple,
although almost always slower than necessary. It's almost always
easier and faster to deal with one operand a byte/word/etc. at a time,
and the other operand a bit at a time, and then add a bit of basic
shifting and adding, and you're done. This is exactly the scheme used
on many 8 bit CPUs which do not have a multiplier (for example the
6502).

Using a lookup table is also not an unreasonable approach assuming you
don't have a multiplier but you do have a sufficient amount of
reasonably fast memory available. For example, the basic
multiplication algorithm on the 6502 took 200 clocks to multiply two
eight bit numbers. If you were willing to dedicate about 16KB of
memory to lookup tables (of course you never were), you could do it in
about 60 clocks (if either of those clock counts are off, well... it's
been 25 years). There were other possible tradeoffs too.

robertwessel2@yahoo.com · Mar 13, 2007

Kids don't do math with multipliers.

Kids do math with lookup table for the digits (last digit is X=10)

Usually called math tables I guess:

(...)>
That should do it

Try coming up with an algorithm that uses math tables only

That's what cpu does:

Math tables:

0 * 0 = 0
1 * 0 = 1
0 * 1 = 1
1 * 1 = 1 (and carry ?? <- digit missing should have been X ? )

When you do multiple precision multiplication you tend to use a base
that's convenient to the primitives that the CPU offers. For a CPU
providing a 32x32->64 multiply, working in base 2**32 is usually
convenient. The values of a "digit" range from 4,294,967,295. The
only purpose of the multiply is then to provide the result of
multiplying two digits. For doing the base 10 multiplications you're
describing, kids do indeed usually learn the multiplication table by
rote. the 32x32 multiplier can be implemented by any scheme the CPU
designer chooses, but a lookup table for that size inputs would be
prohibitively large.

On the flip side, multiplication of two single binary digits is
exactly equivalent to an "and" gate. In fact you basic hardware
multiplier is just a bunch of and gates to compute each partial
product), and then a bunch of adders to add them all together.

In general you want to use as large a digit as practical, since most
of the routines will scale with the number of digits, linearly, or
worse.

Doing a multiplier bit-bit bit in software is perfectly simple,
although almost always slower than necessary. It's almost always
easier and faster to deal with one operand a byte/word/etc. at a time,
and the other operand a bit at a time, and then add a bit of basic
shifting and adding, and you're done. This is exactly the scheme used
on many 8 bit CPUs which do not have a multiplier (for example the
6502).

Using a lookup table is also not an unreasonable approach assuming you
don't have a multiplier but you do have a sufficient amount of
reasonably fast memory available. For example, the basic
multiplication algorithm on the 6502 took 200 clocks to multiply two
eight bit numbers. If you were willing to dedicate about 16KB of
memory to lookup tables (of course you never were), you could do it in
about 60 clocks (if either of those clock counts are off, well... it's
been 25 years). There were other possible tradeoffs too.

Michael A. Terrell · Mar 13, 2007

John said:
[....]
software is doing and you'd be right, but it would still be an
implementation in hardware.

Intel are bad teachers.

The CDP1802 did its math 1 bit at a time. The Z80 did internal stuff in
groups of 4 bits. The 8080 did some things in 8 or 16 bits. The PDP-8
did 12 bits. The 68000 did 16 bits. Everyone does a limited number of
bits at a time.

There was at least one PDP-8 model (8S?) that did bit serial
arithmetic too.

Click to expand...

Click to expand...

With discrete transistors!

Yes. it was the "S". Also the 8051 was serial internally.

There was also the MC14500.

If you had enough 2900s, you could do any number of bits.

Click to expand...

The Data General NOVA was a 16-bit machine that used nibble math, with
a single 4-bit ALU, as I recall.

John

I had a pair of Nova computers about 15 years ago.

--
Service to my country? Been there, Done that, and I've got my DD214 to
prove it.
Member of DAV #85.

Michael A. Terrell
Central Florida

Ken Smith · Mar 14, 2007

Terje Mathisen said:
I've written long division routines that work the same way.

For table-driven multiplication I'm partial to x^2 tables, they make it
trivial to fit at least an 8-bit wide table into fast memory.

The LSB of (A-B) is the same as the LSB of (A+B)

(A+B)^2 - (A-B)^2

= A^2 + 2AB + B^2 - A^2 + 2AB -B^2

= 4AB

You can improve the speed by a little creative table generation.

Rob Warnock · Mar 14, 2007

+---------------
| [email protected] (Ken Smith) wrote:
| >>[email protected] says...
| >>There was at least one PDP-8 model (8S?) that did bit serial
| >>arithmetic too.
| >Yes. it was the "S". Also the 8051 was serial internally.
|
| With discrete transistors!
+---------------

The LGP-30 [my first computer!] was bit serial, with
discrete vacuum tubes! And in fact, only 15 bits of state
were stored in the vacuum tube flip-flops -- all of the
rest of it was stored on the drum, including the PC, the
instruction register, and the accumulator. Despite that,
it had hardware multiply (two kinds: save upper 31 bits
of result or save lower 31 bits) and hardware divide. See:

http://ed-thelen.org/comp-hist/lgp-30.html
http://ed-thelen.org/comp-hist/lgp-30-man.html

-Rob

Rich Grise · Mar 14, 2007

+---------------
| [email protected] (Ken Smith) wrote:
| >>[email protected] says...
| >>There was at least one PDP-8 model (8S?) that did bit serial
| >>arithmetic too.
| >Yes. it was the "S". Also the 8051 was serial internally.
|
| With discrete transistors!
+---------------

The LGP-30 [my first computer!] was bit serial, with
discrete vacuum tubes! And in fact, only 15 bits of state
were stored in the vacuum tube flip-flops -- all of the
rest of it was stored on the drum, including the PC, the
instruction register, and the accumulator. Despite that,
it had hardware multiply (two kinds: save upper 31 bits
of result or save lower 31 bits) and hardware divide. See:

http://ed-thelen.org/comp-hist/lgp-30.html
http://ed-thelen.org/comp-hist/lgp-30-man.html

Sounds vaguely like the G-15:
(CAUTION! 2.8 MB .pdf file)
http://www.neodruid.net/Share/G-15D_tech-manual.pdf

Which was the first computer I learned to program, in about 1966.

Cheers!
Rich

Kids can do math better than x86 cpu's.

Kids can do math better than x86 cpu's.

Michael A. Terrell

joseph2k

joseph2k

joseph2k

joseph2k

Terje Mathisen

Ken Hagan

Erik Magnuson

[email protected]

[email protected]

Michael A. Terrell

Ken Smith

Rob Warnock

Rich Grise

Similar threads