High Performance Computing

Overview of Projects

I have always been attracted to the "performance" or speed-related aspects of computer programming problems. Here I am using a broad definition of "high perfomance", including anything that would not be possible if the hardware were half as fast. For example, all of the projects during the first 8 years were in assembly language (for the Z80, 6502 and 68000 microprocessors) because that's what had to be done to get the performance.

The column headings are defined below the table. Each name links to a brief description. Some of the descriptions also link elsewhere on my website.

date	cpu	SMP	distrib	vector	name
1978xxxx	*				MUSIC4
1980xxxx	*				ROUTONES
19821xxx	*	¹			16 SPRITES
1983xxxx					SWAP SWAP
1984xxxx	*				Orion
1984xxxx	*				Super MANDELZOOM
1990xxxx		*	*		parallel1
199xxxxx	*		*		LUAMS
199xxxxx	*			*	rgb2hsv
19990518	*		*		luams-3
20000719	*				LMZ
20040804		*	*		parallel2
20040922				GPU	(first experiments, lost)
20060801	*	*	*		TMZ
20081027	*	*			dicek
20090106	*		*		vision-board
20090202	*	*			A094358
20090213			*		buddha
20090214	*	*		*	PDE4 (Xmorphia)
20090218	*			*	LINPACK
20090810	*				rubik3
20091003	*	*			A020916_parallel
20091025	*		*		Sloandora
20091104	*	*			islands
20091227	*		*		A005646
20100927	*	*			TMZ-core
20100928	*			GPU	Xmorphia PDE5
20110521					Turbo mode demos

abbreviations

cpu : Software is "CPU-intensive"

smp : Symmetric multiprocessing is used: A single program that runs multiple threads communicating with each other through shared memory and multithreading library functions.

distrib : "Distributed" multiprocessing is used: Several programs (usually one or two per processor core) that communicate only through pipes, the OS, or the filesystem.

vector : Software uses SIMD hardware, such as the "vector" processing capability of the CPU, or the GPU.

footnotes

1 : : Not "parallel" by today's standards, but at the time it would have been considered parallel.

MUSIC4

was an assembly-language 4-voice music synthesizer for the Z80. At the time no polyphonic music software existed. I had to use all of 16 registers and made a cardboard model to keep track of which register banks were swapped in as I traced through the code by hand.

ROUTONES

was a set of assembly-language routines for the Apple II, including an arbitrary-precision integer arithmetic (or "BIGINT") library. I used it to compute the exact values of numbers like 7²⁰⁰.

16 SPRITES on C-64 (real-time interrupt handler)

The Commodore-64 display had only 8 hardware "sprites" (movable graphic objects) but there were interrupt vectors to allow clever programmer to move the sprites so quickly as to create the illusion of more (by moving in synchonization with the raster scan of the video output hardware).

I provided 16 sprites, each with a bank of control registers including position and velocity at 1/256-pixel resolution. The software was an interrupt handler that ran 960 times per second, maintaining a sorted linked list, and using about 40% of the CPU's runtime. 1983 was fun (-:

SWAP-SWAP on C-64 (cooperative multi-tasking executive)

Another interrupt handler that performed a context switch between two different virtual-machines in response to a little-used keyboard combination. The registers, stack, "page 0", interrupt vectors, screen contents, and other video registers (including sprites) were saved and restored. Using this software, the user could have two separate BASIC programs in memory at the same time, each with its own data in memory, screen display, etc. Swapping caused disk I/O to suspend and resume without error.

I sent this project to Creative Computing magazine and was paid an author's fee; it was to be included in a book by the magazine, but that project was cancelled.

Orion

was a space flight simulator for the original Apple Macintosh. When I interviewed for my first post-college job I brought Orion as a demo. The machine they gave me to run it on seemed to be running 50% faster, so I told them I thought it must be running at about 12 MHz. I was right.

Super MANDELZOOM (cooperative multi-tasking using coroutines)

Super MANDELZOOM was compute-intensive in a trivial way, because its purpose was to view the Mandelbrot set. However I did a lot of work to make it seem much faster: four sets of arithmetic routines (16-bit, 18-bit, 30-bit and 32-bit) for graceful degradation by only using as much precision as is actually needed, and a series of scans with successive refinement so that the user can get a rough view quickly, which is often enough to decide where to scroll or zoom next. The user interface used co-routines to simulate multi-tasking, so that user interface operations like rubber-band rectangle selection did not interrupt the in-progress scanning and plotting of the Mandelbrot image.

Parallel 1 (old parallel algorithm testbed)

Not an actual parallel program, but a bunch of routines to simulate the data flow between multiple CPUs in a message-passing SMP design. I used it to work out the details of the bitonic sort algorithm on hypercube networks (as described in The Connection Machine by Daniel Hillis) and explore a few simple SIMD ideas.

LUAMS

(original early 1993 and 1996 versions)

In this project I measured the area of the Mandelbrot by simple "pixel-counting". It wasn't really "parallel", just an easily-divided task being done in pieces on multiple machines.

rgb2hsv

function

This was my submission to the MacTutor magazine monthly programmer's challenge. The task was to convert a 24-bit, 3-component RGB tuple (three 8-bit values) to the corresponding 24-bit HSV values, with proper rounding, in the fastest possible time and with a limit of 128K of memory. I won the contest, narrowly beating the next-best entrant by using "parallel addition".

By using the full width of the processor's registers, allocating 10 or 11 bits to each component (8 + 8 + 8 with some padding for unwanted carry-bits, for a total of 32 bits), three-component vectors with 8-bit elements can be added in parallel.

luams-3

Just a port of LUAMS to another operating system.

LMZ (X windows mandelzoom project)

Just another Mandelbrot program, this one written for UNIX with an X Windows user interface.

proj/parallel2 (algorithm simulation only)

Re-implementation of the parallel1 work on a different OS.

proj/TMZ (ideas only)

Notes on a Mandelbrot design that supports both multi-threaded and distributed computing, and both (e.g. four networked CPUs each running one or more threads). No code written.

proj/libs/dicek (ideas only)

Beginnings of a runtime library to provide parallel data and execution abstractions, layered on top of pthreads. I got part of the API working, but had nothing I wanted to use it for, and it was made obsolete by OpenMP.

vision-board script

This is a very simple example of using multiple CPUs very effectively. It simply launches a large number of background tasks (instances of the recompress-jpeg script) in the background and uses the task_xxx routines to monitor them.

The first version was in 20080121, but it wasn't parallelized until 20090106. I started using the task_xxx library (in rpmlib.pl) on 20090215.

On 20120814 I added the "-thr" and "-purge" options, making it a lot easier to do benchmarks in vision-board. There is a benchmarks results table in the revision history of the vision-board Perl source. Results range from 1h20.8m on the 800 MHz PowerPC G4 to 1m28s on an 8-core workstation (dual Xeon E5520).

A094358 (part of proj/math)

Multi-threaded implementation of a search for counterexamples to the "2^^n = 1 mod n" hypothesis of integer sequence A094358.

buddha (multi-threaded but not CPU-limited)

This just plays multipe sound files in loops using multiple copies of VLC; it detects when each VLC task exits and spawns another with randomly selected parameters.

PDE4 (Xmorphia)

LINPACK

A somewhat customized version of the 1988 Bonnie Toy version (with Jack Dongarra's 1994 fix to daxpy). Presently neither vector nor multithreaded, but I began changes to prepare for both.

rubik3

A020916_parallel (part of proj/math)

Sloandora

Sloandora is an interactive discovery aid for researchers exploring the OEIS. It has its own page here: Sloandora: A metric and UI for Integer Sequence exploration.

islands

islands is another Mandelbrot program, a specialized project I created in 2003 December. It is massively parallel, but in a non "embarrassing" way. One thread, the prospector, executes an algorithm similar to the luams-3 grid scan, but spots island mu-molecules as clumps of pixels; when one is found it schedules work for one or more surveyor threads which make a high-precision measurement of area, location, and period or each.

A005646

TMZ-core

Xmorphia PDE5 screensaver

Turbo mode demos

On Mac OS it is nearly impossible to find out how fast the CPU clock is going on an Intel processor that has variable clock speed. This includes all of the "Core i" processors like the Core i7 in my MacBook Pro.

In a UNIX shell, type each of these two commands:

perl -e 'sleep(10);$|=1;while(1){print"."if(((++$i)%300000)==0);}' & while (echo -n); do echo x; sleep 1; done

This will generate a row of dots and an 'x' each second; the number of dots in each row is proportional to the speed the task is running.

To stop this display, first hit control-C to stop the foreground task (which prints the x's). Then type "kill %" and hit return. You won't be able to see the kill % clearly but it will work.

To see a significant change in the clock speed, you should make other cores busy. My MacBook Pro has 4 cores, so I made the other 3 cores busy by starting three Terminal windows, and in each one typing yes > /dev/null.

This page was written in the "embarrassingly readable" markup language RHTF, and was last updated on 2014 Dec 13.

s.27