.RP
.TL
Running Large Text Processes
.if n .br
on Small
.UX
Systems
.AU
Charles Haley
.AI
.MH
.AU
William Joy
.AI
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley  94720
.AB
.PP
We describe a set of simple modifications to the
Unix
system, which permit larger programs to be run than has previously
been possible.  In particular, the
.I f77
and
.I a68
compilers and version 2 of the
.I ex
editor,
which previously would not run on the non-separate I/D machines
such as the 11/23, 11/34 and 11/40, may be run, without source code
modification, using this scheme.
This scheme will also allow processes with more than 64K bytes of instruction
space to run on all PDP-11 CPUs with segmentation hardware.
.PP
The overlay scheme used has been designed so that it is transparent to
the C programmer.  Information about which routines are overlaid and in
which overlay they reside is not needed until load time, and only the C
overlay loader,
.I covld ,
need deal with this.
The system mechanism for implementing overlays should function for
languages other than C (such as
.I a68 )
but the current
.I covld
implementation deals specifically with creating load modules for C.
.AE
.SH
Introduction
.PP
The low cost and wide availability of small PDP-11's make it desirable to have
all of the programs available on the 11/70's and 11/45's of larger
.UX
installations available on the smaller machines such as 11/34's and 11/40's.
To date, this has not been possible, because the smaller machines do
not have the separate instruction and data spaces found on 11/45's and
11/70's, which provide a full 16-bit instruction address space separate
from the data space.
.PP
We have designed and implemented a scheme for running large processes
on machines without this separate I/D feature.  It may also be used to
run processes with more than 64K bytes of instruction space on 11/45's and 11/70's.
.SH
Strategy
.PP
The basic strategy is quite simple.  We avoid the complexity of most
overlay schemes, and adopt the following goals:
.IP 1)
The overlaying should be (almost) completely invisible to the programmer.
.IP 2)
No restrictions should be made on the language features available to
overlay programs.  In particular, function pointers in C must continue
to work, with all the same properties (uniqueness, etc.).
.IP 3)
The basic system interface should not impose the C runtime organization
any more than the current system does; in particular, other languages
such as
.I a68
should be usable in an overlay fashion, perhaps using a different loader.
.IP 4)
The strategy should be simple to implement.
.SH
New executable file formats
.PP
We have added two new ``magic numbers'' for executable files:
0430 and 0431.  The 0430 magic number corresponds to an overlaid
version of 0410 (shared text) executable files, and the 0431 number
corresponds to an overlaid version of separate I/D space files (which
normally have magic number 0411).
.PP
The
.I a.out
file format for these files differs from the normal file format as
follows:
.IP \(**
After the 8 word header and before the text of the program begins
is placed an 8 word array of overlay information.  The first word
of information is the maximum size of any of the overlays, and
the rest of the information gives the sizes of each of the (up to 7)
overlays.
.IP \(**
The base text follows the newly added overlay information, and is
in turn followed by the text of each overlay.  The overlay text
sizes are all multiples of ``core clicks'', i.e. each is rounded up to a
multiple of 64 bytes in size.
.LP
The rest of the
.I a.out
file is in the normal format.
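.PP
As an illustration only, the layout just described can be sketched in
present-day C.  The structure and function names below are hypothetical,
and the sketch assumes that all sizes are recorded in bytes and that the
text size in the normal header counts the base text only; neither
assumption is confirmed by the description above.
.DS
```c
#include <assert.h>
#include <stdint.h>

#define NOVL        7       /* at most 7 overlays, per the text */
#define CLICK       64      /* one core click, in bytes */
#define HDR_BYTES   16      /* the normal 8 word a.out header */

/* Hypothetical name: the 8 word overlay array after the header. */
struct ovlhdr {
    uint16_t max_ovl;          /* size of the largest overlay */
    uint16_t ov_siz[NOVL];     /* size of each overlay, 0 if unused */
};

/* Round a byte count up to a whole number of clicks. */
uint32_t click_round(uint32_t nbytes)
{
    return (nbytes + CLICK - 1) / CLICK * CLICK;
}

/* File offset of the base text: just past both headers. */
uint32_t base_text_offset(void)
{
    return HDR_BYTES + (uint32_t)sizeof(struct ovlhdr);
}

/* File offset of overlay n (1..NOVL); a_text is the base text size
 * in bytes (an assumption, see above). */
uint32_t ovl_offset(const struct ovlhdr *oh, uint32_t a_text, int n)
{
    uint32_t off = base_text_offset() + a_text;
    int i;

    assert(n >= 1 && n <= NOVL);
    for (i = 0; i < n - 1; i++)
        off += oh->ov_siz[i];
    return off;
}
```
.DE
.PP
Under these assumptions the base text would begin 32 bytes into the file,
and the first overlay would begin immediately after the base text.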
.SH
Segmentation register layout
.PP
When an 0430 or 0431 process is run, the overlay information and the
text for all of the overlays are saved by the system.
At any given point during program execution, only one of the
(up to) 7 overlays will be mapped into the process address space,
but the process can request, using a new
.I getovly
system call, that an overlay of its choice be mapped into a
portion of its address space shared by the overlays.
.PP
Thus, there are conceptually four possible usages for each segmentation
register:
.IP 1)
It may be part of the text segment (as before).
.IP 2)
It may be one of the overlay text segment registers, mapping
address space after the text segment (1) but before the data segment (3).
.IP 3)
It may be one of the data segment registers, or
.IP 4)
It may be one of the stack segment registers.
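.PP
The four usages can be pictured with a small model in present-day C.
This is a sketch, not the system's code: the names are hypothetical,
and it assumes a non-separate I/D process in which each region begins on
an 8192 byte register boundary and the stack occupies the highest registers.
.ID
```c
#include <assert.h>
#include <stdint.h>

#define SEGSIZE 8192u   /* bytes mapped by one segmentation register */
#define NSEG    8       /* registers per address space */

enum region { R_TEXT, R_OVERLAY, R_DATA, R_STACK, R_UNMAPPED };

/* Hypothetical description of one overlaid process. */
struct layout {
    uint32_t text;      /* base text size in bytes */
    uint32_t max_ovl;   /* size of the largest overlay in bytes */
    uint32_t data;      /* data + bss size in bytes */
    uint32_t stack;     /* stack size in bytes */
};

/* Number of segmentation registers needed to map nbytes. */
unsigned nregs(uint32_t nbytes)
{
    return (nbytes + SEGSIZE - 1) / SEGSIZE;
}

/* First address of the region shared by the overlays. */
uint32_t ovl_base(const struct layout *l)
{
    return nregs(l->text) * SEGSIZE;
}

/* First address of the data segment. */
uint32_t data_base(const struct layout *l)
{
    return ovl_base(l) + nregs(l->max_ovl) * SEGSIZE;
}

/* Classify a 16-bit virtual address by the usage of its register. */
enum region classify(const struct layout *l, uint16_t addr)
{
    unsigned seg = addr / SEGSIZE;
    unsigned t = nregs(l->text);
    unsigned o = nregs(l->max_ovl);
    unsigned d = nregs(l->data);
    unsigned s = nregs(l->stack);

    assert(t + o + d + s <= NSEG);
    if (seg < t)
        return R_TEXT;
    if (seg < t + o)
        return R_OVERLAY;
    if (seg < t + o + d)
        return R_DATA;
    if (seg >= NSEG - s)
        return R_STACK;
    return R_UNMAPPED;
}
```
.DE
.PP
For a program whose base text and overlays each fit in one register,
this model places the overlay region at 020000 and the data segment at
040000, which agrees with the addresses in the namelist shown later in
this paper.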
.SH
System management of 0430 and 0431 processes
.PP
There are three major aspects of system handling of the new form
of processes:
.IP 1)
The
.I exec
and related system calls must know how to establish such processes,
and how to detect that they will not fit.
.IP 2)
The
.I estabur
and
.I sureg
mechanisms must know how to set up the segmentation registers for such
processes.
This interface must be modifiable by a system call to permit the currently
chosen overlay to be mapped.
.IP 3)
The scheduler and swapping mechanisms must understand these processes
and allow for enough core space for them (for example, they use more
core than the first 8 word header alone would indicate).
To simplify this, we have chosen, in this implementation,
to swap the basic text and all overlay text for such processes as one piece.
.PP
The considerations here are relatively straightforward, and will
not be discussed in more detail.
.SH
Loader changes
.PP
We have added two new options to the loader, and made modifications
as necessary to make the overlay loading as transparent as possible.
This modified loader is called
.I covld
and is identical to the normal loader with the addition of the following
two options:
.IP \fB\-Z\fR
marks the beginning of an overlay.  The routines in the files up to the
next
.B \-Z
or
.B \-L
option are placed in the next overlay (in numerical order).
.IP \fB\-L\fR
marks the end of all overlays.  The rest of the routines go into the
base segment.\(dg
.FS
\(dg \fBL\fR was chosen because it was unused and can be thought
of as ``library''.  The \fBZ\fR option has no mnemonic value.
.FE
.PP
Here is a sample
.I covld
command which loads the
.I ex
editor into a base segment and four overlays:\(dd
.FS
\(dd The \fB\-lcov\fR library differs from \fB\-lc\fR in that it is compiled
with
.I ovcc
instead of
.I cc .
There is only a one line difference in the source for
.I ovcc
and
.I cc :
.I ovcc
uses a one word larger stack mark (it stacks overlay numbers of return
addresses in the extra word).  Unfortunately, this requires that the
library routines also allocate and preserve this extra word if they
are to live in overlays or call overlaid routines which may cause
overlay switching.  Thus, for generality, we have them always save
and restore this number.
.FE
.DS
covld \-X /lib/crt0.o \-n\e
    \-Z ex_addr.o ex_cmds.o ex_cmds2.o ex_cmdsub.o ex_re.o ex_set.o ex.o\e
    \-Z ex_vadj.o ex_vmain.o ex_voperate.o ex_vwind.o ex_vops3.o\e
    \-Z ex_v.o ex_vget.o ex_vops.o ex_vops2.o ex_vput.o\e
    \-Z ex_get.o ex_io.o ex_temp.o ex_tty.o\e
    \-L ex_put.o ex_subr.o printf.o strings.o doprnt.o\e
       ex_data.o termlib/termlib.a \-lcov
.DE
and a (modified version of the)
.I size
command run on the resulting
.I a.out
file yields:
.DS
16000+(15808,14848,13632,9216)+3202+7320 = 26522b = 063632b (69504 total text)
.DE
.PP
We have designed this overlay structure to use two segmentation registers
(each register maps 8192 bytes) for the root segment, and two registers
for whichever overlay is currently mapped.  This leaves four segmentation
registers: three, which could map 24K bytes of data, bss and dynamic space,
and one, which could map 8K bytes of stack.
.PP
As normally loaded for an 11/70, this version of the editor uses 64000 bytes
of text space.  The additional 5K bytes are taken up by interface code
to handle the overlays, which we will describe shortly.
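.PP
The arithmetic behind these figures can be checked with a few lines of
present-day C.  The function names are hypothetical; the sizes come from
the
.I size
output above, and the sketch assumes a non-separate I/D machine with
eight segmentation registers of 8192 bytes each.
.DS
```c
#include <assert.h>
#include <stdint.h>

#define SEGSIZE 8192u   /* bytes mapped by one segmentation register */
#define NSEG    8       /* segmentation registers available */

/* Registers needed to map nbytes. */
unsigned nregs(uint32_t nbytes)
{
    return (nbytes + SEGSIZE - 1) / SEGSIZE;
}

/* Total text kept by the system: base text plus every overlay. */
uint32_t total_text(uint32_t base, const uint32_t *ovl, int novl)
{
    uint32_t sum = base;
    int i;

    assert(novl >= 0 && novl <= 7);
    for (i = 0; i < novl; i++)
        sum += ovl[i];
    return sum;
}

/* Size of the largest overlay, which fixes the shared region. */
uint32_t max_ovl(const uint32_t *ovl, int novl)
{
    uint32_t m = 0;
    int i;

    for (i = 0; i < novl; i++)
        if (ovl[i] > m)
            m = ovl[i];
    return m;
}

/* Registers left for data and stack once text is mapped. */
unsigned regs_left(uint32_t base, const uint32_t *ovl, int novl)
{
    unsigned used = nregs(base) + nregs(max_ovl(ovl, novl));

    assert(used <= NSEG);
    return NSEG - used;
}
```
.DE
.PP
With a base of 16000 bytes and overlays of 15808, 14848, 13632 and 9216
bytes, the total is 69504 bytes, which is 5504 bytes more than the 64000
bytes of the 11/70 version, and four registers remain for data and stack.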
.PP
One other point to be noted is that the namelist (symbol table) format
for the
.I a.out
file has been changed slightly for the overlay loaded
.I a.out 's.
Previously, there was an unused byte in the format (basically, the high
byte of the ``type'' field of the namelist), and this field is now used
to contain the segment number where overlay routines reside.
Consider the following files:
.DS
---x0.c---
main()
{
    foo();
    foobar();
}
base() { }
---x1.c---
foo()
{
    base();
    ov2();
}
ov1() { }
---x2.c---
foobar()
{
    base();
    ov1();
}
ov2() { }
.DE
An appropriately modified
.I nm
command on a file loaded via:
.DS
covld \-X \-t \-n /lib/crt0.o  x0.o \-Z x1.o \-Z x2.o \-L ovcsv.o \-lc
.DE
produces the following output:
.br
.ne 5
.ID
000310 T __cleanu
040000 D __ovno
000256 T _base
040010 B _environ
000312 T _etext
000272 T _exit
000312 T _foo     1
000334 T _foobar  2
000232 T _main
000400 T _ov1     1
000356 T _ov2     2
000030 T cret
000136 f crt0.o
000010 T csv
000272 f cuexit.o
000001 a exit
000001 a exit
000310 f fakcu.o
000045 a getovly
000000 a indir
000030 T ovcret
000000 T ovcsv
000000 f ovcsv.o
000106 T ovhndlr
040012 B savr5
000136 T start
000232 f x0.o
020000 f x1.o
020000 f x2.o
000256 t ~base
020000 t ~foo
020000 t ~foobar
000232 t ~main
020024 t ~ov1
020024 t ~ov2
.DE
Note that the addresses for
.I _foo
and
.I _foobar
appear in the base segment.  In fact, the true routines appear in
the overlaid segment (at
.I ~foo
and
.I ~foobar )
but the base segment contains an interface routine, which both ensures
that they are mapped into the address space before transferring to them
and also allows function pointers to exist and work as normal.
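.PP
The segment number column in the listing above comes from the high byte
of the namelist ``type'' word.  A minimal sketch of how a tool might
split that word is given here in present-day C; the helper names are
hypothetical, and treating 0 as ``base segment'' is an assumption
suggested by the listing rather than something stated above.
.DS
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for the 16-bit type word of a namelist entry. */

/* Low byte: the ordinary symbol type bits, unchanged. */
unsigned sym_type(uint16_t n_type)
{
    return n_type & 0377;
}

/* High byte: overlay number, assumed 0 for the base segment. */
unsigned sym_ovno(uint16_t n_type)
{
    return (n_type >> 8) & 0377;
}

/* Build a type word from ordinary type bits and an overlay number. */
uint16_t make_type(unsigned type, unsigned ovno)
{
    assert(type <= 0377 && ovno <= 7);
    return (uint16_t)((ovno << 8) | type);
}
```
.DE
.PP
Because only the previously unused byte is touched, the low byte can
still be read as before.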
.SH
Thunks
.PP
To interface each routine which is in an overlay to the outside world,
we add some interface code which we (somewhat abusing the terminology) call
a ``thunk''.  This code has the following form:
.DS
_foo:
        mov     *$__ovno,r0
        cmp     r0,$foo's_ovno
        bne     1f
2:
        jmp     *$~foo
1:
        jsr     pc,ovhndlr
.DE
Thus there are 18 bytes of interface code for each overlaid routine.
.PP
Note that the interface code places the
.I previous
current overlay number, which is kept by the software in
.I __ovno ,
in register 0.  In fact, the overlay handler
.I ovhndlr
also leaves it there.  If the current overlay is not the
correct overlay, then the
.I ovhndlr
corrects this by issuing a system call, and then returns 8 bytes
before where it normally would, i.e. at the label ``2'' in the thunk.
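.PP
The byte counts involved can be checked with a short sketch in
present-day C.  The instruction sizes are those of the PDP-11 encodings
of the five instructions shown; the function names are hypothetical, and
the idea that thunks are packed one after another is an assumption drawn
from the sample namelist, not a documented property of
.I covld .
.DS
```c
#include <assert.h>
#include <stdint.h>

/* Byte sizes of the five thunk instructions, in order. */
enum {
    SZ_MOV = 4,     /* mov *$__ovno,r0 */
    SZ_CMP = 4,     /* cmp r0,$ovno    */
    SZ_BNE = 2,     /* bne 1f          */
    SZ_JMP = 4,     /* jmp *$~foo      */
    SZ_JSR = 4      /* jsr pc,ovhndlr  */
};

#define THUNK_SIZE (SZ_MOV + SZ_CMP + SZ_BNE + SZ_JMP + SZ_JSR)

/* Address of the n-th thunk (n = 0, 1, ...) if thunks are packed
 * one after another starting at first (an assumption). */
uint16_t thunk_addr(uint16_t first, unsigned n)
{
    return (uint16_t)(first + n * THUNK_SIZE);
}

/* Offset of the jmp (label 2) from the start of the thunk. */
unsigned jmp_offset(void)
{
    return SZ_MOV + SZ_CMP + SZ_BNE;
}

/* Where ovhndlr resumes: 8 bytes before its normal return address. */
uint16_t ovhndlr_resume(uint16_t thunk)
{
    uint16_t normal_return = (uint16_t)(thunk + THUNK_SIZE);

    assert(THUNK_SIZE - 8 == SZ_MOV + SZ_CMP + SZ_BNE);
    return (uint16_t)(normal_return - 8);
}
```
.DE
.PP
Starting from 000312, successive thunks in this sketch fall at 000312,
000334, 000356 and 000400, the values printed for _foo, _foobar, _ov2
and _ov1 in the sample namelist.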
.PP
The C save and restore sequences and the handler for use with overlaid
text register programs are coded as follows:
.br
.ne 5
.ID
.ta 8n 16n 24n 32n 40n 48n 56n
/ C register save and restore -- version 7/75
/ modified by wnj && cbh 6/79 for overlayed text registers
/ we define ovcsv and ovcret which overlay routines call
/ even though ovcret is (just now) the same as cret
/ the loader finagles the .o files so this happens
\&.globl	csv
\&.globl	cret
\&.globl	ovcsv, ovcret, ovhndlr
\&.globl  __ovno
\&.globl  _etext
\&.data
__ovno:	0
\&.text
indir=0
getovly=45		/ nabbing one of the two free slots, sigh...
/ csv for routines in overlays
/ since the current overlay is now stored in __ovno
/ these routines have to save out of r0 where old one was stashed
ovcsv:
	mov	r0,-(sp)
	mov	r5,r0
	mov	sp,r5
	jbr	1f
/ only base segment routines call csv, and when it is called
/ no one has worked on the current overlay, so we just save the number
/ of our caller
csv:
	mov	r5,r0
	mov	sp,r5
	mov	__ovno,-(sp)	/ overlay is extra (first) word in mark
/ rest is old code common with ovcsv
1:
	mov	r4,-(sp)
	mov	r3,-(sp)
	mov	r2,-(sp)
	jsr	pc,(r0)		/ jsr part is sub $2,sp
ovcret:		/ same as cret, i think
cret:
	mov	r5,r2
/ get the overlay out of the mark, and if it is non-zero
/ make sure it is the currently loaded one
	mov	-(r2),r4
	bne	1f		/ zero is easy
2:
	mov	-(r2),r4
	mov	-(r2),r3
	mov	-(r2),r2
	mov	r5,sp
	mov	(sp)+,r5
	rts	pc
/ not returning to base segment, so check that the right
/ overlay is loaded, and if not ask UNIX for help
1:
	cmp	r4,__ovno
	beq	2b		/ lucked out!
/ if return address is in base segment, then nothing to do
	cmp	(r5),$_etext
	blt	2b
/ returning to wrong overlay --- do something!
	mov	r4,3f		/ seg no. desired
	mov	r4,__ovno
/ intr. routines may run between these, so should force segment __ovno
	sys	indir; 2f
/ could measure switches[ovno][r4]++ here
	jmp	2b
\&.data
2:	sys	getovly; 3: ..
\&.text
/ ovhndlr makes the argument (in r0) be the current overlay
/ and returns in a funny way, subtracting 8 off the return address
/ to return to the jmp instruction in the thunk
ovhndlr:
	mov	r0,3b		/ seg no. desired
	mov	__ovno,r1
	mov	r0,__ovno
/ intr. routines may run between these, so should force segment __ovno
	sys	indir; 2b
	mov	r1,r0		/ old overlay number
/ could measure switches[ovno][r4]++ here
	mov	(sp)+,r1
	jmp	-8.(r1)		/ to jmp sp->sovalue in thunk
.DE
.PP
One subtle point here is that routines which are in overlays
are made to call the routines
.I ovcsv
and
.I ovcret
rather than
.I csv
and
.I cret .
This allows for faster saving of the previous overlay value in this case.
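.PP
The bookkeeping done on call and return can be followed with a rough
model in present-day C, applied to the sample files x0.c, x1.c and x2.c
shown earlier.  This is not the code above restated: the functions
enter, leave and run_main are hypothetical, getovly here merely counts
requests, and each routine is told explicitly which overlay its caller
lives in, standing in for the return address test against _etext.
.ID
```c
#include <assert.h>

int cur_ovno;   /* models __ovno: the overlay currently mapped */
int nswitch;    /* number of simulated getovly requests */

/* Stand-in for the getovly system call: map overlay n. */
void getovly(int n)
{
    assert(n >= 1 && n <= 7);
    cur_ovno = n;
    nswitch++;
}

/* Entry to a routine in overlay ovno (0 means the base segment).
 * Models the thunk plus ovcsv, or plain csv for base routines.
 * Returns the overlay number to be kept in the stack mark. */
int enter(int ovno)
{
    int prev = cur_ovno;

    if (ovno != 0 && ovno != prev)
        getovly(ovno);
    return prev;
}

/* Return from a routine, modelling cret.  caller is the overlay
 * holding the return address, 0 if it lies in the base segment. */
void leave(int mark, int caller)
{
    if (mark == 0 || mark == cur_ovno || caller == 0)
        return;
    getovly(mark);
}

void base(int caller) { int m = enter(0); leave(m, caller); }
void ov1(int caller) { int m = enter(1); leave(m, caller); }
void ov2(int caller) { int m = enter(2); leave(m, caller); }

void foo(int caller)
{
    int m = enter(1);

    base(1);
    ov2(1);
    leave(m, caller);
}

void foobar(int caller)
{
    int m = enter(2);

    base(2);
    ov1(2);
    leave(m, caller);
}

/* Body of main() from x0.c; returns the number of switches. */
int run_main(void)
{
    cur_ovno = 0;
    nswitch = 0;
    foo(0);
    foobar(0);
    return nswitch;
}
```
.DE
.PP
In this model the sample program requests an overlay six times: once on
each entry to an overlay routine that is not already mapped, and once on
each return into an overlay that has been displaced.  Returns into the
base segment never request a switch.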
.SH
Future improvements
.PP
If interactive debugging of 0430 and 0431 files is to occur, then
.I adb
must be changed to deal with the new format of the
.I a.out
files.  We have not yet made the needed changes.
.PP
The mechanism here substantially improves the capability of a large class
of small 11's.  For machines with small amounts of real memory, it would
be nice if the text images of these 0430 and 0431 files did not
have to be completely resident to run.  Thus the individual overlays
could be swapped rather than being made part of the larger text segment.
This appears substantially more difficult to implement than the
present mechanism, for two reasons:
.IP 1)
It is a major change to the text mechanism, basically allowing
more than one text per process, and making the amount of core
required by a process much more dynamic.  Care must be taken in changing
the text mechanism of the system to allow this.
.IP 2)
Substantially more changes are needed to the scheduling algorithm in
the system to assign appropriate priorities to a new class of objects:
overlay text portions which are not currently ``active''.  It
seems pointless to implement this scheme if they are simply
abandoned as soon as they become free.  We suggest that they be
given an ``abandon'' priority which keeps them just longer than slow terminal
I/O waits.
.PP
We have implemented the segment switching primitive as a system call.
It could instead be implemented as a different and special trap (e.g. an
emulator trap, EMT) with a special handler.  This
would make segment switching as much as ten times faster.
