Ch02 - Assemblers - SystemSoftware | ChienI Kao = Nothing is true, everything is permitted

# Introduction

Design and implementation of assemblers

Some common fundamental functions such as:
- translating mnemonic operation codes to their machine language equivalents
- assigning machine addresses to symbolic labels used by the programmer.
The features and design of an assembler depend heavily upon the source language it translates and the machine language it produces.
Machine dependent/independent features

# Basic Assembler Functions

Example of SIC assembler language program
Assembler Directive: 在組語程式中，用來告訴 Assembler 某些資訊或是該做哪些動作，不屬於 CPU 的指令
- START : specify name and starting address for the program.
- END : indicate the end of the source program and (optionally) specify the first executable instruction in the program.
- Variable/Constant Declaration
  - BYTE
  - WORD
  - RESB
  - RESW
The line numbers are for reference only and are not part of the program

# A Simple SIC Assembler

The translation of source program to object code requires us to accomplish the following functions
(not necessarily in the order given)
1. Convert mnemonic operation codes to their machine language equivalents
  - translate STL to 14 (line 10)
2. Convert symbolic operands to their equivalent machine addresses
  - translate RETADR to 1033 (line 10)
3. Build the machine instruction in the proper format.
4. Convert the data constants specified in the source program into their internal machine representations
  - translate EOF to 454F46 (line 80)
5. Write the object program and the assembly listing.

# Forward Reference & Two Passes

The instruction at line 10 contains a forward reference.
- Line 10 存取 RETADR ，但 RETADR 在 line 95 才定義
Because of forward reference, most assemblers make two passes over the source program.
- The first pass:
  - does little more than scan the source program for label definitions and assign addresses.
- The second pass:
  - performs most of the actual translation previously described.
In addition, the assembler must process statements called assembler directives (or pseudo-instructions).

# Simple Object Program

The simple object program format we use contains three types of records.
- Header
  - program name
  - starting address
  - length
- Text
  - translated instructions
  - data
  - indication of the addresses where these are to be loaded
- End
  - marks the end of the object program
  - specifies the address in the program where execution is to begin

Header record:

Col. 1: H
Col. 2-7: Program name
Col. 8-13: starting address of object program
Col. 14-19: length of object program in bytes

Text record:

Col. 1: T
Col. 2-7: starting address for object code in this record
Col. 8-9: length of object code in this record in bytes
Col. 10-69: object code, represented in hexadecimal

Object code 的部分可以容納 60 個 symbol（每個 4 bit），若以 byte 計算就是 30 bytes。
一個 instruction 長 3 bytes，所以可以放 10 個 instructions
在指令不連續時（通常是中間有宣告變數），會在沒填滿的情況下使用新的 text record。

End record:

Col.1: E
Col. 2-7: address of first executable instruction in object program

There is no object code corresponding to addresses 1033-2038.
This storage is simply reserved by the loader for use by the program during execution.

# Two Passes

Pass 1 (define symbols):

Assign addresses to all statements in the program.
Save the values (addresses) assigned to all labels for use in Pass 2.
Perform some processing of assembler directives.
- e.g., determining the length of data areas defined by BYTE, RESW, etc.)

Pass 2 (assemble instructions and generate object program):

Assemble instructions (translating operation codes and looking up addresses)
Generate data values defined by BYTE, WORD, etc.
Perform processing of assembler directives not done during Pass 1.
Write the object program and the assembly listing.

# Assembler Algorithm and Data Structures

Major internal data structures of simple assembler:

Operation code table (OPTAB)
Symbol table (SYMTAB)
Location Counter (LOCCTR)
- Initialized to the beginning address specified in the START statement.

# Operation Code Table (OPTAB)

Contains (at least) the mnemonic operation code and its machine language equivalents
- In more complex assemblers, this table also contains information about instruction format and length.
OPTAB is usually organized as a hash table, with mnemonic operation code as the key.
- The information in OPTAB is predefined when the assembler itself is written.

Example of OPTAB

Mnemonic	Machine Language
`ADD`	18
`ADDF`	58
`ADDR`	90
…	…

Operation code table 的內容不太會變動
使用目的是用來加快 mnemonic $\Rightarrow$ $\Rightarrow$ machine language 的速度
- 需要使用查詢速度很快的資料結構
所以會使用 Hash Table 來實作 OPTAB
- 查詢的平均複雜度是 $O(1)$

# Symbol Table (SYMTAB)

Includes the name and value (address) for each label in the source program
- together with flags to indicate error conditions
  - e.g., a symbol defined in two different places
It may also contain other information about the data area or instruction labeled
- for example, its type or length.
During Pass 1
- labels are entered into SYMTAB as they are encountered in the source program
- along with their assigned addresses.
During Pass 2
- symbols used as operands are looked up in SYMTAB
  $\Rightarrow$ the addresses to be inserted in the assembled instructions.
SYMTAB is usually organized as a hash table
- for efficiency of insertion and retrieval.
Since entries are rarely deleted from this table, efficiency of deletion is not an important consideration.

Example: SYMTAB of SIC Program

Label	Location
`COPY`	1000
`FIRST`	1000
`CLOOP`	1003
`ENDFIL`	1015
`EOF`	102A
`THREE`	102D
…	…

對 hash table 來說，insertion (插入) 和 retrieval (查詢)，平均複雜度都是 $O(1)$

# Hash Table

Care should be taken in the selection of a hashing function.
- Programmers often select many labels that have similar characteristics
  - for example, labels that start or end with the same characters (like LOOP1, LOOP2, LOOPA).
- It is important that the hashing function used perform well with such non-random keys.
  - 對於任意的 key，hash 的結果越不容易重複越好
- Division of the entire key by a prime table length often gives good results.

# Algorithms of Passes

Pass 1 usually writes an intermediate file

contains each source statement together with its assigned addresses, error indicators, etc.
This file is used as the input to Pass 2.

Pseudo Code of Pass1

	begin
	read first input line
	if OPCODE = 'START' then
	begin
	save #[OPERAND] as starting address
	initialize LOCCTR to starting address
	write line to intermediate file
	read next input line
	end {if START}
	else
	initialize LOCCTR to 0
	while OPCODE != 'END' do
	begin
	if this is not a comment line then
	begin
	if there is a symbol in the LABEL field then
	begin
	search SYMTAB for LABEL
	if found then
	set error flag (duplicate symbol)
	else
	insert (LABEL, LOCCTR) into SYMTAB
	end {if symbol}
	search OPTAB for OPCODE
	if found then
	add 3 {instruction length} to LOCCTR
	else if OPCODE = 'WORD ' then
	add 3 to LOCCTR
	else if OPCODE = 'RESW' then
	add 3 * #[OPERAND] to LOCCTR
	else if OPCODE = 'RESB' then
	add #[OPERAND] to LOCCTR
	else if OPCODE = 'BYTE' then
	begin
	find length of constant in bytes
	add length to LOCCTR
	end {if BYTE}
	else
	set error flag (invalid operation code)
	end {if not a comment}
	write line to intermediate file
	read next input line
	end {while not END}
	write last line to intermediate file
	save (LOCCTR - starting address) as program length
	end {Pass 1}

Presudo Code of Pass2

	begin
	read first input line {from intermediate file}
	if OPCODE = 'START' then
	begin
	Write listing line
	read next input line
	end {if START}
	write Header record to object program
	initialize first Text record
	while OPCODE != 'END' do
	begin
	if this is not a comment line then
	begin
	search OPTAB for OPCODE
	if found then
	begin
	if there is a symbol in OPERAND field then
	begin
	search SYMTAB for OPERAND
	if found then
	store symbol value as operand address
	else
	begin
	store 0 as operand address
	set error flag (undefined symbol)
	end
	end {if symbol}
	else
	store 0 as operand address
	assemble the object code instruction
	end {if opcode found}
	else if OPCODE = 'BYTE' or 'WORD' then
	convert constant to object code
	if object code will not fit into the current Text record then
	begin
	write Text record to object program
	initialize new Text record
	end
	add object code to Text record
	end {if not comment }
	write listing line
	read next input line
	end {while not END}
	write last Text record to object program
	write End record to object program
	write last listing line
	end {Pass 2}

# Machine-Dependent Assembler Features

Example of SIC/XE Program
The assembler directive BASE is used in conjunction with base relative addressing.
- 用 BASE 讓 assembler 知道 base register 的值
- Assembler 必須知道 base 的值，才能計算指令中的 displacement
If the displacements required for relative addressing are too large to fit into a 3-byte instruction
$\Rightarrow$ the 4-byte extended format (Format 4) must be used.
- The bit e is set to 1 to indicate extended instruction format.
- The extended instruction format is specified with the prefix + added to the operation code in the source statement.
- It is the programmer’s responsibility to specify this form of addressing when it is required.
Advantages of the more advanced SIC/XE architecture to improve the execution speed of the program
- Register-to-register instructions are faster than register-to-memory operations.
- When using immediate addressing, the operand is already present as part of the instruction and need not be fetched from anywhere.
- The use of indirect addressing often avoids the need for another instruction.

# Instruction Formats and Addressing Modes

The conversion of register mnemonics to numbers can be done with a separate table
- however, it is often convenient to use the symbol table for this purpose.
- To do this, SYMTAB would be preloaded with the register names and their values.
If extended format is not specified, the assembler first attempts to translate the instruction using program-counter relative addressing.
- If this is not possible (because the required displacement is out of range), the assembler then attempts to use base relative addressing.
- If neither form of relative addressing is applicable and extended format is not specified, then the instruction cannot be properly assembled.
  - In this case, the assembler must generate an error message.

# Displacement Calculation

The computation that the assembler needs to perform is essentially the target address calculation in reverse.
- $disp = TA - (B)$
- $disp = TA - (PC)$

PC Relative

The program counter is advanced after each instruction is fetched and before it is executed.
- Thus during the execution of an instruction, program counter will contain the address of the next instruction.
The assembler knows what the contents of the program counter will be at execution time.

Base Relative

The base register is under control of the programmer.
The programmer must tell the assembler what the base register will contain during execution
$\Rightarrow$ $\Rightarrow$ so that the assembler can compute displacements.
- This is done with the assembler directive BASE .
- NOBASE
- These assembler directives produce no executable code.
- The programmer must provide instructions that load the proper value into the base register during execution.

# Program Relocation

It is often desirable to have more than one program at a time sharing the memory and other resources of the machine.
- In such situation the actual starting address of the program is not known until load time.
- Some address protions of the program need to be changed, while some parts (such as constants) should remain the same regardless of where the program is loaded.
- Looking at the object code alone, it is in general not possible to tell which values represent addresses and which represent constant data items.
Since the assembler does not know the actual location where the program will be loaded, it cannot make the necessary changes in the addresses used by the program.
However, the assembler can identify for the loader those parts of the object program that need modification.

Example

We assembled this program using a starting address of 0000 .
- The JSUB instruction from line 15 is loaded at address 0006 .
  The address field of this instruction contains 01036 , which is the address of the instruction labeled RDREC .
- Now, suppose we want to load this program beginning at address 5000 .
  The address of the instruction labeled RDREC is then 6036 .
  Thus the JSUB instruction must be modified as shown to contain this new address.

We can solve the relocation problem in the following way:
- When the assembler generates the object code, it will insert the address of relative to the start of the program.
- The assembler will also produce a command for the loader, instructing it to add the beginning address of the program to the address field at load time.
Modification record
- Col. 1: M
- Col. 2-7: Starting location of the address field to be modified, relative to the beginning of the program.
- Col. 8-9: Length of the address field to be modified, in half-bytes
The length is stored in half-bytes (rather than bytes) because the address field to be modified may not occupy an integral number of bytes. (ex. the address field in the JSUB we considered above occupies 20 bits, which is 5 half-bytes)
If this field occupied an odd number of half-bytes, it is assumed to begin in the middle of the first byte at the starting location.

M00000705 for line 15

# Machine-Independent Assembler Features

# Literal (字定)

  - A literal is identified with the prefix `=`, which is followed by a specification of the literal value, using the same notation as in the *BYTE* statement.

The effect of using a literal is exactly the same as if the programmer had defined the constant explicitly and used the label assigned to the constant as the instruction operand.
Difference between a literal and an immediate operand
- With immediate addressing, the operand value is assembled as part of the machine instruction.
- With a literal, the assembler generates the specified value as a constant at some other memory location.
All of the literal operands used in a program are gathered together into one or more literal pools. Normally literals are placed into a pool at the end of the program.
In some case, it is desirable to place literals into a pool at some other location in the object program. We use the assembler directive LTORG .
- When the assembler encounters a LTORG statement, it creates a literal pool that contains all of the literal operands used since the previous LTORG (or the beginning of the program).
- This literal pool is placed in the object program at the location where the LTORG directive was encountered.
If we had not used LTORG statement, literals would be placed in the pool at the end of the program. This means that the literal operand would be placed too far away from the instruction referencing it to allow program-counter relative addressing.
Most assemblers recognize duplicate literals — that is, the same literal used in more than one place in the program — and store only one copy of the specified data value.

Literal Table LITTAB

This table contains the literal name, the operand value and length, and the address assigned to the operand when it is placed in a literal pool.
LITTAB is often organized as a hash table, using the literal name or value as the key.
Pass 1:
- The assembler searches LITTAB for the specified literal name (or value). If the literal is already present in the table, no action is needed; if it is not present, the literal is added to LITTAB (leaving the address unassigned)
- When pass 1 encounters a LTORG statement or the end of the program, the assembler makes a scan of the literal table. At this time each literal currently in the table is assigned an address (unless such an address has already been fill in).
- As these addresses are assigned, the location counter is updated to reflect the number of bytes occupied by each literal.
Pass 2:
- The operand address for use in generating object code is obtained by searching LITTAB for each literal operand encountered.

# Symbol-Defining Statements

Symbol EQU value
The value may be given as a constant or as any expression involving constants and previously defined symbols.

One common use of EQU is to establish symbolic names that can be used for improved readability in place of numeric values.
1. ```
+LDT  #4096
```
2. ```
MAXLEN  EQU  4096
+LDT   #MAXLEN
```
The resulting object code is exactly the same as in the original version of the instruction; however, the source statement is easier to understand. It is also much easier to find and change the value of MAXLEN.
Another common use of EQU is in defining mnemonic names for registers.
```
A  EQU 0
X  EQU 1
L  EQU 2
```

# `ORG` value

Where value is a constant or an expression involving constants and previously defined symbols.

When this statement is encountered during assembly of a program, the assembler resets its location counter to the specified value.

Consider a symbol table STAB with 100 entries, the SYMBOL field contains a 6-byte user-defined symbol; VALUE is a one-word representation of the value assigned to the symbol; FLAGS is a 2-byte field that specifies symbol type and other information.
- ```
STAB  RESB  1100
SYMBOL   EQU  STAB
VALUE    EQU   STAB+6
FLAGS   EQU   STAB+9
LDA   VALUE, X
```
- However, this method of definition simply defines the labels; it does not make the structure of the table as clear as it might be.
- ```
STAB   RESB   1100
       ORG    STAB
SYMBOL   RESB  6
VALUE    RESW  1
FLAGS    RESB  2
       ORG    STAB+1100
```
- The last ORG statement is very important. It sets LOCCTR back to its previous value.
- In some assemblers the previous value of LOCCTR is automatically remembered, so we can simply write
```
ORG
```
All symbols used on the right-hand side of the statement—that is, all terms used to specify the value of the new symbol—must have been defined previously in the program. (e.g., the following example would not be allowed)
- ```
BETA   EQU   ALPHA
ALPHA  RESW  1 
```

# Expressions

Individual terms in the expression may be constants, user-defined symbols, or special terms (e.g., the value of the location counter)
The values of terms and expressions are either relative or absolute.
- A constant is, of course, an absolute term.
- Labels on instructions and data areas, and references to the location counter value, are relative terms.
An expression that contains only absolute terms is, of course, an absolute expression.
Absolute expressions may also contain relative terms provided the relative terms occur in pairs and the terms in each such pair have opposite signs.
- None of the relative terms may enter into a multiplication or division operation.
A relative expression is one in which all of the relative terms except one can be paired as described above; the remaining unpaired relative term must have a positive sign.

Absolute, relative, or error? (in Figure 2.10)

0030   RETADR   RESW   1      Relative      // Absolute
0036   BUFFER   RESB   4096   Relative      // Absolute
1036   BUFFEND  EQU    *                    // Relative
1000   MAXLEN   EQU    BUFEND-BUFFER        // Absolute
BUFEND+BUFFER     // Error
100-BUFFER        // Error
3*BUFFER          // Error

We need a flag in the symbol table to indicate type of value (absolute or relative) in addition to the value itself.

# Program Blocks

We use the term program blocks to refer to segments of code that are rearranged within a single object program unit.
The assembler directive USE indicates which portions of the source program belong to the various blocks.
Figure 2.11
- The first (unnamed) program block contains the executable instructions of the program.
- The second (named CDATA) contains all data areas that are a few words or less in length.
- The third (named CBLKS) contains all data areas that consist of larger blocks of memory.
The assembler will (logically) rearrange these segments to gather together the pieces of each block. These blocks will then be assigned addresses in the object program, with the blocks appearing in the same order in which they were first begun in the source program.
The assembler accomplishes this logical rearrangement of code by maintaining, during Pass 1, a separate location counter for each program block.
When labels are entered into the symbol table, the block name or number is stored along with the assigned relative address.
At the end of Pass 1 the latest value of the location counter for each block indicates the length of that block. The assembler can then assign to each block a starting address in the object program.
For code generation during Pass 2, the assembler simply adds the location of the symbol, relative to the start of its block, to the assigned block starting address.
Figure 2.12
- Notice that the value of the symbol MAXLEN (line 107) is shown without a block number. This indicates that MAXLEN is an absolute symbol, whose value is not relative to the start of any program block.
- Line 20
We can immediately see that the separation of the program into blocks has considerably reduced the addressing problems. Because the large buffer area is moved to the end of the object program, we no longer need to use extended format instructions on lines 15, 35, and 65.
Program readability is often improved if the definitions of data areas are placed in the source program close to the statements that reference them.
The use of program blocks is one way of satisfying both of these requirements, with the assembler providing the required reorganization.
It is not necessary to physically rearrange the generated code in the object program to place the pieces of each program block together. The assembler can simply write the object code as it is generated during Pass 2 and insert the proper load address in each Text record. The loader will simply load the object code from each record at the indicated address.
- Figure 2.13, Figure 2.14

# Control Sections and Program Linking

A control section is a part of the program that maintains its identity after assembly; each such control section can be loaded and relocated independently of the others.
When control sections form logically related parts of a program, it is necessary to provide some means for linking them together.
Instructions in one control section might need to refer to instructions or data located in another section. The assembler is unable to process these references in the usual way. The assembler has no idea where any other control section will be located at execution. Such references between control sections are called external references.

# Figure 2.15

CSECT
Three control sections: COPY , RDREC , WRREC
The assembler establishes a separate location counter for each control section, just as it does for program blocks.
EXTDEF (external definition for external symbol)
EXTREF (external reference)

# Figure 2.16

Line 15:

15    0003    CLOOP     +JSUB	RDREC		4B100000

Line 160

160   0017              +STCH	BUFFER,X	57900000

Line 190
```
190   0028    MAXLEN    WORD	BUFEND-BUFFER	000000
```
- The assembler has no idea where the control section containing RDREC will be loaded, so it cannot assemble the address for this instruction. Instead the assembler inserts an address of zero and passes information to the loader, which will cause the proper address to be inserted at load time.
- The address of RDREC will have no predictable relationship to anything in this control section therefore relative addressing is not possible. Thus an extended format instruction must be used to provide room for the actual address to be inserted.
The assembler must also allow the same symbol to be used in different control sections.

Define record

Col. 1: D
Col. 2-7: name of external symbol defined in this control section
Col. 8-13: relative address of symbol within this control section
Col. 14-73: Repeat information in Col. 2-13 for other external symbols.

Refer record

Col. 1: R
Col. 2-7: name of external symbol referred to in this control section
Col. 8-73: names of other external reference symbols

New format of the Modification record type

Col. 1: M
Col. 2-7: starting address of the field to be modified
Col. 8-9: length of the field to be modified, in half-bytes
Col. 10: Modification flag (+ or -)
Col. 11-16: External symbol whose value is to be added to or subtracted from the indicated field.

# Figure 2.17

Our earlier definitions required that all of the relative terms in an expression be paired (for an absolute expression), or that all except one be paired (for a relative expression). We must now extend this restriction to specify that both terms in each pair must be relative within the same control section.
When an expression involves external references, the assembler cannot in general determine whether or not the expression is legal. Whether the terms occur in the same control section is unknown at assembly time. In such a case, the assembler generates Modification records so the loader can check the expression for errors.