



# When Things Go Wrong: Abends in Your Assembler Program and How You Can Recover From Them

(All You Need to Know to Write Your First ESTAE)

Vit Gottwald CA Technologies

August 10, 2012





#### **Agenda**

- Introduction
  - Basic Hardware Terms
  - Instruction Execution Loop
  - Interrupts
- Recovery
  - Program Error
  - Recovery/Termination Manager
  - ESTAE
  - z/OS Control Blocks
  - Special Considerations
- References





#### **Basic terms**

- Storage
  - Programs
  - Data
  - Low Core (first 8K of storage)
- CPU
  - 16 General Purpose Registers
  - Program Status Word (instruction pointer)
- Instruction
  - Operation code
  - Operands
  - Length





#### **Instruction Execution Loop**

Sequential



- How does the CPU know the instruction length?
  - First two bits of operation code
    - 00 instruction is 2 bytes long
    - 01 or 10 instruction is 4 bytes long
    - 11 instruction is 6 bytes long
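The length rule above can be sketched in a few lines of Python. This is only an illustration of the decoding step; the function name `instruction_length` is made up for this example.

```python
def instruction_length(first_opcode_byte: int) -> int:
    """Return the instruction length in bytes, decided by the first two opcode bits."""
    top_two_bits = (first_opcode_byte >> 6) & 0b11
    if top_two_bits == 0b00:
        return 2
    if top_two_bits == 0b11:
        return 6
    return 4  # 0b01 or 0b10


# Opcodes of instructions used in the sample program of this deck
print(instruction_length(0x07))  # BR (BCR), opcode X'07'
print(instruction_length(0x90))  # STM, opcode X'90'
print(instruction_length(0xC0))  # LARL, first opcode byte X'C0'
```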





#### **Instruction Execution Loop**



 Branch type instructions replace the instruction address in PSW





#### **Instruction Execution Loop**



## What does the hardware do to handle the interrupt?



- Save into Low Core
  - Current PSW
  - PSW extension
    - Interrupt code
    - Instruction Length Code (ILC)
  - TEA, BEAR (discussed later)
- Load from Low Core
  - New PSW assigned to the type of interrupt that occurred

Hardware INTERRUPT handling
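The save/load sequence can be modelled as a toy in Python. The dictionary keys (`program_old_psw`, `program_new_psw`, and so on) and the PSW values are hypothetical names and numbers chosen for this sketch; they are not real Low Core offsets or contents.

```python
def take_program_interrupt(cpu: dict, interruption_code: int, ilc: int) -> None:
    """Toy model: save the current PSW and its extension, then load the new PSW."""
    low_core = cpu["low_core"]
    low_core["program_old_psw"] = cpu["psw"]          # save current PSW
    low_core["interruption_code"] = interruption_code  # PSW extension
    low_core["ilc"] = ilc                              # PSW extension
    cpu["psw"] = low_core["program_new_psw"]           # control passes to the FLIH


# Hypothetical values, for illustration only
cpu = {"psw": 0x3AD00F94, "low_core": {"program_new_psw": 0x00FD0000}}
take_program_interrupt(cpu, interruption_code=0x01, ilc=2)
print(hex(cpu["psw"]), hex(cpu["low_core"]["program_old_psw"]))
```

The point of the model is that the hardware does nothing clever: it stores a few fields and swaps the PSW, and everything else is done by the routine the new PSW points to.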





#### **Interrupts**

- Each interrupt type has its own fields in Low Core
  - old-PSW
  - new-PSW
- First Level Interrupt Handler (FLIH)
  - Routine pointed to by instruction address in new-PSW
- Interrupt types
  - Restart, External, Machine Check, I/O
  - SVC
  - Program Check
    - CPU recognized problem in execution of an instruction
    - Categorized by Program Interruption Code (PIC)





#### **Program Interruption Code (PIC)**





## **RTM terminology**





#### **Program error**

- Hardware detected (subset of Program Checks)
  - Results in 0Cx ABENDs
  - Not every P.C. is a program error
    - e.g. PIC 11 page fault may or may not be a program error
  - FLIH decides whether the program check is or is not an error
  - If the P.C. is considered an error, FLIH passes control to RTM(1)
- Software detected
  - Either a z/OS component or a user program detects that it cannot successfully continue and chooses to terminate abnormally
  - Implemented through ABEND macro call → causes an entry to RTM(2)
  - Typically the ABEND code has the form xNN
    - NN is the hex SVC number of the z/OS service that detected the problem





#### Recovery/Termination manager (RTM)

- Receives control early after the discovery of a program error (or when a program ends normally)
- Passes control to appropriate recovery routine (if present)
- If recovery is not successful and any of the following DD statements is present
  - //SYSUDUMP, //SYSABEND, or //SYSMDUMP: requests documentation of the error by calling z/OS dump services (SNAP macro)
- Handles the final termination of the program
  - Closing any open datasets
  - Freeing memory
  - Releasing ENQs

When RTM(2) gets control, an RTM2WA control block is created



#### **Recovery routine**

- Responsible for
  - Fixing the error and giving the failing program another chance (retry)
  - Documenting the error, cleaning up resources, and continuing with termination process (percolate)
- Two basic types
  - ESPIE to handle Program Checks with PIC 1-F hex
    - Receives control from RTM(1)
  - ESTAE-like to handle ABENDs
    - Receives control from RTM(2)
    - RTM(1) passes control to RTM(2) through an ABEND macro call (0A0D) when the last RTM(1) recovery routine percolates



## **Extended Specify Task Abnormal Exit** (ESTAE)



- Established through ESTAE macro
- Gets control from RTM through a SYNCH macro call (0A0C)
- Communicates with RTM via SDWA
- At entry receives pointers to
  - Parameter specified by the user on the ESTAE macro call (in R2)
  - System Diagnostic Work Area (SDWA) (in R1)





#### System Diagnostic Work Area (SDWA)

- May not be available, check if R0 equals 0C hex
- Contains the ABEND information
- Can be updated directly or through SETRP macro call
  - Several Fields to read:
    - SDWAABCC, SDWACRC, SDWAEC1, SDWAILC1, SDWAINC1, SDWAGRSV, SDWAFLGS, SDWATRAN, SDWABEA, ...
  - Several Fields to write:
    - SDWASR00 to SDWASR15, ...
- IHASDWA macro generates SDWA dsect with comments





#### **SETRP** macro

- Used by recovery routine to communicate with RTM(2)
  - SETRP fills in SDWA fields as specified by the parameters
- Sample usage
  - Choose whether to retry (RC=4) or percolate (RC=0)
  - Specify the retry address (RETADDR=)
  - Restore retry registers from SDWA (RETREGS=YES)
  - Request/Discard user dump (DUMP=YES/NO/IGNORE)
- See [3] for detailed description





#### **Physical SDWA structure**

- SDWA has extensions
  - SDWAXPAD (X'170') points to an extension made up of pointers to other extensions
- Main body and pointers extension always exist
- The others may not
  - e.g. 64-bit extension present if
    - ESTAEX was used
    - SDWALOC31=YES specified on ESTAE
- Physically the SDWA is ordered:
  - main body
  - recordable extensions
  - pointers extension
  - non-recordable extensions







### **Example to show how it works**





#### **Very Simple Example**

1. Establish an ESTAE
2. Cause a Program Check by branching to 7FFFFFFE hex
3. Recovery routine gets control and sets retry registers
4. Retry
5. Disable the ESTAE
6. Cause an S0C1 ABEND by DC H'0'
   - ESTAE no longer defined → RTM proceeds with termination
   - Register content displayed in the "diagnostic dump" in file 1





#### Very Simple Sample, cont'd

| Label | Operation | Operands          | Comment                                                   |
|-------|-----------|-------------------|-----------------------------------------------------------|
|       | COPY      | ASMMSP            | ENABLE STRUCTURED PROGRAMMING MACROS                      |
|       | SYSSTATE  | ARCHLVL=2         | USE Z/ARCHITECTURE INSTRUCTIONS                           |
|       | ASMMREL   | ON                | USE RELATIVE BRANCHING                                    |
| SAUTH | CSECT     |                   |                                                           |
| SAUTH | AMODE     | 31                | ABOVE THE LINE TO GET BEAR                                |
| SAUTH | RMODE     | ANY               |                                                           |
|       | STM       | 14,12,12(13)      |                                                           |
|       | LARL      | 8,RECOVERY        | RECOVERY ROUTINE ADDRESS                                  |
|       | LARL      | 9,RETRY           | RECOVERY ROUTINE PARAMETER ADDRESS                        |
|       | ESTAEX    | (8),CT,PARAM=(9)  | ESTABLISH ESTAE                                           |
|       | LHI       | 15,-2             | MAX EVEN 31 BIT ADDRESS -> S0C4-11, SEE BOTH TEA AND BEAR |
|       | BR        | 15                | BRANCH TO HELL (PSW USELESS)                              |
| RETRY | DS        | 0H                |                                                           |
|       | ESTAEX    | 0                 | REMOVE THE ESTAE                                          |
|       | DC        | H'0'              | INVALID OPERATION CODE -> S0C1-01                         |





#### Very Simple Sample, cont'd

```
RECOVERY DS    0H
         IF (CHI,0,EQ,X'0C')           Q. SDWA MISSING?
           WTO 'SDWA MISSING'          MAY CHANGE REGISTERS 0,1,14,15
           SR  15,15                   PERCOLATE
           BR  14                      RETURN TO RTM
         ENDIF
         STM   14,12,12(13)            SAVE REGISTERS
         LR    3,1                     SAVE POINTER TO SDWA
         USING SDWA,3                  MAP SYSTEM DIAGNOSTIC WORK AREA
*        (SEE NEXT SLIDE AND INCLUDE ITS CODE HERE)
         SETRP RC=4,RETADDR=(2),WKAREA=(3),RETREGS=YES,FRESDWA=YES
         DROP  3
         LM    14,12,12(13)            LOAD REGISTERS
         BR    14                      RETURN TO RTM
         IHASDWA                       GENERATE SDWA DSECT
         END   SAUTH                   END ASSEMBLY
```





#### Very Simple Sample, cont'd

```
         SR    0,0
         ST    0,SDWASR00              ZEROS INTO R0
         MVC   SDWASR01,SDWAABCC       SAVE ABEND CODE IN R1
         MVC   SDWASR03,SDWATRAN       SAVE TRANSLATION EXCEPTION ADDRESS
         L     4,SDWAXPAD              ADDRESS OF SDWA EXTENSION POINTERS
         USING SDWAPTRS,4
         L     5,SDWASRVP              RECORDABLE EXTENSION
         USING SDWARC1,5
         MVC   SDWASR02,SDWACRC        SAVE REASON CODE
         DROP  5
         L     6,SDWAXEME              64-BIT EXTENSION
         USING SDWARC4,6
         MVC   SDWASR04,SDWABEA+4      SAVE BREAKING EVENT ADDRESS-31
         DROP  6
         DROP  4
```





#### **Very Simple Example – retry registers**

- The ESTAE routine set the retry registers as follows
  - Zeros into R0
  - ABEND code into R1
  - Reason code into R2
  - Translation Exception Address into R3
  - Breaking Event Address into R4





#### **Translation Exception Address (TEA)**

- Location 168-175<sub>10</sub> in Low Core
- Filled in when a page- or segment-translation exception occurs (PIC 10 and 11)
- Bits 0-51 contain address of the page we tried to access
- Bits 52-63 are undefined, not part of the address!!!
- Provided in SDWA and RTM2WA
  - Low 32 bits provided in SDWATRAN
  - Full 64 bit in SDWA 64 bit extension (SDWATRNE) and also in RTM2TRNE
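Because bits 52-63 are undefined, a recovery routine should mask them off before treating the TEA as an address. A minimal Python sketch of that masking, with a made-up input value and a hypothetical function name:

```python
# Keep bits 0-51 (the upper 52 bits of the 64-bit value), clear bits 52-63
PAGE_MASK = 0xFFFF_FFFF_FFFF_F000


def failing_page(tea: int) -> int:
    """Return the address of the page the program tried to access."""
    return tea & PAGE_MASK


# The low 12 bits of the input are arbitrary here; they must not be trusted
print(hex(failing_page(0x0000_0000_7FFF_FABC)))  # 0x7ffff000
```

The result matches the page address that the sample program later finds in R3 after its wild branch.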





#### **Breaking Event Address Register (BEAR)**

- 8-byte CPU register
- When a branch type instruction is executed, its address is placed in the breaking-event-address register
- When a program interruption occurs, the current contents of the BEAR are placed into Low Core locations 110-117 hex
- Provided in 64 bit SDWA extension (SDWABEA)
- Also available in RTM2BEA
- Priceless for debugging "wild branches"!



JOB01893 IEA995I SYMPTOM DUMP OUTPUT 746

SYSTEM COMPLETION CODE=0C1 REASON CODE=00000001

TIME=04.24.31 SEQ=27557 CPU=0000 ASID=023E

PSW AT TIME OF ERROR 078D0000 BAD00F94 ILC 2 INTC 01

ACTIVE LOAD MODULE ADDRESS=3AD00F48 OFFSET=00000

NAME=GO

DATA AT PSW 3AD00F8E - 00840A3C 0000A70E 000CA774

GR 0: 00000001\_000000000 1: 00000000\_840C4000

2: 00000000\_00000011 3: 00000000\_7FFFF000

4: 00000000\_007BBFE0 7: 00000000\_007FF370

6: 00000000\_3AD00F94 9: 00000000\_B0F8C

A: 00000000\_3AD00F94 9: 00000000\_007FF370

C: 00000000\_80C979D2 D: 00000000\_0006F60

E: 00000000\_80FD04BB F: 00000000\_00000000

END OF SYMPTOM DUMP

JOB01893 IEF450I Y8VSMPL RUNPGM - ABEND=S0C1 U0000 REASON=0000001 747

IEF404I Y8VSMPL - ENDED - TIME=04.24.31

JOB01893 \$HASP395 Y8VSMPL ENDED






#### **Best practices**

- Establish your recovery routine when your routine gets control from the system, an exit, or another application
- Remove the recovery routine before returning to the caller
- Make sure you free the SDWA
  - e.g. by issuing SETRP FRESDWA=YES
- Learn about TCB and RB chains and how they relate to recovery routines (especially the difference between ESTAE and ESTAI processing)
- Be careful when dealing with Linkage Stack, see IEALSQRY macro





#### **Multiple ESTAEs**

- When your program establishes multiple ESTAEs and an ABEND occurs
  1. The most recently defined ESTAE routine gets control
  2. When it decides to percolate, the previously defined ESTAE gets control
  3. Ditto
  4. ...
- ESTAE is represented by a STAE Control Block (SCB)
- SCBs form a stack (LIFO) with the newest SCB on the top
- When an ESTAE percolates its SCB is removed from the stack and control is passed to the next one on the top
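The LIFO behaviour can be modelled with a plain Python list standing in for the SCB stack. This is a sketch of the control flow only: the names `drive_recovery`, `RETRY` and `PERCOLATE` are invented for the example, and the choice to leave a retrying routine on the stack is an assumption that matches the sample program, which has to remove its ESTAE explicitly after the retry.

```python
RETRY, PERCOLATE = "retry", "percolate"


def drive_recovery(scb_stack: list, abend_code: str) -> str:
    """Call recovery routines newest-first until one retries or none are left."""
    while scb_stack:
        routine = scb_stack[-1]            # newest SCB is on top of the stack
        if routine(abend_code) == RETRY:
            return RETRY                   # assumption: routine stays established
        scb_stack.pop()                    # percolation removes the SCB
    return "terminate"                     # nobody recovered: termination proceeds


called = []

def outer(code):
    called.append("outer")
    return RETRY

def inner(code):
    called.append("inner")
    return PERCOLATE

stack = [outer, inner]                     # inner was established last
print(drive_recovery(stack, "S0C4"), called)
```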





#### **Other Recovery Routine Types**

- ESTAI
  - Subtask recovery
  - Defined on ATTACH(X) macro with ESTAI= parameter
- Associated Recovery Routine (ARR)
  - Recovery for abends in PC routines
  - Defined on ETDEF macro with ARR= parameter, or through the IEAARR macro
- Functional Recovery Routine (FRR)
  - Recovery in SRB routines, disabled or authorized programs
  - Defined through SETFRR macro, SCHEDULE with FRR=YES, or IEAMSCHD with FRRADDR=





#### **Final Tips**

- Recovery should be part of the application design. Adding it later can cause a lot of trouble and headaches.
- Read carefully "Providing recovery" in [1], especially the section called "Special considerations" if you plan to code recovery routine for your product.
- If you think you finally understand it, read it again!
- Do not underestimate the subject, and write a test for every scenario to make sure you really understand it.



#### References

- [1] MVS Programming Assembler Services Guide (SA22-7605)
- [2] MVS Programming Assembler Services Reference (SA22-7606)
- [3] MVS Data Areas (GA32-0853 to GA32-0858)
- [4] Principles of Operation (SA22-7832)
- [5] MVS Control Blocks, Hank Murphy, McGraw Hill 1995
- [JB] Joachim von Buttlar, "System z Architecture" [big, but worth reading, skip the IBM propaganda at the beginning], http://public.dhe.ibm.com/software/dw/university/systemz/SystemzArchitectureCourse.pdf
- [EJ] Ed Jaffe, How to Make Assembler Programs Easier to Read and Maintain Using Structured Programming Macros, https://share.confex.com/share/115/webprogram/Handout/Session7175 (Structured Assembler.pdf)



# Please do not forget to fill in the evaluation forms.











## Additional content (unsorted)





#### z/OS Dispatcher Control Blocks














#### z/OS control blocks

- Piece of storage that has a meaning to z/OS
- Described in the IBM manual "MVS Data Areas", Vol. 1 to Vol. 6
  - Not very verbose, useful if you know what you are looking for and are familiar with z/OS (MVS) terminology





### z/OS control blocks – PSA, CVT

- Prefixed Save Area (PSA)
  - Prefix Area contains several fields that have hard wired addresses in the CPU for interrupt handling. The rest is used by FLIH and various other components of z/OS
  - In z/OS terminology Prefix Area is called Prefixed Save Area
  - Contains pointers to other control blocks
    - Task Control Block (TCB) at offset 21C hex
    - Address Space Control Block (ASCB) at offset 224 hex
    - Communication Vector Table (CVT) at offset 10 hex
- Communication Vector Table (CVT)
  - Anchor to most if not all z/OS control blocks!





### z/OS control blocks - ASCB, TCB

- Address Space Control Block (ASCB)
  - Represents single instance of virtual storage to z/OS (recall MVS = Multiple Virtual Storage)
  - Usually one ASCB per job
- Task Control Block (TCB)
  - Represents unit of work to z/OS (a task)
  - Think of a "task" being a "thread" in PC/UNIX terminology
  - It is an anchor to all resources z/OS allocated on behalf of the task, when TCB is removed, all resources for the task are deallocated





### z/OS control blocks - PRB, SVRB

- Request Block (PRB, SVRB, IRB)
  - While TCB represents a unit of work to z/OS, RB represents a particular item we want z/OS to do on behalf of our task
  - When we request a particular program to be run, Program Request Block is created
  - When our program wants to use operating system services, it issues a suitable SVC and a SerVice Request Block is created
  - External interrupt may generate an asynchronous exit routine to be run (e.g. IRB created for STIMER exit routine)
  - The sequence of Request Blocks is called an RB chain; it is chained off the TCB in the reverse order of creation





#### **RB Chain**

- TCB at offset 0 contains a fullword pointer to the most recently created RB
- Each RB points to the previously created RB
- Last RB in the chain (the first created) points back to the TCB
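The three rules above describe a singly linked list that closes back on the TCB. A small Python model of that structure follows; the class and function names are invented for this sketch and do not correspond to real control block field names.

```python
class Tcb:
    def __init__(self):
        self.newest_rb = None        # models the fullword pointer at TCB offset 0


class Rb:
    def __init__(self, kind, link):
        self.kind = kind             # e.g. "PRB", "SVRB", "IRB"
        self.link = link             # previously created RB, or the TCB for the first RB


def add_rb(tcb, kind):
    """Create a new RB and put it at the head of the chain."""
    link = tcb.newest_rb if tcb.newest_rb is not None else tcb
    tcb.newest_rb = Rb(kind, link)
    return tcb.newest_rb


def rb_chain(tcb):
    """Walk the chain from the newest RB until the link points back to the TCB."""
    kinds = []
    node = tcb.newest_rb
    while isinstance(node, Rb):
        kinds.append(node.kind)
        node = node.link
    return kinds


tcb = Tcb()
add_rb(tcb, "PRB")                   # program is started
add_rb(tcb, "SVRB")                  # program issues an SVC
print(rb_chain(tcb))                 # newest first
```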







#### TCB chain

- A TCB is created by the ATTACH macro and removed by DETACH
- A program running under a TCB can request further TCBs to be created -> multi-threaded application
- Here the mother task a) attached three daughter tasks (subtasks) b), c), and d), in that order



- TCBLTC (+88) points to the subtask the current TCB attached last
- TCBOTC (+84), not shown in the picture, points to the parent task
- TCBNTC (+80) points to the task attached previously by the parent task
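How the TCBLTC, TCBOTC and TCBNTC pointers fit together is easiest to see in a small model. The Python sketch below mimics the mother task a) attaching b), c) and d); the attribute names `ltc`, `otc` and `ntc` simply mirror the field names, and `attach` and `subtasks` are hypothetical helpers, not system services.

```python
class Tcb:
    def __init__(self, name):
        self.name = name
        self.ltc = None   # like TCBLTC: subtask this TCB attached last
        self.otc = None   # like TCBOTC: parent (originating) task
        self.ntc = None   # like TCBNTC: task the parent attached previously


def attach(mother, name):
    """Model of ATTACH: the new subtask goes to the front of the mother's chain."""
    daughter = Tcb(name)
    daughter.otc = mother
    daughter.ntc = mother.ltc
    mother.ltc = daughter
    return daughter


def subtasks(mother):
    """List subtasks by following LTC once and then the NTC pointers."""
    names = []
    task = mother.ltc
    while task is not None:
        names.append(task.name)
        task = task.ntc
    return names


a = Tcb("a")
for name in ("b", "c", "d"):
    attach(a, name)
print(subtasks(a))   # most recently attached first
```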





#### How does RTM receive control?

- Through an ABEND macro call (SVC 13, 0A0D)
  - Terminates either current TCB or the job step TCB in the current address space
- Through a CALLRTM macro call
  - TYPE=ABTERM
    - a "super" version of ABEND
    - Allows terminating a task (TCB=) in the current or another address space
  - TYPE=MEMTERM
    - Terminates an address space without giving control to task level recovery routines and resource managers





### **Recovery/Termination macros**

- CALLRTM
  - TYPE=ABTERM is used by CANCEL operator command
  - TYPE=MEMTERM is used by the FORCE operator command
  - You definitely want to stay away from it, supervisor state and key 0 is required to do a CALLRTM





### **Recovery/Termination macros**

- ABEND
  - Generates an SVC 13 (0A0D)
  - Also has a branch entry
  - Allows you to specify
    - ABEND code (12 bits), separate values for System/User ABEND
    - Reason code (REASON=, 32 bits) passed to recovery routines
    - Dump options
      - DUMP request a dump
      - DUMPOPT parm. list for the SNAP macro
    - Scope of the ABEND
      - STEP if specified, the job step TCB is terminated, if not specified, the default is to terminate the current TCB
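The two 12-bit ABEND codes travel together in one fullword. The Python sketch below decodes such a word, assuming the usual layout of one flag byte followed by the 12-bit system code and the 12-bit user code; the value 840C4000 is the one the sample program's symptom dump shows in R1, and the function name is invented for the example.

```python
def decode_completion_code(word: int) -> str:
    """Format a completion code word as Sxxx (system, hex) or Unnnn (user, decimal)."""
    system_code = (word >> 12) & 0xFFF   # 12 bits after the flag byte
    user_code = word & 0xFFF             # low-order 12 bits
    if system_code:
        return f"S{system_code:03X}"
    return f"U{user_code:04d}"


print(decode_completion_code(0x840C4000))  # S0C4
print(decode_completion_code(0x00000064))  # U0100
```

System codes are conventionally shown in hex and user codes in decimal, which is why the two branches format differently.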





#### RTM1 and RTM2

- RTM is composed of two parts
  - RTM1 aka "System Level RTM"
  - RTM2 aka "Task Level RTM"
- RTM1
  - Entered via CALLRTM (e.g. from FLIH for an erroneous P.C.)
  - Runs under the environment of the failing program
  - ESPIE registers a low-overhead recovery routine with RTM1
- RTM2
  - Entered via ABEND macro call either from RTM1 or directly
  - Runs as a z/OS subroutine (RB created, 0A0D)
  - ESTAE registers with RTM2 (another RB created when called)





#### **Termination**

- Releasing all resources acquired by the task being terminated
- RTM calls Resource Managers to do the actual cleanup
  - Closing any open datasets
  - Freeing memory
  - Releasing ENQs
  - ...
- Performed for both normal and abnormal program end





#### **ESTAE** macro

- Assume you are writing your first ESTAE routine for your very simple program to recover from a B37 system ABEND
- You will use
  - EXIT ADDR: address of the recovery routine
  - PARM\_LIST: parameter list passed to the recovery routine when it is invoked by RTM
  - CT: create, as opposed to OV, which overrides an existing ESTAE





## **Virtual Storage**

- Virtual storage
  - Introduced in S/370 in early 1970's
  - Each "application" (address space) can use the full range of addresses available on the architecture independently of all other applications
  - Implemented in hardware via Dynamic Address Translation
  - VIRTUAL ADDRESSES translated into REAL ADDRESSES





## z/Architecture Virtual Storage














### z/Architecture Virtual, Real, Absolute

- How to handle this with multiple CPUs?
- Prefix register
  - 64 bits, bits 0-32 are always 0
  - Used for assigning a range of real addresses 0-1FFF to a different block in absolute storage for each CPU
  - The mechanism is called Prefixing, the storage Prefix Area





# z/Architecture Prefixing







## z/Architecture Prefixing



Say Prefix register value in CPU1 is 6000, then

- Real Addresses 0-1FFF are translated to Absolute Addresses 6000-7FFF
- Real Addresses 6000-7FFF are translated to Absolute Addresses 0-1FFF
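The swap performed by prefixing can be written as a short Python function. This is a sketch of the rule described on this slide for an 8K prefix area; the function name `real_to_absolute` is made up for the example.

```python
PREFIX_AREA_SIZE = 0x2000   # 8K, real addresses 0-1FFF


def real_to_absolute(real: int, prefix: int) -> int:
    """Apply prefixing: swap the first 8K block with the block the prefix designates."""
    if real < PREFIX_AREA_SIZE:
        return real + prefix                      # low block -> this CPU's prefix area
    if prefix <= real < prefix + PREFIX_AREA_SIZE:
        return real - prefix                      # prefix block -> absolute 0-1FFF
    return real                                   # everything else is unchanged


# CPU1 from the slide: prefix register value 6000 hex
print(hex(real_to_absolute(0x0000, 0x6000)))  # 0x6000
print(hex(real_to_absolute(0x7FFF, 0x6000)))  # 0x1fff
print(hex(real_to_absolute(0x9000, 0x6000)))  # 0x9000
```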












# **General Purpose Registers**

- 16 General (Purpose) Registers (GPR 0 to 15)
  - 64 bits numbered 0 (MSB) to 63 (LSB)
  - Integer arithmetic
  - Address generation/calculation







# z/Architecture Program Status Word





## **ESA/390 Program Status Word**

- So far z/OS doesn't support execution of instructions above the 2GB bar (no room in current control blocks to save all 8 bytes of the instruction address upon an interrupt)
- Usually we still deal with the ESA/390 style PSW in dumps and within various z/OS control blocks



| BA     | Instruction Address |
|--------|---------------------|
| bit 32 | bits 33-63          |





# **Types of Instruction Ending**

- Completion
  - Successful completion or partial completion (for interruptible instructions at a unit of work boundary – CC=3)
  - PSW points to the next sequential instruction
- Suppression
  - As if the instruction just executed was a no-operation (NOP)
  - contents of any result fields, including condition code are not changed
  - PSW points to next sequential instruction





# Types of Instruction Ending, cont'd

- Nullification
  - Same as Suppression but
  - PSW points to the instruction just executed
- Termination
  - causes the contents of any fields due to be changed by the instruction to be unpredictable (some may change, other not)
  - The operation may replace all, part, or none of the contents of the designated result fields and may change the condition code
  - PSW points to the next sequential instruction
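For debugging, the practical consequence of the ending type is whether the failing instruction sits at the PSW address or just before it. The Python sketch below applies that rule to the S0C1 symptom dump from the sample (PSW BAD00F94, ILC 2), taking the ILC as a length in bytes as the dump prints it and treating the invalid operation code as a case where the PSW points to the next instruction. The function name and the `ending` strings are invented for the example.

```python
def failing_instruction_address(psw_address: int, ilc: int, ending: str) -> int:
    """Locate the failing instruction from the PSW address and the ILC (in bytes)."""
    if ending == "nullification":
        return psw_address            # PSW still points to the instruction itself
    return psw_address - ilc          # PSW points to the next sequential instruction


# Second PSW word from the dump; drop the addressing-mode bit to get the address
psw_address = 0xBAD00F94 & 0x7FFFFFFF
print(hex(failing_instruction_address(psw_address, 2, "suppression")))  # 0x3ad00f92
```

The result is consistent with the "DATA AT PSW" line of the dump, where the halfword of zeros (the DC H'0') sits two bytes before the PSW address.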

SHARE in Anaheim