General notes

Prefixes

Macrooperations

Macrooperations usually begin with a BOM flow marker, but many entry points lack it. It looks as if some micro-operations are executed before the MSROM code starts; compare lds with lss.

In lds, TMP2 is used without being initialized first:

UROM_348E  macro_lds: 			;  xref:
UROM_348E	           TMP0 =         LOAD.M0.SC1.DSZ64.0(CONST.4       , TMP2          , GDTR    )
UROM_3490	           TMP1 =         LOAD.M0.SC1.DSZ64.0(CONST.4       , TMP2          , LDTR    )
UROM_3491	           TMPB =         USEGOP3        (CONST.6       , TMP2          , DS      )
UROM_3492	           TMP0 =         _CMOV2.NC      (TMP0          , TMP1          )	; Returns arg2 if cond met ? Difference to CMOV unknown
UROM_3494	           TMP1 =         INTEXTRACT.HI32(TMP0          , TMPB          )
UROM_3495	                          USEGOP1        (TMP1          , TMP0          , CONST.04.010  , DS      )
UROM_3496	EOM        REG_ddd =      MOVE.DSZ?      (CONST.0       , TMP3          , OA.4, U4.00000001 /* U4:0000 0000 0000 0000 0000 0000 0000 0001 */)

lss contains similar code, but it initializes TMP2:

UROM_32A2  macro_lss: 			;  xref:
UROM_32A2	           TMP2 =         LEA.M40.SC1.DSZ32(REG.D0        , REG.D0        , CONST.05.008  , OA.9, U1.1)
UROM_32A4	           TMP3 =         LOAD.M40.SC1.DSZ?.0(REG.D0        , REG.D0        , CONST.05.008  , OA.D, U1.1)
UROM_32A5	           TMP2 =         LOAD.M40.SC1.DSZ16.0(CONST.0F.010  , TMP2          , OA.8, U1.1)
UROM_32A6	           TMP0 =         LOAD.M0.SC1.DSZ64.0(CONST.04.002  , TMP2          , GDTR    )
UROM_32A8	           TMP1 =         LOAD.M0.SC1.DSZ64.0(CONST.04.002  , TMP2          , LDTR    , U2.08)
UROM_32A9	           TMP0 =         _CMOV2.NC      (TMP0          , TMP1          )	; Returns arg2 if cond met ? Difference to CMOV unknown
UROM_32AA	           TMP1 =         INTEXTRACT.HI32(TMP0          , CONST.0       )
UROM_32AC	                          USEGOP4        (TMP1          , TMP0          , CONST.04.02A  , SEG_02  , U2.28)
UROM_32AD	                          UOP.0D4        (CONST.0       , CONST.0       )
UROM_32AE	           REG_ddd =      MOVE.DSZ?      (CONST.0       , TMP3          , OA.4)
UROM_32B0	EOM                       USEGOP2        (CONST.04.005  , CONST.04.005  , SS      , U2.20, U4.00000001 /* U4:0000 0000 0000 0000 0000 0000 0000 0001 */)

Cryptographic routines

sha1_firsttransfrom

Returns the same result as sha1_transform, but initialises the SHA-1 context first.

sha1_transform

Inputs: TMP5 = pointer to block (64 bytes); TMP6 = block count (?); TMP7 = context.

Returns: the pseudohash in TMP registers and in the TMP7 context dwords: TMPA = TMP7[0], TMPB, TMP0, TMP1, TMP8 = TMP7[4].

It is not known how SHA-1 finalisation is performed.

rc4_crypt

TMP5 = pointer to data; TMP6 = data count (UOP.3B4(TMP6, TMPC ???)); TMP7 = key pointer.

A dummy buffer is provided when the output is to be discarded (the initial 0x200-cycle stream roll).
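The register interface above (data pointer, count, key pointer, plus a discarded initial stream) matches the shape of textbook RC4 with a keystream drop. The sketch below is a reference model only: it assumes the microcode routine implements standard RC4, which the notes do not confirm, and the `drop` parameter is a hypothetical stand-in for the 0x200-cycle roll.

```python
def rc4_crypt(key: bytes, data: bytes, drop: int = 0) -> bytes:
    """Textbook RC4. `drop` keystream bytes are generated and discarded first."""
    # Key-scheduling algorithm (KSA): permute S under control of the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]

    # Pseudo-random generation (PRGA): one keystream byte per iteration.
    out = bytearray()
    i = j = 0
    for n in range(drop + len(data)):
        i = (i + 1) & 0xFF
        j = (j + s[i]) & 0xFF
        s[i], s[j] = s[j], s[i]
        k = s[(s[i] + s[j]) & 0xFF]
        if n >= drop:  # bytes before `drop` go to the "dummy buffer"
            out.append(data[n - drop] ^ k)
    return bytes(out)
```

Because RC4 is symmetric, calling the function twice with the same key and the same `drop` returns the original data.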

Microcode update

patch_rc4_keysetup

TMP2 = patch size; TMP3 = patch pointer; TMP7 = key pointer.

FPU macrooperations

FXSAVE/FXRSTOR

The macrooperation saves the x87 FPU, MMX technology, XMM, and MXCSR registers to (FXSAVE), or restores them from (FXRSTOR), the 512-byte memory image specified by the memory operand.

part_macro_fxsave_xmm (686, 6D8, ...) uses an unusual register map:

UROM_260D	           TMPD =         ADD.DSZ8       (TMPD          , CONST_16+0A0  , OA.4)
UROM_260E	                          UOP.432        (CONST_0       , AL  /*XMM0LO*/          , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_2610	                          UOP.F46        (CONST_4       , TMPD          , OA.8, U1.1)

UROM_2611	                          UOP.432        (CONST_0       , TMP6 /*XMM0HI*/         , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_2612	                          UOP.F46        (CONST_04+008  , TMPD          , OA.8, U1.1)
.org 0x2614
.utripletbits 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100

UROM_2614	                          UOP.432        (CONST_0       , CL /*XMM1LO*/            , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_2615	                          UOP.F46        (CONST_04+010  , TMPD          , OA.8, U1.1)
UROM_2616	                          UOP.432        (CONST_0       , TMP7 /*XMM1HI*/         , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_2618	                          UOP.F46        (CONST_04+018  , TMPD          , OA.8, U1.1)
UROM_2619	                          UOP.432        (CONST_0       , DL            , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_261A	                          UOP.F46        (CONST_04+020  , TMPD          , OA.8, U1.1)

SSE patents.google.com/patent/US6721866B2

FXSAVE patent US6898700B2

Info on part_macro_fxsave_alt:

It is probably an optimization similar to "fast strings" (IA32_MISC_ENABLE bit 0), but for the FPU. The documentation on ROB_CR_BKUPTMPDR6 mentions a bit 2 fast-strings flag, which this code tests. It is probably MSR 0x1E0; whether it is set to 0x4 on Pentium still needs to be checked.

It probably speeds up the operation somehow, and it also uses a different ordering. This may have been seen and written up before; the reference still needs to be found.

Info on REG_D0: FXSAVE/FXRSTOR/LDMXCSR/STMXCSR accept only memory operands (in any addressing form), i.e. ModRM.mod != 11. It may be possible to obtain the resulting address from REG_D0 via LEA.M40.SC1.DSZ32(REG.D0 , REG.D0 , CONST_05+008 , OA.9). CONST_05/OA.9 may play a role here.

The numerical value held in REG.D0 is not known. It may be the ModRM byte.

Offset  Size (bytes)  Field       Description
0x000   2             FCW         x87 FPU Control Word
0x002   2             FSW         x87 FPU Status Word
0x004   1             FTW         x87 FPU Tag Word (abridged, 1 bit per register)
0x005   1             RESERVED    Reserved
0x006   2             FOP         x87 FPU Opcode
0x008   4             FPU_IP      x87 FPU Instruction Pointer Offset
0x00C   2             FPU_CS      x87 FPU Instruction Pointer Selector
0x00E   2             RESERVED    Reserved
0x010   4             FPU_DP      x87 FPU Data Pointer Offset
0x014   2             FPU_DS      x87 FPU Data Pointer Selector
0x016   2             RESERVED    Reserved
0x018   4             MXCSR       MXCSR Register State
0x01C   4             MXCSR_MASK  MXCSR Mask
0x020   16            ST0/MM0     x87 FPU / MMX Register 0
0x030   16            ST1/MM1     x87 FPU / MMX Register 1
0x040   16            ST2/MM2     x87 FPU / MMX Register 2
0x050   16            ST3/MM3     x87 FPU / MMX Register 3
0x060   16            ST4/MM4     x87 FPU / MMX Register 4
0x070   16            ST5/MM5     x87 FPU / MMX Register 5
0x080   16            ST6/MM6     x87 FPU / MMX Register 6
0x090   16            ST7/MM7     x87 FPU / MMX Register 7
0x0A0   16            XMM0        XMM Register 0
0x0B0   16            XMM1        XMM Register 1
0x0C0   16            XMM2        XMM Register 2
0x0D0   16            XMM3        XMM Register 3
0x0E0   16            XMM4        XMM Register 4
0x0F0   16            XMM5        XMM Register 5
0x100   16            XMM6        XMM Register 6
0x110   16            XMM7        XMM Register 7
0x120   224           RESERVED    Reserved (Padding to 512 bytes)
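The layout can be checked mechanically. The following is a small sketch that encodes the 32-bit FXSAVE image as (offset, size, name) tuples, verifies that the fields are contiguous and total 512 bytes, and derives the XMM slot offsets that the listing above reaches through TMPD + 0x0A0. The helper names are illustrative and do not come from the microcode.

```python
# (offset, size, name) for the 32-bit FXSAVE image.
FXSAVE_FIELDS = [
    (0x000, 2, "FCW"), (0x002, 2, "FSW"),
    (0x004, 1, "FTW"),       # abridged tag byte per the SDM
    (0x005, 1, "RESERVED"),
    (0x006, 2, "FOP"),
    (0x008, 4, "FPU_IP"), (0x00C, 2, "FPU_CS"), (0x00E, 2, "RESERVED"),
    (0x010, 4, "FPU_DP"), (0x014, 2, "FPU_DS"), (0x016, 2, "RESERVED"),
    (0x018, 4, "MXCSR"), (0x01C, 4, "MXCSR_MASK"),
]
FXSAVE_FIELDS += [(0x020 + 16 * n, 16, f"ST{n}/MM{n}") for n in range(8)]
FXSAVE_FIELDS += [(0x0A0 + 16 * n, 16, f"XMM{n}") for n in range(8)]
FXSAVE_FIELDS += [(0x120, 224, "RESERVED")]


def layout_size(fields):
    """Return the total size, raising if any field leaves a gap or overlaps."""
    pos = 0
    for offset, size, name in fields:
        if offset != pos:
            raise ValueError(f"{name} at {offset:#x}, expected {pos:#x}")
        pos += size
    return pos


def xmm_offset(n: int, high_half: bool = False) -> int:
    """Byte offset of the low or high 64-bit half of XMMn in the image."""
    return 0x0A0 + 16 * n + (8 if high_half else 0)
```

With these definitions, `xmm_offset(0, high_half=True)` is 0x0A8, which lines up with the second store (displacement 008) in the part_macro_fxsave_xmm excerpt.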

See wiki.osdev.org/SSE#FXSAVE_and_FXRSTOR

FXSAVE Instruction Flow

Overview

FXSAVE saves x87 FPU, MMX, and XMM register state to a 512-byte memory area. The instruction requires 16-byte alignment and performs validation before saving.

Execution Flow

Special Cases

Register Split for x87 Stores

XMM Register Split

Weird LOAD/STA/STRD fp opcodes

restore

UROM_26CA	           TMP2 =         UOP.BC9        (CONST_04+030  , TMPB          , OA.8, U1.1)
UROM_26CC	           TMP3 =         UOP.B49        (CONST_04+038  , TMPB          , OA.8, U1.1)
UROM_26CD	           ST1 =          UOP.0E3        (TMP3          , TMP2          )

UROM_26EE	           AL =           UOP.849        (CONST_4       , TMPB          , OA.9, U1.1, U4.00008001 /* U4:0000 0000 0000 0000 1000 0000 0000 0001 */)
UROM_26F0	           TMP6 =         UOP.849        (CONST_04+008  , TMPB          , OA.9, U1.1, U4.00000001 /* U4:0000 0000 0000 0000 0000 0000 0000 0001 */)


save (STRD is 430|380 = mask 7B0; STA is 0x800|380 = mask B80)

UROM_0869	                          STRD.DSZ32     (TMP4          , CONST_0       )
UROM_086A	                          STA.M40.SC1.DSZ32(CONST_04+010  , TMPD          , OA.8, U1.1)


UROM_087A	                          UOP.7B3        (ST0           , CONST_0       )          // STRD for floating point!!!
UROM_087C	                          UOP.F46        (CONST_04+020  , TMPD          , OA.8, U1.1) // some weird STA??
UROM_087D	                          UOP.5B3        (ST0           , CONST_0       )          // second part of ST0
UROM_087E	                          UOP.F46        (CONST_04+028  , TMPD          , OA.8, U1.1) // some weird STA??

...

UROM_1FA2	                          UOP.432        (AL            , CONST_0       , U4.00008002 /* U4:0000 0000 0000 0000 1000 0000 0000 0010 */)
UROM_1FA4	                          UOP.F46        (CONST_4       , TMPD          , OA.8, U1.1)
UROM_1FA5	                          UOP.432        (TMP6          , CONST_0       , U4.00000002 /* U4:0000 0000 0000 0000 0000 0000 0000 0010 */)
UROM_1FA6	                          UOP.F46        (CONST_04+008  , TMPD          , OA.8, U1.1)

ENTER

ENTER has imm16 (allocation) and imm8 (nesting).

- M_IMM loads what should be nesting level

- REG.31 might be allocation size

@macro_entry_ENTER:
TMP1 := 0                             // Initialize frame size counter

// Push current EBP onto stack
[SS:ESP - operand_size] := EBP        // STA: PUSH EBP
TMP2 := ESP - operand_size            // TMP2 = new ESP after PUSH

// Get nesting level from instruction immediate
TMP0 := M_IMM                         // Assumed: imm8 (nesting level)
// or REG.31 contains nesting level?
TMP0 := TMP0 AND 0x1F                 // Mask to 0-31 valid range

if (TMP0 == 0) goto @nesting_zero

@nesting_nonzero:
// Nesting level >= 1: setup for frame pointer copying
TMP5 := ESP - operand_size            // TMP5 = temporary frame pointer
TMP2 := TMP2 - operand_size           // Reserve space for temp frame ptr
TMP0 := TMP0 - 1                      // Decrement nesting level
if (TMP0 == 0) goto @nesting_one

// Nesting level >= 2: copy previous display (frame chain)
TMP3 := EBP - operand_size            // TMP3 = source ptr (previous frames)

@frame_copy_loop:
// Copy (nesting_level - 1) frame pointers from previous frame
TMP4 := [SS:TMP3]                     // Load previous frame pointer
TMP3 := TMP3 - operand_size           // Move to next source frame ptr
[SS:TMP2] := TMP4                     // Push frame pointer to new frame
TMP2 := TMP2 - operand_size           // Adjust destination pointer
TMP0 := TMP0 - 1                      // Decrement loop counter
if (TMP0 != 0) goto @frame_copy_loop

@nesting_one:
// Nesting level was 1: push new frame pointer
TMP0 := EIP + REG.31                  // Calculate return address/offset?
// (REG.31 might be alloc_size immediate)
UOP.0D8(TMP0)                         // Unknown - possibly update internal state
[SS:TMP2] := TMP5                     // Push temp frame pointer
goto @finalize_frame

@nesting_zero:
// Nesting level was 0: no frame pointers to copy
TMP0 := EIP + REG.31                  // Calculate address/offset
UOP.0D8(TMP0)                         // Unknown operation

@finalize_frame:
// Common exit: set new frame pointer and allocate local space
UOP.C0B(TMP1, TMP2, SS)              // Unknown - stack validation/limit check?

EBP := ESP - operand_size             // EBP = new frame pointer (points to saved EBP)
ESP := TMP2 + TMP1                    // ESP = final stack pointer after allocation
// (TMP1=0, so ESP=TMP2=final position)
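For comparison with the reconstructed flow, here is a minimal model of the architectural ENTER behaviour as documented in the SDM (push EBP, optionally copy the display, then allocate). It is a reference sketch, not a transcription of the microcode: the `mem` dictionary, the function name and the 4-byte default operand size are assumptions made for illustration. Note that it subtracts the allocation size at the end, whereas the listing above leaves TMP1 at 0.

```python
def enter(mem, esp, ebp, alloc_size, nesting, opsz=4):
    """Model of ENTER imm16, imm8. `mem` maps stack addresses to values.

    Returns the new (ESP, EBP) pair.
    """
    level = nesting & 0x1F          # nesting level is taken modulo 32

    esp -= opsz                     # push EBP
    mem[esp] = ebp
    frame_temp = esp                # becomes the new frame pointer

    if level > 0:
        src = ebp
        for _ in range(level - 1):  # copy (level - 1) enclosing frame pointers
            src -= opsz
            esp -= opsz
            mem[esp] = mem[src]
        esp -= opsz                 # push the new frame pointer itself
        mem[esp] = frame_temp

    return esp - alloc_size, frame_temp
```

For nesting level 0 the function reduces to the familiar `push ebp; mov ebp, esp; sub esp, imm16` sequence.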

Input

Temporaries

See also

MASKMOVDQU:

UOP.000(ALIAS.014, ALIAS.014)               // Synchronization point

// Read mask register (xmm_sss) into TMP registers
TMP0 := MOVEFROMXMM(REG_xmm_sss_l)          // Mask bytes 0-7 (low 64 bits)
TMP1 := MOVEFROMXMM(REG_xmm_sss_h)          // Mask bytes 8-15 (high 64 bits)

// Validate destination address and prepare streaming store
UOP.F4B(EDI, offset=8)                      // Validate DS:[EDI] and DS:[EDI+8]

// Masked store of low 64 bits (bytes 0-7)
DS:[EDI+0..7] := REG_xmm_ddd_l & TMP0       // Write masked by TMP0

// Masked store of high 64 bits (bytes 8-15)
DS:[EDI+8..15] := REG_xmm_ddd_h & TMP1      // Write masked by TMP1

It's possible the code handles MASKMOVQ as well (see OA.8 in actual code, msrom-6d8 0x3394)
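The pseudocode writes the masking as `&`, but architecturally the mask selects bytes rather than ANDing bit patterns: a destination byte is written only when the most significant bit of the corresponding mask byte is set. A short sketch of that documented behaviour, using a `bytearray` as a stand-in for memory at DS:[EDI] (the function name and memory model are illustrative only):

```python
def maskmovdqu(mem: bytearray, edi: int, src: bytes, mask: bytes) -> None:
    """Byte-masked 16-byte store: write src[i] only where mask[i] bit 7 is set."""
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("src and mask must both be 16 bytes")
    for i in range(16):
        if mask[i] & 0x80:
            mem[edi + i] = src[i]
        # Otherwise the destination byte is left untouched.
```

Splitting the loop into bytes 0-7 and bytes 8-15 gives the two 64-bit halves that the microcode handles through TMP0 and TMP1.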

See www.felixcloutier.com/x86/maskmovdqu

WRMSR

Info from CPUID 652

Patch loading

Microcode updates are delivered through a write to MSR 0x79 (IA32_BIOS_UPDT_TRIG), which is the interface described in the Intel Software Developer Manual. Inside wrmsr_core, the MSR address in TMP0 is compared against 0x79 before the normal MSRPLA dispatch path executes. On a match, control transfers to wrmsr_patch_load rather than performing a standard MSR write.

The first instruction of wrmsr_patch_load calls UOP.203, which is observed elsewhere in the microcode as a privilege level check. If the check fails, execution transfers to generic_macro_fault_gp, which raises #GP(0) via SIGEVENT. This matches the architecturally documented behavior that WRMSR from CPL greater than 0 raises a general protection fault.

The value in TMP1 at this point carries the linear address of the update data buffer. This is consistent with the architectural convention for IA32_BIOS_UPDT_TRIG where EAX holds the address of the update data in memory, though the exact register routing from the WRMSR decode path was not fully traced and should be treated as an inference.

@wrmsr_patch_load_scan

TMP7 is set to zero, initializing the error accumulator that will be used throughout the remainder of the procedure.

TMP0 is then loaded from internal control register CR[0x1BE], whose precise architectural role is not directly recoverable from this code. BTR is applied to bit 3 of TMP0. If bit 3 was already clear, carry is not set and execution falls through immediately to @wrmsr_patch_load_derive_auth_key.

If bit 3 was set, a 64-bit word is loaded from linear memory at the address in TMP1. Bit 8 of the low 32 bits of this word is then tested. If bit 8 is set, TMP0 is forced to zero and the loop restarts from the top, where BTR on a zero value will produce no carry and redirect to @wrmsr_patch_load_derive_auth_key on the next pass. If bit 8 is clear, TMP0 is computed as CONST.14.03C + EDX and execution jumps forward to @wrmsr_patch_load_init_msram_write, bypassing the authentication key derivation entirely.

The precise semantics of CR[0x1BE] bit 3 and the role of bit 8 in the memory word cannot be determined from this code alone without additional context about what writes those values. The branch to @wrmsr_patch_load_init_msram_write that skips authentication is notable and may represent a fast path for a previously validated blob, but this is speculative.

@wrmsr_patch_load_derive_auth_key

This routine constructs a stepping-specific authentication reference value.

Two consecutive 32-bit reads are performed from the internal microcode store bus. The address written to MS_CR_ADDR is formed by concatenating the byte at internal address CONST.0E.036 with the value 0xFC. The two successive reads from MS_CR_DATA yield a 64-bit seed value whose higher word is read into TMP0.

TMP0 is then rotated left by the value in CR_STEPPING. The result has 6 added to it, producing an intermediate value in TMPC. This is then added to the low 32 bits of the first 64-bit quadword loaded from the update buffer at [TMP1], and the sum is masked with 0x9C, yielding a 7-bit index. This index is passed to FREADROM, which reads a value from a table embedded in the microcode ROM itself. The result is stored in TMP6 as the expected authentication reference.

The rotation by CR_STEPPING is the mechanism that makes authentication stepping-specific. The same input data produces a different FREADROM index on different steppings, and therefore a different expected reference value. An update blob that passes authentication on one stepping will produce a divergent expected value on a different stepping and fail.

After the reference value is established, TMPB is set to 0x54 and TMP0 is loaded from CONSTROM.03C. These values serve as the MSRAM write count and starting address respectively, used in the subsequent phase.
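The index derivation described above can be sketched as follows. This is a hedged model: it assumes 32-bit registers with wrap-around addition, takes the 0x9C mask literally from the text, and uses invented names (`rotl32`, `auth_index`). The FREADROM lookup itself is not modelled because the table contents are not given here.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n (mod 32) bits."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def auth_index(seed_hi: int, stepping: int, header_lo: int) -> int:
    """Index passed to FREADROM, per the description above.

    seed_hi   : high word of the 64-bit seed read via MS_CR_DATA (TMP0)
    stepping  : value of CR_STEPPING
    header_lo : low 32 bits of the first quadword of the update buffer
    """
    tmpc = (rotl32(seed_hi, stepping) + 6) & MASK32
    return (tmpc + header_lo) & 0x9C
```

With the same seed and header, different stepping values generally produce different indices, which is the stepping-specific property described above.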

@wrmsr_patch_load_init_msram_write

TMP1 is advanced by 8, moving the buffer pointer past the header quadword. Two MS_CR_ADDR-based write destination values (0x1BC and 0x1BD, corresponding to MS_CR_ADDR and MS_CR_DATA respectively) are conditionally loaded into TMP9 using a MERGE and CMOV sequence that checks both TMP7 (the error accumulator) and TMP2 (a validity flag from the authentication output). If either indicates an error state, TMP9 is set to 0x1FF instead, which is a sentinel value that will cause subsequent writes to target a null sink rather than live MSRAM. This is the mechanism by which a partially authenticated update is prevented from corrupting MSRAM state even if the write loop continues executing.
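The null-sink selection can be summarised in a few lines. This is a simplified sketch that assumes both conditions reduce to "nonzero means error"; the actual MERGE/CMOV encoding is not reproduced, and the function and parameter names are hypothetical.

```python
NULL_SINK = 0x1FF  # sentinel destination described above


def select_write_target(live_target: int, err_accum: int, auth_error: int) -> int:
    """Return the destination for subsequent writes (the value placed in TMP9).

    live_target : the real destination (e.g. 0x1BC or 0x1BD)
    err_accum   : TMP7, the running error accumulator
    auth_error  : nonzero if the TMP2 validity flag signals a problem
    """
    if err_accum != 0 or auth_error != 0:
        return NULL_SINK
    return live_target
```

The design point is that the write loop keeps running in either case; only the destination changes, so a failed update takes the same path without touching live state.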

@wrmsr_patch_load_msram_write_loop, @wrmsr_patch_load_commit_lo32, @wrmsr_patch_load_commit_hi32

The write loop loads 64-bit words from the update buffer at [TMP1] in sequence. For each quadword, the high 32 bits are extracted into TMPA.

If TMP2 indicates the carry flag is set (signaling the cryptographic authenticator has flagged a problem), the load is intercepted and control transfers to sub_patch_cryptfunc before the value is committed. Otherwise, the low 32 bits are written to MSRAM via MOVETOCREG(TMP9, TMP5) in @wrmsr_patch_load_commit_lo32, TMP5 is updated to TMPA, TMPB is decremented, and @wrmsr_patch_load_commit_hi32 writes the high 32 bits and advances TMP1 by 8. The loop continues until TMPB reaches zero.

sub_patch_cryptfunc

This subroutine is called from multiple points during both the MSRAM write phase and the CRBUS update phase. It implements a 37-round (loop count 0x25) stream authenticator.

On entry, TMP4 is loaded with a 64-bit value formed by concatenating TMPC and TMP4. Each round rotates TMPC right by 1. If the shifted-out bit was set (carry clear after ROR), TMPC is XORed with TMP4; otherwise it is XORed with zero. This is a standard Galois LFSR construction where TMP4 holds the feedback polynomial. After 37 rounds, the high 32 bits of TMP4 are XORed into TMPC, then TMP5 is XORed in, then TMP6 (the reference value derived in @wrmsr_patch_load_derive_auth_key) is XORed in, yielding TMP0. TMP6 is then updated to the old TMP5, and TMP5 is updated to TMP0, advancing the running authentication state. The function returns via U_JMP_INDIR to the address saved in TMP3 by the calling TRANSPORTUIP instruction.

Because TMP6 holds the stepping-keyed reference value and it is folded into the running state on every call, every authenticated block in the update blob depends on all previous blocks and on the stepping identity of the target CPU.
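A rough model of one sub_patch_cryptfunc call, following the prose above, is given below. Several points are assumptions: registers are treated as 32 bits wide, the feedback word is taken to be the low 32 bits of TMP4, and feedback is applied when the rotated-out bit is 1 (the carry polarity is not settled by the description). Treat it as a sketch for experimenting with the state-chaining behaviour, not as a verified reimplementation.

```python
MASK32 = 0xFFFFFFFF


def cryptfunc_step(tmpc: int, tmp4: int, tmp5: int, tmp6: int, rounds: int = 0x25):
    """One authenticator step. Returns (new_TMP5, new_TMP6).

    tmpc : 32-bit working register
    tmp4 : 64-bit value; low half used as feedback word, high half folded in
    tmp5 : current running state
    tmp6 : previous state / stepping-keyed reference value
    """
    feedback = tmp4 & MASK32
    tmpc &= MASK32
    for _ in range(rounds):
        out_bit = tmpc & 1
        tmpc = ((tmpc >> 1) | (out_bit << 31)) & MASK32  # ROR by 1
        if out_bit:
            tmpc ^= feedback
    tmp0 = (tmpc ^ (tmp4 >> 32) ^ tmp5 ^ tmp6) & MASK32
    # TMP6 takes the old TMP5, TMP5 takes the new value.
    return tmp0, tmp5 & MASK32
```

Calling the function repeatedly and feeding each result back in reproduces the dependency chain described above: every output depends on all earlier inputs and on the initial TMP6.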

@wrmsr_patch_load_crbus_first_verify

After the MSRAM write loop completes, execution continues into the CRBUS update phase. The first entry is handled separately. TMPB has bit 8 cleared via BTR, and the result is used as an index into FREADROM, yielding an expected reference value. This expected value is subtracted from TMP5 (the current authenticator state), and the result is ORed into TMP7. If TMP5 matched the expected value, the subtraction yields zero and TMP7 is unchanged; any mismatch leaves a nonzero residue that will propagate through TMP7 for the remainder of the procedure.
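The subtract-and-OR idiom is a branch-free sticky error flag. A minimal sketch, assuming 32-bit wrap-around arithmetic (the function name is illustrative):

```python
MASK32 = 0xFFFFFFFF


def accumulate_error(err: int, state: int, expected: int) -> int:
    """OR the difference between state and expected into the accumulator.

    A match contributes zero; any mismatch leaves nonzero bits that can
    never be cleared by later matching comparisons.
    """
    return (err | ((state - expected) & MASK32)) & MASK32
```

Because OR never clears bits, a single final test of the accumulator covers every comparison made along the way.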

@wrmsr_patch_load_crbus_rmw_loop, @wrmsr_patch_load_crbus_load_addr, @wrmsr_patch_load_crbus_apply_mask, @wrmsr_patch_load_crbus_or_newval, @wrmsr_patch_load_crbus_auth_commit

The CRBUS update phase applies a sequence of authenticated read-modify-write operations to internal control bus registers.

@wrmsr_patch_load_crbus_rmw_loop loads the next 64-bit word from the buffer. If sub_patch_cryptfunc signals an authentication failure the write is suppressed. Otherwise @wrmsr_patch_load_crbus_load_addr extracts the low 16 bits of the word into TMP9 as the target CRBUS address and advances to the next quadword for the mask operand.

@wrmsr_patch_load_crbus_apply_mask reads the current value of the CRBUS register at the address in TMP9. If TMP5 is nonzero a fixed sentinel address 0x16F is used instead, effectively reading from a safe location rather than the intended target. The value read is ANDed with the mask from TMP5 to produce a masked current value in TMP8.

@wrmsr_patch_load_crbus_or_newval ORs the new value bits into TMP8. TMPB is decremented and the next quadword is loaded for the authentication check.

@wrmsr_patch_load_crbus_auth_commit performs the final per-entry authentication: it reads an expected value from FREADROM using the current buffer word, subtracts TMP5, and ORs any discrepancy into TMP7. If authentication passed and TMP7 is still zero, MOVETOCREG(TMP9, TMP8) commits the masked write to the CRBUS register. The loop then continues from @wrmsr_patch_load_crbus_rmw_loop until TMPB reaches zero.
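Each entry therefore amounts to a guarded read-modify-write. The sketch below models the CRBUS as a dictionary and omits the authentication itself and the sentinel read address; all names are illustrative assumptions.

```python
def crbus_rmw(crbus: dict, addr: int, mask: int, value: int, err_accum: int) -> bool:
    """Apply (old & mask) | value to crbus[addr] if no error has accumulated.

    Returns True if the write was committed, False if it was suppressed.
    """
    merged = (crbus.get(addr, 0) & mask) | value
    if err_accum != 0:
        return False            # authentication failed earlier: no commit
    crbus[addr] = merged
    return True
```

For example, with a register holding 0xFF, a mask of 0xF0 and a new value of 0x05, a successful commit leaves 0xF5 in the register.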

The CRBUS writes produced by this loop include writes to MS_CR_MATCHPATCH0 (0x1B8), MS_CR_MATCHPATCH1 (0x1B9), and MS_CR_MATCHPATCH2 (0x1BA). These registers hold the microcode ROM fetch addresses that the front end will intercept and redirect to the newly loaded patch content. Writing these registers is therefore the act that activates the patch. Any previously installed patch whose match addresses are overwritten by this write sequence is implicitly deactivated at the same moment, since the match registers no longer point to it.

Error Recovery: @wrmsr_patch_load_invalidate_retry

If the TMP2 carry flag indicates an authentication failure at the end of the CRBUS write phase, execution reaches @wrmsr_patch_load_invalidate_retry. The value 0x3B0 (0xEC shifted left by 2) is subtracted from TMP1, rewinding the buffer pointer, and execution jumps back to @wrmsr_patch_load_derive_auth_key to attempt re-authentication from a different offset. This path also sets the BTS bit on TMP2 to force sub_patch_cryptfunc to produce a known initial state before retry.

If TMP7 is nonzero at the end of the CRBUS phase (meaning at least one authenticated comparison failed), a separate error exit path at 0x18F5 is taken before reaching @wrmsr_patch_load_invalidate_retry. This path writes 0x1FF to MS_CR_MATCHPATCH0, MS_CR_MATCHPATCH1, and MS_CR_MATCHPATCH2. 0x1FF is the all-ones value for the 9-bit match register field, which the front end treats as no-match, meaning no microcode ROM address will be intercepted. An additional internal register at 0x1BB is also written with 0x1FF. This sequence ensures that any partial CRBUS state written during a failed update attempt is fully neutralized before the handler returns, leaving the CPU in the same functional state as if no update had been attempted.
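The constants in this section can be cross-checked with a few lines. The sketch models the CRBUS as a dictionary and uses the register addresses given above; the function name and the label for 0x1BB are placeholders, since that register is not named in these notes.

```python
MATCH_REGS = {
    0x1B8: "MS_CR_MATCHPATCH0",
    0x1B9: "MS_CR_MATCHPATCH1",
    0x1BA: "MS_CR_MATCHPATCH2",
    0x1BB: "unnamed (0x1BB)",   # placeholder label, not from the listing
}
NO_MATCH = (1 << 9) - 1          # all-ones 9-bit field
REWIND = 0xEC << 2               # buffer rewind applied before the retry


def neutralize_match_registers(crbus: dict) -> None:
    """Error exit: point every match register at the no-match value."""
    for addr in MATCH_REGS:
        crbus[addr] = NO_MATCH
```

Both derived constants agree with the values quoted in the text: `NO_MATCH` is 0x1FF and `REWIND` is 0x3B0.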

iret

Pentium M

SUMMARY OF IRET PATHS

IRET entry
│
├─[size check fail]───────────────────────────────→ #GP
│
├─[NT=1]──→ @macro_iret_tss_link
│              → sub_tss_save       (save all regs to old TSS)
│              → @tail_tss_load_continue (clear busy bit in GDT)
│              → tail_tss_load      (load all regs from new TSS)
│                   ├─[CR3 changed] → sub_tlbflush_and_a20
│                   │      ├─[PAE]  → sub_pae_pdpte
│                   │      └─────────────────────────┐
│                   ├─[descriptor invalid]→ #GP      │
│                   └─────────────────────→ @macro_iret_exit
│
├─[pop EIP/CS/EFLAGS] + [EFLAGS merge]
│     ├─[illegal flags]───────────────────────────→ #GP
│     ├─[VM=1]────────→ @macro_iret_v86_return
│     │                     → loc_209E → loc_20A6 → @macro_iret_exit
│     └─[VM=0]────────→ @macro_iret_samepriv
│                           ├─[bad CS]────────────→ #GP
│                           ├─[priv change]
│                           │     → @macro_iret_privchange
│                           │     → loc_20A6 → @macro_iret_exit
│                           └─[same priv]
│                                 → EOM: SIGEVENT 0xE7 (done)
│
└─ @macro_iret_exit: UOP.0D8 commits new EIP to front-end.

How It Works

Based on macro_iret-msrom-6d8.asm.

Subroutine names reflect labels in the microcode listing.

All paths originate at UROM_1BE0 (macro_iret).
Indented "→ sub" means a TRANSPORTUIP-based microcode call (callee
returns via JMP_INDIR back to the encoded return address).


==========================================================
PART 1 — ENTRY AND DISPATCH
==========================================================

macro_iret  (UROM_1BE0)
-----------------------
1. Read an internal size/mode state via UOP.204(CR_at_0E.004).
Subtract REG_OP_Size (the current operand-size attribute).
If the result is zero: the frame size is inconsistent → #GP.
[This checks that the instruction encoding matches the current
stack-frame size expectation.]

2. Write CR_SMM_status = 4  (marks IRET-in-progress for SMM interaction).

3. Read current EFLAGS via UOP.208(TMPA). Test bit 14 (NT flag).
If NT=1 → jump to @macro_iret_tss_link  (hardware task return).
Otherwise continue for normal stack-based IRET.


==========================================================
PART 2 — NORMAL IRET (no NT, no task switch)
==========================================================

Stack frame pop  (UROM_1BEA)
----------------------------
1. Compute stack-pointer stride (TMP9) for 16/32-bit operation size.
2. Speculatively load three values from SS:ESP:
TMP7 = new EFLAGS  (DSZ? = operand-size wide)
TMP2 = new CS selector  (always 16-bit)
TMP3 = new EIP  (DSZ? = operand-size wide)
These three are the architectural IRET frame (EIP, CS, EFLAGS in
stack order low→high).

EFLAGS permission merge  (UROM_1BF4)
--------------------------------------
Determines which EFLAGS bits this IRET may change, based on CPL and
IOPL. The logic applies the SDM rules:
- CPL=0 may change any flag.
- CPL>0 but CPL≤IOPL may change IF.
- CPL>IOPL may not change IF or IOPL.
- VM flag changes are restricted to CPL=0.

Implementation steps:
a. Load base permission mask 0x00254FD5 from ROM.
b. UOP.202: select/merge with alternate mask 0x00254DD5 (differs in
the IF bit position).
c. UOP.203(CONST.14.13E): apply IOPL/CPL comparison from internal
state table to further restrict the mask.
d. UOP.209(CONST.14.0BD): apply VM-mode filter.
e. AND result to 0x1FF (lower EFLAGS bits only, this pass).
f. UOP.204(CONST.14.125): final mask read/merge from state table.
g. Separately compute the bit-19 update (this appears to be VIF rather than VM):
TMP8 = TMP7 << 10 (shift EFLAGS copy; moves IF, bit 9, up to bit 19)
TMPB = UOP.204(TMP8, TMP7) & 0x00080000
(isolates bit 19, which is VIF in the architectural EFLAGS layout;
VM is bit 17)
BTR TMP7, bit 19 (clear bit 19 in the to-be-committed EFLAGS copy)
TMP7 = TMPB | TMP7  (re-insert the corrected bit)
h. Final merge:
writable = TMP5 (mask), frozen = ~TMP5
TMP7 = (TMP5 AND new_flags) OR (~TMP5 AND SystemFlags)
i. Check that VIF|VIP are legal in the result; if both set → #GP.
j. Commit: SystemFlags = TMP7.
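Step h is a standard masked merge, and the two ROM masks from steps a and b can be compared directly. In the sketch below only the two mask constants come from the notes; the function name and the 32-bit width are assumptions.

```python
MASK32 = 0xFFFFFFFF
MASK_BASE = 0x00254FD5   # step a
MASK_ALT = 0x00254DD5    # step b
EFLAGS_IF = 1 << 9


def merge_eflags(current: int, popped: int, writable: int) -> int:
    """Take writable bits from the popped image, keep the rest from current."""
    return ((popped & writable) | (current & ~writable)) & MASK32


# The two masks differ in exactly one bit, the IF position.
assert MASK_BASE ^ MASK_ALT == EFLAGS_IF
```

With `MASK_ALT` as the writable mask, IF survives from the current flags regardless of the popped value; with `MASK_BASE` it is taken from the stack image.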

VM check:  (UROM_1C0D)
UOP.201 extracts VM flag from TMP7.
If VM=1 in new EFLAGS → jump to @macro_iret_v86_return.
Otherwise → jump to @macro_iret_samepriv.


==========================================================
PART 3 — SAME-PRIVILEGE PROTECTED-MODE RETURN
==========================================================

@macro_iret_samepriv  (UROM_2E58)
----------------------------------
TMPC = 0 (clear, means same-privilege).
TMP2 = new CS selector (already loaded from stack in Part 2).

1. Load the 64-bit GDT entry for TMP2 from GDTR.
2. Load the 64-bit LDT entry for TMP2 from LDTR.
3. UOP.263: select GDT or LDT entry based on TI bit in selector.
4. USEGOP4(type=0x59): validate as a code segment descriptor.
If overflow (bad descriptor) → UROM_3A00 (#GP).
5. UOP.CC1: signal pending CS load to segment unit.
6. Advance ESP: ESP_20 += TMP9  (pop the IRET frame off the stack).
7. Test REG.37 bit 18 (privilege-change flag):
If set → jump to @macro_iret_privchange  (outer ring, need SS:ESP pop).
8. UOP.62A(0E.102, TMPB): commit new CS descriptor into descriptor cache.
9. UOP.0D8(TMP0, TMP3): redirect instruction fetch to new CS:EIP.
10. UOP.0D4(0, 0): finalize redirect (pipeline signal).
11. Update REG.31 with new CS generation counter.
12. REG.37 = 0xFF (reset internal state flags).
13. USEGOP2(CS, type 4): mark CS as accessed.
14. EOM: SIGEVENT(TMP3, 0xE7)  → instruction architecturally complete.
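Steps 1 to 3 load both candidate descriptors and then pick one, which avoids a branch on the TI bit. The selector fields involved follow the architectural layout (index in bits 15:3, TI in bit 2, RPL in bits 1:0). A small sketch with illustrative function names:

```python
def decode_selector(selector: int):
    """Split a segment selector into (index, ti, rpl)."""
    selector &= 0xFFFF
    return selector >> 3, (selector >> 2) & 1, selector & 3


def pick_descriptor(selector: int, gdt_entry: int, ldt_entry: int) -> int:
    """Choose between two already-loaded descriptors using the TI bit."""
    _, ti, _ = decode_selector(selector)
    return ldt_entry if ti else gdt_entry


def descriptor_offset(selector: int) -> int:
    """Byte offset of the descriptor within its table (index * 8)."""
    return selector & 0xFFF8
```

For instance, selector 0x2B decodes to index 5 in the GDT with RPL 3, at byte offset 0x28.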


==========================================================
PART 4 — INTER-PRIVILEGE RETURN (outer ring)
==========================================================

@macro_iret_privchange  (UROM_209D)
------------------------------------
Reached when the new CS selector has a higher RPL than the current CPL
(returning to less-privileged code). After the normal CS pop the stack
also contains SS:ESP for the outer ring.

1. UOP.62A(0E.102, 0E.102): commit pending CS state.
2. Read new CS RPL from descriptor cache → update REG.31 generation.
3. REG.37 = 0xFF (reset state).
4. USEGOP2(CS, type 4): finalize CS descriptor cache.
5. USEGOP2(SS, type 5): finalize SS descriptor cache (already loaded
from the extended inter-privilege frame on the stack).
[The SS:ESP values are loaded earlier in the stack-pop sequence via
@macro_iret_v86_return or the priv-change branch of samepriv.]
Falls through to loc_20A6.


loc_20A6  (UROM_20A6)  — task-switch / inter-priv serialization
----------------------------------------------------------------
1. STRD(0,0) + UOP.134: pipeline drain and memory fence.
2. Read internal reg 0E.090, set bit 1 (mark serialization in progress).
3. Spin loop (loc_20AD): poll internal reg 0E.022 bit 5 until set
(serialization acknowledged by hardware).
4. Fall through to @macro_iret_exit.

@macro_iret_exit  (UROM_20B1)
------------------------------
1. Fl2.Fl3: SIGEVENT(TMP3, 0xE7)  → instruction architecturally complete.
2. UOP.20A(CONST.14.109): read VIF/VIP permission mask.
3. AND with committed SystemFlags; subtract 0x00180000 (VIF|VIP mask).
If zero → #GP  (VIF/VIP inconsistency in final check).
4. EOM.Fl2: UOP.0D8(0, TMP3)  → redirect front-end to new EIP.
(This is the task-switch exit EIP commit.)
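Steps 2 and 3 of @macro_iret_exit describe a subtract-and-test-zero check. A sketch of that arithmetic, assuming 32-bit wrap-around and treating the permission mask as a plain parameter (its actual value from CONST.14.109 is not known here):

```python
MASK32 = 0xFFFFFFFF
EFLAGS_VIF = 1 << 19
EFLAGS_VIP = 1 << 20
VIF_VIP = EFLAGS_VIF | EFLAGS_VIP   # 0x00180000


def vif_vip_fault(system_flags: int, perm_mask: int) -> bool:
    """True when (flags & mask) - 0x00180000 == 0, i.e. the #GP condition."""
    return ((system_flags & perm_mask) - VIF_VIP) & MASK32 == 0
```

If the permission mask covers exactly the two bits, the result is zero only when VIF and VIP are both set in the committed flags.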


==========================================================
PART 5 — RETURN TO V86 MODE (VM=1 in new EFLAGS)
==========================================================

@macro_iret_v86_return  (UROM_2C95)
-------------------------------------
Reached when new EFLAGS has VM=1 (returning to a Virtual-8086 task).
The interrupt frame for a V86 return is larger than the standard 3-word
frame: in addition to EIP, CS and EFLAGS it carries ESP, SS, ES, DS, FS and
GS. The full frame on the stack (high to low) is:
GS, FS, DS, ES, SS, ESP, EFLAGS, CS, EIP

1. Save old ESP (TMP5 = ESP).
2. BTS REG.37 bit 17 (mark V86 transition).
3. Advance ESP_20 by TMP9 (skip past the basic EIP/CS/EFLAGS).
4. Pop additional frame words from SS:ESP:
TMP4  = new ESP  (operand-size wide)
TMP8  = new SS selector (16-bit)
TMP1  = new ES selector
TMP0  = new DS selector
TMP9  = new FS selector
TMP6  = new GS selector
5. Build CS descriptor:
WRSEGFLD(CS, selector=TMP2)
USEGOP4(CS, type 9)  (V86 code segment)
UOP.62A(0E.102, TMPB): commit CS.
6. Load DS, ES, FS, GS via USEGOP3/USEGOP1 (real-mode style, no
descriptor validation — V86 segments are base=selector*16).
7. TMP3 = zero-extended TMP3 (EIP is 16-bit in V86).
8. ESP_OPSZ = TMP4 (restore new outer-ring ESP).
9. CR_CPL = 3  (V86 mode always runs at CPL=3).
10. Load SS: WRSEGFLD(SS, TMP8); USEGOP4(SS, type 0x0A).
11. Jump to loc_209E (continue at inter-privilege finalization).
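Step 6 relies on the V86 rule that a segment base is simply the selector shifted left by four, with no descriptor lookup. As a quick illustration (the function name is hypothetical):

```python
def v86_linear(selector: int, offset: int) -> int:
    """Linear address in Virtual-8086 mode: base = selector * 16."""
    return ((selector & 0xFFFF) << 4) + (offset & 0xFFFF)
```

For example, `v86_linear(0xB800, 0)` gives 0xB8000, and the highest reachable address, `v86_linear(0xFFFF, 0xFFFF)`, is 0x10FFEF.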


==========================================================
PART 6 — HARDWARE TASK SWITCH (NT=1)
==========================================================

Overview: IRET with EFLAGS.NT=1 is not a stack-pop but a full hardware
task switch back to the previous task (identified by the back-link field
at offset 0 of the current TSS). Three subroutines are involved:
sub_tss_save      — save current CPU state into current TSS
@tail_tss_load_continue — clear busy bit of old TSS, then load new TSS
tail_tss_load     — load new task state from new TSS


@macro_iret_tss_link  (UROM_21BD)
-----------------------------------
1. Set up REG.37 task-switch flags.
2. Read current TR selector.
3. Load word at TR:0 (the back-link field = previous task's TSS selector).
4. Load the GDT descriptor for the back-link selector.
5. USEGOP4(type=0xA2): validate it is a TSS descriptor.
6. TRANSPORTUIP → @tail_tss_load_continue:
Encode the return address for after sub_tss_save+tail_tss_load_continue
completes. This is the microcode "call" mechanism.
7. TMP4 = EIP_30 + REG.31  (current EIP for saving into TSS).
8. TMP9 = TMPA & 0x003F3FD7  (mask current EFLAGS for saving).
9. CR_SCP15 = 16  (record current operand size for TSS stride calculation).
10. Jump to sub_tss_save.
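
To see what the mask in step 8 keeps, it can be decoded against the architectural EFLAGS bit positions. The sketch below is an illustration only. The table and function names are hypothetical, and "R1" stands for the reserved always-one bit 1. Under these assumptions the only named flag the mask drops is NT, which would be consistent with the architectural rule that an IRET-initiated task switch clears NT in the EFLAGS image saved to the old TSS.

```python
SAVE_MASK = 0x003F3FD7  # constant from step 8

# Architectural EFLAGS bit positions; R1 is the reserved always-one bit.
EFLAGS_BITS = {
    "CF": 0, "R1": 1, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8, "IF": 9,
    "DF": 10, "OF": 11, "IOPL0": 12, "IOPL1": 13, "NT": 14, "RF": 16,
    "VM": 17, "AC": 18, "VIF": 19, "VIP": 20, "ID": 21,
}


def split_by_mask(mask):
    """Return (kept, dropped) lists of flag names for the given AND mask."""
    kept = [n for n, b in EFLAGS_BITS.items() if (mask >> b) & 1]
    dropped = [n for n, b in EFLAGS_BITS.items() if not (mask >> b) & 1]
    return kept, dropped


kept, dropped = split_by_mask(SAVE_MASK)
print("kept:   ", kept)
print("dropped:", dropped)
```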

------------------------------------------------------------------------
SUBROUTINE: @tail_tss_load_continue  (UROM_3100)
------------------------------------------------------------------------
Clears the busy bit (B-bit) in the old task's TSS GDT descriptor.
This marks the old task as no longer running, allowing it to be entered
again in the future.

1. TMP2 = current TR selector value (old task).
2. TMP7 = TMP2 & 0x1F8  (byte offset of descriptor in GDT = index × 8,
lower bits masked off).
3. UOP.F0B(CONST.6, TMP7, GDTR): acquire bus lock on the GDT entry.
4. STRD(0,0) + UOP.134: drain store buffer, serialize pipeline.
5. XLOAD.DSZ64.1(GDTR, TMP2): atomically load 64-bit old-TSS descriptor.
6. Extract high 32 bits; BTR bit 9 (= bit 41 of the descriptor = B bit
in the type field of a TSS descriptor, SDM §3.5).
7. Recombine 64-bit descriptor with B-bit cleared.
8. STRD(modified_descriptor) + STA.DSZ64.1(GDTR, TMP7):
atomically write back the modified descriptor.
9. TMP6 = new-task TSS base address (from USEGOP2 on the back-link selector).
10. TMP5 = 0  (clear temporary).
11. Jump to tail_tss_load.
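
Steps 6 to 7 amount to a read-modify-write on the upper half of the descriptor. A minimal sketch of just the bit manipulation, with a hypothetical function name and no attempt to model the bus lock or the store-buffer drain:

```python
BUSY_BIT_HI = 9  # bit 9 of the high dword = bit 41 of the 64-bit descriptor


def clear_tss_busy(descriptor):
    """Return the 64-bit TSS descriptor with its B (busy) bit cleared."""
    lo = descriptor & 0xFFFFFFFF
    hi = (descriptor >> 32) & 0xFFFFFFFF
    hi &= ~(1 << BUSY_BIT_HI) & 0xFFFFFFFF   # the BTR in step 6
    return (hi << 32) | lo                   # the recombine in step 7


# Busy 32-bit TSS (type 0xB, present) with limit 0x67 becomes available (0x9).
busy = (0x00008B00 << 32) | 0x00000067
print(hex(clear_tss_busy(busy)))
```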


------------------------------------------------------------------------
SUBROUTINE: tail_tss_load  (UROM_1056)
------------------------------------------------------------------------
Loads the new task's CPU state from its TSS and performs final setup.
(Also the target of TRANSPORTUIP return from @tail_tss_load_continue.)

Setup:
1. CR_SCP14 = EIP_30  (save current EIP for debug).
2. Set DebugCtlMSR.BTF (bit 14)  — arm Branch Trap Flag so the first
branch in the new task triggers a #DB if single-step was active.
3. REG.27 = 0  (will accumulate new TSS base).
4. Read new-task TSS descriptor type field (TMPC).
5. BTEST TMPC bit 11: 0 = 286-style (16-bit) TSS, 1 = 386+ (32-bit) TSS.
6. TMP9 = stride: 2 (16-bit TSS) or 4 (32-bit TSS).
7. Save TSS-type flag to CR 0E.109.

8. Spin-wait (ALIAS.065) — wait for TSS read access.
9. UOP.1CA — synchronization point.
10. Compute TSS body start offset TMP7 (depends on 16/32-bit TSS type).
11. If 32-bit TSS: load CR3 from TSS offset 0x1C → TMP8.

12. Sequential load loop from new-task TSS body:
Each field: LOAD 16-bit value from TR:TMP7; TMP7 += TMP9.
Fields loaded in order:
TMP3 = new EIP  → SIGEVENT(TMP3, 0x5C) commits EIP_30.
TMP4 = new EFLAGS (lower 16 bits)
EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI
(each: zero-fill upper half via MOVE(0x1FF), then load low word)
SystemFlags = EFLAGS merge (BTR RF, then set from TMP5/TMP4)
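CR0 = BTS(CR0, bit 1)  (bit 1 is MP in the architectural CR0 layout; a
task switch architecturally sets TS, bit 3, so the internal bit
numbering may differ)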
ES selector → load and write ES descriptor fields
CS selector → extract RPL → write to CR_CPL → write CS descriptor
SS selector → write SS descriptor
DS selector → write DS descriptor
If 32-bit TSS: FS selector, GS selector
If 16-bit TSS (286): skip FS/GS → force FS=GS=null at loc_10CD.
LDTR selector → write LDTR descriptor (type 7 = LDT)
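
The TSS-type test in setup step 5 and its consequences for the load loop can be sketched as below. The function name is hypothetical, and the input is assumed to be the high dword of the TSS descriptor, where the 4-bit type field occupies bits 11:8.

```python
def tss_kind(desc_high_dword):
    """Decode TSS width from bit 11 of the descriptor's high dword.

    Returns (is_32bit, stride, selectors), where selectors lists the data
    and code segment selectors read from the TSS body in load order.
    """
    is_32bit = bool((desc_high_dword >> 11) & 1)
    stride = 4 if is_32bit else 2
    selectors = ["ES", "CS", "SS", "DS"]
    if is_32bit:
        selectors += ["FS", "GS"]   # a 16-bit TSS has no FS/GS slots
    return is_32bit, stride, selectors


print(tss_kind(0x00008900))  # type 0x9: available 32-bit TSS
print(tss_kind(0x00008100))  # type 0x1: available 16-bit TSS
```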

CR3 / paging check:  (UROM_10DD)
Only if 32-bit TSS and CR0.PG=1:
If new CR3 == current CR3 → skip (no TLB flush needed).
Otherwise write new CR3, then:
TRANSPORTUIP → loc_10F0 (encode return address for after flush)
Check CR4 for PAE or PSE:
If PAE/PSE active → call sub_pae_pdpte (validate PDPTEs first).
Otherwise fall into loc_10F0 directly.

loc_10F0:  (UROM_10F0)
TRANSPORTUIP → loc_10F5 (encode second-level return address).
Jump to sub_tlbflush_and_a20.
Return here after TLB flush; re-read TSS type into TMPC.
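
The control flow of the CR3 check and loc_10F0 can be modelled as a small trace generator. This is a rough sketch, not a transcription of the microcode: the function and its arguments are hypothetical, and each TRANSPORTUIP is reduced to a variable holding the label to resume at.

```python
def cr3_switch_trace(is_32bit_tss, paging_on, new_cr3, cur_cr3, pae_or_pse):
    """Return the ordered list of steps taken for a given machine state."""
    trace = []
    if not (is_32bit_tss and paging_on):
        return trace                      # CR3 is not touched at all
    if new_cr3 == cur_cr3:
        return trace                      # same address space, no flush
    trace.append("write CR3")

    resume_after_pdpte = "loc_10F0"       # TRANSPORTUIP -> loc_10F0
    if pae_or_pse:
        trace.append("sub_pae_pdpte")     # returns to the encoded address
    trace.append(resume_after_pdpte)      # reached by return or fall-through

    resume_after_flush = "loc_10F5"       # TRANSPORTUIP -> loc_10F5
    trace.append("sub_tlbflush_and_a20")
    trace.append(resume_after_flush)
    return trace


print(cr3_switch_trace(True, True, 0x1000, 0x2000, True))
```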

Descriptor validation:  (UROM_10F8)
After loading and (if needed) flushing TLB, validate all loaded
segment descriptors against the GDT/LDT:

1. USEGOP2(CS), USEGOP2(SS): verify CS/SS descriptor cache consistency.
2. Load LDTR → validate LDT descriptor (type 0xE7).
3. Commit SystemFlags = TMP5 (new EFLAGS fully applied).
4. Compute new effective CPL from CS.RPL and EFLAGS.IOPL; write CR_CPL.
5. Validate SS descriptor:
Load 64-bit from GDT+LDT; UOP.263 selects correct table.
USEGOP4(type=0x2A): must be a writable data segment.
USEGOP2(SS, type 5): finalize SS descriptor cache.
6. Validate DS: load both GDT+LDT entries; UOP.263 selects.
USEGOP3/USEGOP1(DS, type 0x10): data segment.
7. Validate ES: same pattern.
8. If 32-bit TSS: validate FS, GS the same way.
9. Validate CS (loc_114A):
Load both GDT+LDT entries; UOP.263 selects.
USEGOP4(type=0x79): must be a code segment.
USEGOP2(CS, type 4): mark accessed.
UOP.62A(0E.102, TMPB): commit CS descriptor.
UOP.CC1: signal CS load.
Compute REG.31 (CS descriptor generation counter).

Error-code push check:  (UROM_1160)
Read CR_SCP15 bits [5:4] = saved IOPL comparison.
If the task switch was triggered by an exception that has an error code
(IOPL bits indicate this), push the error code onto the new stack:
If 32-bit TSS: push 32-bit error code, ESP -= 4.
If 16-bit TSS: push 16-bit error code, ESP -= 2.

Debug register finalization:  (UROM_1175)
1. DR7 &= 0xFFFFFEAA: clear the local breakpoint-enable bits L0–L3
(bits 0, 2, 4, 6) and LE (bit 8); the global enables are left alone.
New task starts with local hardware breakpoints disabled.
2. If 32-bit TSS: load word at TR:0x64 (T-bit flag from TSS).
If T-bit set: set DR6.BT (bit 15) → a #DB will fire on first
instruction of the new task (TSS debug-trap feature, SDM §7.3.1).
3. ALTDR6 = (ALTDR6 | T-bit-in-DR6.BT) & ~bit14  (clear BS bit).
4. CR_SMM_status = 1  (mark task-switch complete for SMM).
5. BTR DebugCtlMSR bit 9  (clear BTF — armed at entry, now cleared).
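
The DR7 mask and the DR6 update can be checked against the architectural bit positions. The sketch below uses hypothetical helper names and models only steps 1 to 3. In the architectural layout, DR7 bits 0, 2, 4, 6 are L0 to L3 and bit 8 is LE, while DR6 bit 14 is BS and bit 15 is BT.

```python
DR7_LOCAL_BITS = (0, 2, 4, 6, 8)   # L0, L1, L2, L3, LE
DR6_BS = 14                        # single-step
DR6_BT = 15                        # task-switch trap


def dr7_after_task_switch(dr7):
    """Clear all local enables, keep the global ones."""
    clear = sum(1 << b for b in DR7_LOCAL_BITS)
    return dr7 & ~clear & 0xFFFFFFFF


def dr6_after_task_switch(dr6, tss_t_bit):
    """OR the TSS T bit into bit 15, then clear bit 14."""
    dr6 |= (tss_t_bit & 1) << DR6_BT
    dr6 &= ~(1 << DR6_BS) & 0xFFFFFFFF
    return dr6


print(hex(dr7_after_task_switch(0xFFFFFFFF)))
print(hex(dr6_after_task_switch(0x4000, 1)))
```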

Final dispatch:  (UROM_118E)
Test REG.37 bit 16 (was this triggered by a bus-lock / serialization
path?):
If set → loc_20A6 (serialization loop).
Otherwise → @macro_iret_exit.




==========================================================
FAULT EXIT
==========================================================

@macro_iret_fault_gp  (UROM_35BA)
-----------------------------------
Reached from:
- Entry size check failure (1BE5)
- NT-flag stack-frame validation failure (1BF2)
- Illegal EFLAGS bits (IOPL/VM) after merge (1C0C)
- Final VIF/VIP consistency check failure (20B6)
- Any descriptor validation overflow (via UROM_3A00)
- PAE PDPTE reserved-bit violation (via loc_352C at 352C, which falls
through to 35BA's pattern or signals independently)

1. TMP7 = 0xC1  (exception code: #GP vector 0x0D in P6 internal format).
2. TMP6 = UOP.120(0x0D, 0x0D)  — prepare exception descriptor for #GP.
3. EOM: SIGEVENT(TMP6, TMP7)   — deliver #GP(0) to the exception handler.
Microcode execution ends; hardware begins exception delivery.

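------------------------------------------------------------------------
SUBROUTINE: sub_tss_save
------------------------------------------------------------------------
From Pentium M 6D8.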

Saves all current CPU architectural state into the current (old) task's TSS.

1. Reset REG.37 flags for save operation.
2. Check new-task TSS descriptor size (16-bit or 32-bit).
UOP.CC1: initialize sequential-write pointer into new-task TSS body.
3. Check current (old) TR descriptor size to determine field stride:
stride TMPA = 2 (for 286-style 16-bit TSS) or 4 (32-bit TSS).
4. Spin-wait (ALIAS.178) — wait for TSS write access to be granted.
5. UOP.1CA — synchronization point after spin.
6. Check REG.37 bit 21: if already saved (re-entry guard) → skip to loc_27CE.
7. BTS REG.37 bit 21 (mark save-in-progress).

8. Sequential store loop into current TSS body (via STRD+STA pairs):
Each field: STRD stages the register value; STA writes it to
TR:TMP7; then TMP7 += TMPA.
Fields written in order (same order as the TSS layout in the SDM):
Offset 0x20 in a 32-bit TSS, 0x0E in a 16-bit TSS: EIP
Offset +:    EFLAGS (masked via TMP9)
Offset +:    EAX
Offset +:    ECX
Offset +:    EDX
Offset +:    EBX
Offset +:    ESP
Offset +:    EBP
Offset +:    ESI
Offset +:    EDI
Offset +:    ES selector  (RDSEGFLD → STRD → STA)
Offset +:    CS selector
Offset +:    SS selector
Offset +:    DS selector
If 32-bit TSS:
Offset +:  FS selector
Offset +:  GS selector

9. BTR REG.37 bit 21 (clear save-in-progress).

10. loc_27CE: set up read access to the new task's TSS.
Read TMP2 = base of new-task TSS, TMP5 = descriptor type.
UOP.CC9 × 2: initialize sequential-read pointer into new-task TSS body.
11. JMP_INDIR TMPC — return via TRANSPORTUIP to @tail_tss_load_continue.
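
The offsets produced by the store loop in step 8 follow directly from the start offset and the stride. A minimal sketch with hypothetical names, using 0x20 and 0x0E as the architectural offsets of EIP/IP in the 32-bit and 16-bit TSS formats:

```python
GP_FIELDS = ["EIP", "EFLAGS", "EAX", "ECX", "EDX", "EBX",
             "ESP", "EBP", "ESI", "EDI"]


def tss_save_offsets(is_32bit):
    """Return {field: byte offset in the TSS} in the order of the store loop."""
    stride = 4 if is_32bit else 2
    start = 0x20 if is_32bit else 0x0E
    fields = GP_FIELDS + ["ES", "CS", "SS", "DS"]
    if is_32bit:
        fields += ["FS", "GS"]          # only present in the 32-bit format
    return {name: start + i * stride for i, name in enumerate(fields)}


for name, off in tss_save_offsets(True).items():
    print(f"{name:7s} 0x{off:02X}")
```

Calling tss_save_offsets(False) gives the 286-style layout, which ends at the DS slot.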

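------------------------------------------------------------------------
SUBROUTINE: sub_tlbflush_and_a20
------------------------------------------------------------------------
From Pentium M 6D8.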

Called via TRANSPORTUIP (return address in TMPE) when CR3 changes.

1. STRD(0,0) + UOP.131(LINSEG): perform full TLB flush.
2. Write internal CR 0E.154 = 0x20 (TLB-flush-complete marker).
3. SIGEVENT(0xCC): signal TLB flush completion to the memory subsystem.
4. Read CR_A20MASK; BTR bit 7; BTR bit 16:
Reset A20 gate override bits — new task uses normal A20 state.
5. Write back CR_A20MASK.
6. JMP_INDIR TMPE → return to TRANSPORTUIP caller (loc_10F0).

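------------------------------------------------------------------------
SUBROUTINE: sub_pae_pdpte
------------------------------------------------------------------------
From Pentium M 6D8.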

Called when CR4.PAE or CR4.PSE is set and CR3 changes.
Loads and validates the four Page Directory Pointer Table Entries.

1. Set "PAE in progress" flag (internal CR 0E.054 bit 7).
2. Combine new CR3 page-frame (from TMP2 & 0xFFFFF000) with low bits of
current internal PAE register → write to internal CR 0E.051.
3. TMPC = 4  (loop counter: 4 PDPTEs to validate).
4. TMP1 = 0x162  (base index of PDPTE storage CRs).
5. TMP0 = new CR3 & 0x00000FE0  (byte offset into PDPT structure).

6. Validation loop (loc_2AF1):
TMP8 = LOAD.DSZ64.0(LINSEG, TMP0)   ; load one 64-bit PDPTE
TMP0 += 8
Check TMP8 & 0x000001E6:
0x1E6 = bits 8:5 and 2:1 of the PDPTE = reserved bits that must be 0.
If any set → jump to loc_352C  (#GP — reserved bit violation).
Store low 32 bits: MOVETOCREG(TMP1, TMP8); TMP1 += 1.
Check high 32 bits & 0x1FF:
Upper bits of PDPTE also reserved.
If any set → jump to loc_352C  (#GP).
Store high 32 bits: MOVETOCREG(TMP1, TMP6); TMP1 += 1.
TMPC -= 1; loop if not zero.

7. Copy loop (loc_2B05): copy all 8 32-bit halves from CRs 0x162–0x169
to CRs 0x063–0x06A (final MMU-visible PDPTE register file):
8 iterations: MOVETOCREG(0x63+i, MOVEFROMCREG(0x162+i)).

8. BTR internal CR 0E.054 bit 7  (clear PAE-in-progress).
9. JMP_INDIR TMP7 → return to TRANSPORTUIP caller (loc_10F0).
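
The address arithmetic and the low-dword reserved-bit test of the validation loop can be sketched as follows. The names are hypothetical, and the sketch covers only the 0x1E6 check: the high-dword check, the present bit and the staging in control registers are not modelled.

```python
# Reserved bits in the low dword of a PAE PDPTE: bits 2:1 and 8:5.
RESERVED_LOW = sum(1 << b for b in (1, 2, 5, 6, 7, 8))


def pdpt_base(cr3):
    """Combine the page frame (step 2) with the in-page offset (step 5)."""
    return (cr3 & 0xFFFFF000) | (cr3 & 0x00000FE0)


def first_bad_pdpte(entries):
    """Return the index of the first entry with a reserved low bit set,
    or None if all entries pass. A hit corresponds to the jump to loc_352C.
    """
    for i, entry in enumerate(entries):
        if entry & RESERVED_LOW:
            return i
    return None


print(hex(RESERVED_LOW))
print(hex(pdpt_base(0x12345FE7)))
print(first_bad_pdpte([0x1, 0x1001, 0x3, 0x1]))
```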


The author is not affiliated with, endorsed by, or sponsored by Intel Corporation or its affiliates. All trademarks, including but not limited to Intel, Pentium, and any other registered or unregistered marks mentioned herein, are the property of their respective owners. Their use in this context is solely for descriptive and informational purposes and constitutes nominative fair use under applicable trademark laws.