General notes
- macro_ - beginning of a macro-operation
- sub_ - subroutine, to be used with TRANSPORTUIP
- @... - local labels (subroutine or macro-op)
Macrooperations usually begin with BOM, but many do not have this flow marker. It looks like some micro-operations are executed before starting MSROM code, e.g. lds vs lss.
In lds you can see it is using TMP2 without initializing it first:
UROM_348E macro_lds: ; xref:
UROM_348E TMP0 = LOAD.M0.SC1.DSZ64.0(CONST.4 , TMP2 , GDTR )
UROM_3490 TMP1 = LOAD.M0.SC1.DSZ64.0(CONST.4 , TMP2 , LDTR )
UROM_3491 TMPB = USEGOP3 (CONST.6 , TMP2 , DS )
UROM_3492 TMP0 = _CMOV2.NC (TMP0 , TMP1 ) ; Returns arg2 if cond met ? Difference to CMOV unknown
UROM_3494 TMP1 = INTEXTRACT.HI32(TMP0 , TMPB )
UROM_3495 USEGOP1 (TMP1 , TMP0 , CONST.04.010 , DS )
UROM_3496 EOM REG_ddd = MOVE.DSZ? (CONST.0 , TMP3 , OA.4, U4.00000001 /* U4:0000 0000 0000 0000 0000 0000 0000 0001 */)
In lss you can see similar code, but it is initializing TMP2:
UROM_32A2 macro_lss: ; xref:
UROM_32A2 TMP2 = LEA.M40.SC1.DSZ32(REG.D0 , REG.D0 , CONST.05.008 , OA.9, U1.1)
UROM_32A4 TMP3 = LOAD.M40.SC1.DSZ?.0(REG.D0 , REG.D0 , CONST.05.008 , OA.D, U1.1)
UROM_32A5 TMP2 = LOAD.M40.SC1.DSZ16.0(CONST.0F.010 , TMP2 , OA.8, U1.1)
UROM_32A6 TMP0 = LOAD.M0.SC1.DSZ64.0(CONST.04.002 , TMP2 , GDTR )
UROM_32A8 TMP1 = LOAD.M0.SC1.DSZ64.0(CONST.04.002 , TMP2 , LDTR , U2.08)
UROM_32A9 TMP0 = _CMOV2.NC (TMP0 , TMP1 ) ; Returns arg2 if cond met ? Difference to CMOV unknown
UROM_32AA TMP1 = INTEXTRACT.HI32(TMP0 , CONST.0 )
UROM_32AC USEGOP4 (TMP1 , TMP0 , CONST.04.02A , SEG_02 , U2.28)
UROM_32AD UOP.0D4 (CONST.0 , CONST.0 )
UROM_32AE REG_ddd = MOVE.DSZ? (CONST.0 , TMP3 , OA.4)
UROM_32B0 EOM USEGOP2 (CONST.04.005 , CONST.04.005 , SS , U2.20, U4.00000001 /* U4:0000 0000 0000 0000 0000 0000 0000 0001 */)
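Both listings load a descriptor candidate from GDTR and another from LDTR and then keep one of them with _CMOV2.NC. A minimal Python sketch of that lookup follows, assuming the condition tracks the selector's TI bit (bit 2) as in the architectural selector format. The names `fetch_descriptor`, `hi32`, `gdt` and `ldt` are hypothetical and do not come from the listing, and the mapping of INTEXTRACT.HI32 to "upper dword" is a guess.

```python
def fetch_descriptor(selector: int, gdt: bytes, ldt: bytes) -> int:
    """Return the 64-bit descriptor a selector refers to.

    Assumption: the table is chosen by the TI bit (bit 2), mirroring the
    GDTR/LDTR double load followed by a conditional move in the listing.
    """
    offset = selector & 0xFFF8              # index * 8, RPL and TI masked off
    table = ldt if selector & 0x4 else gdt  # TI=1 selects the LDT
    if offset + 8 > len(table):
        raise IndexError("selector beyond table limit")
    return int.from_bytes(table[offset:offset + 8], "little")


def hi32(descriptor: int) -> int:
    """Upper dword of a descriptor (assumed analogue of INTEXTRACT.HI32)."""
    return (descriptor >> 32) & 0xFFFFFFFF
```

For example, selector 0x08 reads entry 1 of the GDT, while selector 0x0C reads entry 1 of the LDT.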
Returns same as sha1_transform, but with SHA-1 context initialisation
sha1_transform
TMP5 = pointer to block (64 bytes); TMP6 = block count?; TMP7 = context
returns: pseudo-hash in TMP registers and in the TMP7 context dwords: TMPA=TMP7[0], TMPB, TMP0, TMP1, TMP8=TMP7[4]
It is not known how sha1 finalisation is performed.
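For orientation, here is a sketch of the standard SHA-1 compression function in Python. It assumes sha1_transform implements the textbook algorithm over one 64-byte block and a five-dword context, which the notes suggest but do not prove. The result is a "pseudo-hash" in the sense above: no padding or length finalisation is applied. All function names are illustrative.

```python
import struct

SHA1_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def rol32(x: int, n: int) -> int:
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def sha1_transform(state, block: bytes):
    """Apply one SHA-1 compression round to a 5-dword state (no padding)."""
    if len(block) != 64:
        raise ValueError("block must be 64 bytes")
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rol32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    a, b, c, d, e = state
    for t in range(80):
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        temp = (rol32(a, 5) + f + e + k + w[t]) & 0xFFFFFFFF
        a, b, c, d, e = temp, a, rol32(b, 30), c, d
    return tuple((s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d, e)))
```

Feeding a manually padded single block through this function reproduces the regular SHA-1 digest, which is one way to check whether the microcode routine is the plain compression function.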
rc4_crypt
TMP5 = pointer to data; TMP6 = data count (UOP.3B4(TMP6, TMPC ???)); TMP7 = key pointer
Dummy buffer is provided if data are to be discarded (initial 0x200-cycle stream roll).
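A minimal Python sketch of RC4 with an initial discarded stretch of keystream, assuming the "0x200-cycle stream roll" means that the first 0x200 keystream bytes are generated against a dummy buffer and thrown away. The function names mirror the routine names in these notes, but the code is textbook RC4, not a transcription of the microcode.

```python
def rc4_keysetup(key: bytes) -> list:
    """Standard RC4 key scheduling (KSA)."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) & 0xFF
        s[i], s[j] = s[j], s[i]
    return s


def rc4_crypt(state: list, data: bytes, i: int = 0, j: int = 0):
    """XOR data with the keystream; returns (output, i, j) so calls can chain."""
    out = bytearray()
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + state[i]) & 0xFF
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) & 0xFF])
    return bytes(out), i, j


def rc4_with_roll(key: bytes, data: bytes, roll: int = 0x200) -> bytes:
    """Discard `roll` keystream bytes via a dummy buffer, then process data."""
    state = rc4_keysetup(key)
    _, i, j = rc4_crypt(state, bytes(roll))  # output of the roll is discarded
    out, _, _ = rc4_crypt(state, data, i, j)
    return out
```

With `roll=0` the function is plain RC4, so it can be checked against published test vectors before comparing it with data from a real update.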
patch_rc4_keysetup
tmp2 = patchsize; TMP3=PATCHptr; tmp7 = keyptr;
Macrooperation saves/restores the x87 FPU, MMX technology, XMM, and MXCSR registers from the 512-byte memory image specified in the source operand.
part_macro_fxsave_xmm (686, 6D8, ...) uses an unusual register map:
UROM_260D TMPD = ADD.DSZ8 (TMPD , CONST_16+0A0 , OA.4)
UROM_260E UOP.432 (CONST_0 , AL /*XMM0LO*/ , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_2610 UOP.F46 (CONST_4 , TMPD , OA.8, U1.1)
UROM_2611 UOP.432 (CONST_0 , TMP6 /*XMM0HI*/ , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_2612 UOP.F46 (CONST_04+008 , TMPD , OA.8, U1.1)
.org 0x2614
.utripletbits 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100
UROM_2614 UOP.432 (CONST_0 , CL /*XMM1LO*/ , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_2615 UOP.F46 (CONST_04+010 , TMPD , OA.8, U1.1)
UROM_2616 UOP.432 (CONST_0 , TMP7 /*XMM1HI*/ , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_2618 UOP.F46 (CONST_04+018 , TMPD , OA.8, U1.1)
UROM_2619 UOP.432 (CONST_0 , DL , U4.00000004 /* U4:0000 0000 0000 0000 0000 0000 0000 0100 */)
UROM_261A UOP.F46 (CONST_04+020 , TMPD , OA.8, U1.1)
SSE patent: patents.google.com/patent/US6721866B2
FXSAVE patent: US6898700B2
Info on part_macro_fxsave_alt:
It is probably an optimization along the lines of "fast strings" (IA32_MISC_ENABLE bit 0), except for the FPU. The available documentation for ROB_CR_BKUPTMPDR6 mentions a fast-strings bit 2, which is the bit this code tests. It probably corresponds to MSR 0x1E0; whether it is set to 0x4 on Pentium still needs to be checked.
It probably speeds up the stores in some way and also uses a different ordering. This may already have been seen and written up elsewhere; the reference still has to be found.
Info on REG_D0: FXSAVE/FXRSTOR/LDMXCSR/STMXCSR accept only memory operands, in any addressing form, i.e. ModRM.mod != 11. Is it possible to get the resulting address from REG_D0 via LEA.M40.SC1.DSZ32(REG.D0 , REG.D0 , CONST_05+008 , OA.9)? Maybe CONST_05/OA.9 plays a role.
The numerical value held in REG.D0 is not known. It could be the ModRM byte.
| Offset | Size (bytes) | Field | Description |
| --- | --- | --- | --- |
| 0x000 | 2 | FCW | x87 FPU Control Word |
| 0x002 | 2 | FSW | x87 FPU Status Word |
| 0x004 | 1 | FTW | x87 FPU Tag Word (abridged, 8 bits) |
| 0x005 | 1 | RESERVED | Reserved |
| 0x006 | 2 | FOP | x87 FPU Opcode |
| 0x008 | 4 | FPU_IP | x87 FPU Instruction Pointer Offset |
| 0x00C | 2 | FPU_CS | x87 FPU Instruction Pointer Selector |
| 0x00E | 2 | RESERVED | Reserved |
| 0x010 | 4 | FPU_DP | x87 FPU Data Pointer Offset |
| 0x014 | 2 | FPU_DS | x87 FPU Data Pointer Selector |
| 0x016 | 2 | RESERVED | Reserved |
| 0x018 | 4 | MXCSR | MXCSR Register State |
| 0x01C | 4 | MXCSR_MASK | MXCSR Mask |
| 0x020 | 16 | ST0/MM0 | x87 FPU / MMX Register 0 |
| 0x030 | 16 | ST1/MM1 | x87 FPU / MMX Register 1 |
| 0x040 | 16 | ST2/MM2 | x87 FPU / MMX Register 2 |
| 0x050 | 16 | ST3/MM3 | x87 FPU / MMX Register 3 |
| 0x060 | 16 | ST4/MM4 | x87 FPU / MMX Register 4 |
| 0x070 | 16 | ST5/MM5 | x87 FPU / MMX Register 5 |
| 0x080 | 16 | ST6/MM6 | x87 FPU / MMX Register 6 |
| 0x090 | 16 | ST7/MM7 | x87 FPU / MMX Register 7 |
| 0x0A0 | 16 | XMM0 | XMM Register 0 |
| 0x0B0 | 16 | XMM1 | XMM Register 1 |
| 0x0C0 | 16 | XMM2 | XMM Register 2 |
| 0x0D0 | 16 | XMM3 | XMM Register 3 |
| 0x0E0 | 16 | XMM4 | XMM Register 4 |
| 0x0F0 | 16 | XMM5 | XMM Register 5 |
| 0x100 | 16 | XMM6 | XMM Register 6 |
| 0x110 | 16 | XMM7 | XMM Register 7 |
| 0x120 | 224 | RESERVED | Reserved (Padding to 512 bytes) |
See wiki.osdev.org/SSE#FXSAVE_and_FXRSTOR
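The layout above can be exercised with a small Python parser. This is a sketch under stated assumptions: it covers the 32-bit (non-REX.W) image format, treats FTW as the one-byte abridged tag followed by a reserved byte as in the Intel SDM, and uses illustrative names (`HEADER`, `parse_fxsave`) that do not appear in the microcode.

```python
import struct

# Offsets 0x000-0x01F: FCW, FSW, FTW(+pad), FOP, FIP, FCS(+pad),
# FDP, FDS(+pad), MXCSR, MXCSR_MASK. Little-endian, 32 bytes in total.
HEADER = struct.Struct("<HHBxHIH2xIH2xII")


def parse_fxsave(image: bytes) -> dict:
    """Split a 512-byte FXSAVE image into its documented fields."""
    if len(image) != 512:
        raise ValueError("FXSAVE image must be exactly 512 bytes")
    (fcw, fsw, ftw, fop, fip, fcs,
     fdp, fds, mxcsr, mxcsr_mask) = HEADER.unpack_from(image, 0)
    return {
        "FCW": fcw, "FSW": fsw, "FTW": ftw, "FOP": fop,
        "FPU_IP": fip, "FPU_CS": fcs, "FPU_DP": fdp, "FPU_DS": fds,
        "MXCSR": mxcsr, "MXCSR_MASK": mxcsr_mask,
        # Each ST slot is 16 bytes, of which the first 10 hold the value.
        "ST": [image[0x20 + 16 * n:0x20 + 16 * n + 10] for n in range(8)],
        "XMM": [image[0xA0 + 16 * n:0xA0 + 16 * n + 16] for n in range(8)],
    }
```

A dump captured from a debugger or emulator can be passed in directly, which makes it easier to compare what the microcode stores against the documented offsets.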
Overview
FXSAVE saves x87 FPU, MMX, and XMM register state to a 512-byte memory area. The instruction requires 16-byte alignment and performs validation before saving.
- Step 1: Alignment Validation
- UOP.94B validates memory operand address
- UOP.E41 loads effective address and checks alignment
- Test address & 0x0F - must be zero (16-byte aligned)
- If misaligned: jump to exception handler (loc_352C)
- Second UOP.94B validates end of region (+0x40 bytes)
- Step 2: Memory Region Setup
- Calculate end address: base + 511 (round to 512-byte boundary check)
- UOP.CCB prepares memory region for bulk store operations
- Step 3: FPU Control State (Offsets 0x00-0x1F)
- Offset +0x00: FCW (FPU Control Word) and FSW (FPU Status Word)
- Read CR_FSW_FCW control register
- Execute FNSTSW to get current status word
- Concatenate into 32-bit value and store
- Offset +0x04: FTW (Tag Word) and FOP (FPU Opcode)
- Read from internal CR registers (0x1A3, 0x1CA)
- Mask tag word with 0x7FF
- Concatenate and store
- Offset +0x08: FIP (FPU Instruction Pointer) - high 32 bits
- Offset +0x0C: FCS (FPU Code Segment) - read from LINSEG field 9
- Offset +0x10: FDP (FPU Data Pointer)
- Read CR_FDP
- Adjust based on CR_FOP bit 30 (subtract 8 if set)
- Subtract segment base (field 11)
- Offset +0x14: FDS (FPU Data Segment) - read from LINSEG field 10
- Offset +0x18: MXCSR (SSE Control/Status Register)
- Read CR_MXCSR if SSE enabled
- Offset +0x1C: MXCSR_MASK
- Store constant 0x7F (indicates supported MXCSR bits)
- Step 4: Path Selection for FPU Register Saves
- Read ROB_CR_BKUPTMPDR6 and test bit 2
- If bit 2 set: Use alternate path (part_macro_fxsave_alt)
- Uses FSTAXMM.M40.DSZ64.1 (locked/streaming stores)
- Optimized for fast-string or streaming operations
- UOP.0D8 and UOP.1C9 called before register stores
- If bit 2 clear: Use normal path
- TRANSPORTUIP saves return address for subroutine
- UOP.CCD prepares for register stores
- Uses FSTA.M40.DSZ64.0 (normal stores)
- Step 5: x87 Register Stack (ST0-ST7, Offsets 0x20-0x9F)
- Each register occupies 16 bytes (10 bytes FP value + 6 reserved)
- For each ST(n) register:
- FSTRD_2: Extract first 64 bits (mantissa) and store at offset
- FSTRD_1: Extract remaining 16 bits (sign/exp) + padding and store at offset+8
- Offsets: ST0=0x20, ST1=0x30, ST2=0x40, ..., ST7=0x90
- Step 6: XMM Register Save (Offsets 0xA0-0x11F)
- Check CR4.OSFXSR (bit 9) - OS supports SSE
- If not set: Skip XMM save and exit
- If set: Jump to part_macro_fxsave_xmm
- Add 0xA0 to base address (TMPD)
- For each XMM0-XMM7:
- FSTRDXMM saves low 64 bits (XMML) and store at offset
- FSTRDXMM saves high 64 bits (XMMH) and store at offset+8
- Offsets: XMM0=0xA0, XMM1=0xB0, ..., XMM7=0x110
- Step 7: Exit
- Calculate next instruction address: EIP + displacement (REG.31)
- UOP.0D8 updates instruction pointer
- End of microcode sequence (EOM)
- Alternate path (ROB_CR_BKUPTMPDR6 bit 2 set):
- Uses locked/streaming stores to avoid cache pollution
- Identical data saved, different store mechanism
- UOP.1C9 called before register saves (purpose unknown)
- FSTAXMM.M40.DSZ64.1 used instead of FSTA.M40.DSZ64.0
- Optimized for bulk operations or fast-string mode
- 80-bit x87 registers are stored as 16 bytes in memory
- FSTRD_2 extracts and stages bytes 0-7 (64-bit mantissa)
- FSTRD_1 extracts and stages bytes 8-15 (sign/exponent + padding)
- Two consecutive 64-bit stores create the 16-byte slot
- Bytes 10-15 are reserved/padding per Intel specification
- 128-bit XMM registers are stored as two 64-bit halves
- FSTRDXMM extracts low 64 bits (XMML) for first store
- FSTRDXMM extracts high 64 bits (XMMH) for second store
- Each XMM register occupies 16 bytes in memory
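The two-store pattern described above can be sketched in Python. The helper names are hypothetical, and the code assumes that the six padding bytes of each x87 slot are written as zero, which the notes do not confirm (the Intel layout only marks them reserved).

```python
def split_x87(value80: int):
    """Split an 80-bit x87 value into the two 64-bit store payloads."""
    mantissa = value80 & 0xFFFFFFFFFFFFFFFF  # bytes 0-7, first store
    sign_exp = (value80 >> 64) & 0xFFFF      # bytes 8-9, zero-padded to 64 bits
    return mantissa, sign_exp


def x87_slot(value80: int) -> bytes:
    """16-byte memory image of one ST(n) slot (padding assumed zero)."""
    lo, hi = split_x87(value80)
    return lo.to_bytes(8, "little") + hi.to_bytes(8, "little")


def xmm_slot(value128: int) -> bytes:
    """16-byte memory image of one XMM register, stored as two halves."""
    lo = value128 & 0xFFFFFFFFFFFFFFFF
    hi = (value128 >> 64) & 0xFFFFFFFFFFFFFFFF
    return lo.to_bytes(8, "little") + hi.to_bytes(8, "little")


def st_offset(n: int) -> int:
    return 0x20 + 16 * n


def xmm_offset(n: int) -> int:
    return 0xA0 + 16 * n
```

As a worked case, the x87 encoding of 1.0 (sign/exponent 0x3FFF, mantissa 0x8000000000000000) produces the slot bytes `00 00 00 00 00 00 00 80 FF 3F` followed by six zero bytes.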
Weird LOAD/STA/STRD fp opcodes
restore
UROM_26CA TMP2 = UOP.BC9 (CONST_04+030 , TMPB , OA.8, U1.1)
UROM_26CC TMP3 = UOP.B49 (CONST_04+038 , TMPB , OA.8, U1.1)
UROM_26CD ST1 = UOP.0E3 (TMP3 , TMP2 )
UROM_26EE AL = UOP.849 (CONST_4 , TMPB , OA.9, U1.1, U4.00008001 /* U4:0000 0000 0000 0000 1000 0000 0000 0001 */)
UROM_26F0 TMP6 = UOP.849 (CONST_04+008 , TMPB , OA.9, U1.1, U4.00000001 /* U4:0000 0000 0000 0000 0000 0000 0000 0001 */)
save (strd is 430|380 = mask 7B0; sta is 0x800|380 = mask B80)
UROM_0869 STRD.DSZ32 (TMP4 , CONST_0 )
UROM_086A STA.M40.SC1.DSZ32(CONST_04+010 , TMPD , OA.8, U1.1)
UROM_087A UOP.7B3 (ST0 , CONST_0 ) // STRD for floating point!
UROM_087C UOP.F46 (CONST_04+020 , TMPD , OA.8, U1.1) // some odd STA?
UROM_087D UOP.5B3 (ST0 , CONST_0 ) // second part of ST0
UROM_087E UOP.F46 (CONST_04+028 , TMPD , OA.8, U1.1) // some odd STA?
...
UROM_1FA2 UOP.432 (AL , CONST_0 , U4.00008002 /* U4:0000 0000 0000 0000 1000 0000 0000 0010 */)
UROM_1FA4 UOP.F46 (CONST_4 , TMPD , OA.8, U1.1)
UROM_1FA5 UOP.432 (TMP6 , CONST_0 , U4.00000002 /* U4:0000 0000 0000 0000 0000 0000 0000 0010 */)
UROM_1FA6 UOP.F46 (CONST_04+008 , TMPD , OA.8, U1.1)
ENTER has imm16 (allocation) and imm8 (nesting).
- M_IMM loads what should be nesting level
- REG.31 might be allocation size
@macro_entry_ENTER:
    TMP1 := 0 // Initialize frame size counter
    // Push current EBP onto stack
    [SS:ESP - operand_size] := EBP // STA: PUSH EBP
    TMP2 := ESP - operand_size // TMP2 = new ESP after PUSH
    // Get nesting level from instruction immediate
    TMP0 := M_IMM // Assumed: first immediate (alloc_size)
                  // or REG.31 contains nesting level?
    TMP0 := TMP0 AND 0x1F // Mask to 0-31 valid range
    if (TMP0 == 0) goto @nesting_zero
@nesting_nonzero:
    // Nesting level >= 1: setup for frame pointer copying
    TMP5 := ESP - operand_size // TMP5 = temporary frame pointer
    TMP2 := TMP2 - operand_size // Reserve space for temp frame ptr
    TMP0 := TMP0 - 1 // Decrement nesting level
    if (TMP0 == 0) goto @nesting_one
    // Nesting level >= 2: copy previous display (frame chain)
    TMP3 := EBP - operand_size // TMP3 = source ptr (previous frames)
@frame_copy_loop:
    // Copy (nesting_level - 1) frame pointers from previous frame
    TMP4 := [SS:TMP3] // Load previous frame pointer
    TMP3 := TMP3 - operand_size // Move to next source frame ptr
    [SS:TMP2] := TMP4 // Push frame pointer to new frame
    TMP2 := TMP2 - operand_size // Adjust destination pointer
    TMP0 := TMP0 - 1 // Decrement loop counter
    if (TMP0 != 0) goto @frame_copy_loop
@nesting_one:
    // Nesting level was 1: push new frame pointer
    TMP0 := EIP + REG.31 // Calculate return address/offset?
                         // (REG.31 might be alloc_size immediate)
    UOP.0D8(TMP0) // Unknown - possibly update internal state
    [SS:TMP2] := TMP5 // Push temp frame pointer
    goto @finalize_frame
@nesting_zero:
    // Nesting level was 0: no frame pointers to copy
    TMP0 := EIP + REG.31 // Calculate address/offset
    UOP.0D8(TMP0) // Unknown operation
@finalize_frame:
    // Common exit: set new frame pointer and allocate local space
    UOP.C0B(TMP1, TMP2, SS) // Unknown - stack validation/limit check?
    EBP := ESP - operand_size // EBP = new frame pointer (points to saved EBP)
    ESP := TMP2 + TMP1 // ESP = final stack pointer after allocation
                       // (TMP1=0, so ESP=TMP2=final position)
- EBP - Current frame pointer
- ESP - Current stack pointer
- M_IMM - Allocation size (imm16) [or nesting in TMP0 load?]
- REG.31 - Nesting level (imm8) [speculation - or allocation size?]
- EIP - Current instruction pointer
- TMP0 - Nesting level counter / address calculation
- TMP1 - Frame size (always 0 in this sequence)
- TMP2 - New ESP tracker / destination pointer
- TMP3 - Source pointer for frame copy loop
- TMP4 - Temporary frame pointer value
- TMP5 - New frame pointer value (for nesting > 0)
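For comparison with the microcode pseudocode, here is a small Python model of the architectural ENTER behaviour as documented in the Intel SDM. It is not a transcription of the microcode: `regs` and `mem` are hypothetical stand-ins (a register dictionary and a word-addressed dictionary), and the allocation step is included even though the traced sequence leaves TMP1 at 0.

```python
def enter(regs: dict, mem: dict, alloc_size: int, nesting: int, opsize: int = 4):
    """Model ENTER imm16, imm8 on a dict-based stack (one word per address)."""
    level = nesting & 0x1F  # nesting level is taken modulo 32

    def push(value):
        regs["ESP"] -= opsize
        mem[regs["ESP"]] = value

    push(regs["EBP"])        # save caller's frame pointer
    frame_temp = regs["ESP"]  # becomes the new EBP
    if level > 0:
        # Copy (level - 1) display entries from the previous frame.
        for _ in range(level - 1):
            regs["EBP"] -= opsize
            push(mem[regs["EBP"]])
        push(frame_temp)      # last display entry: the new frame itself
    regs["EBP"] = frame_temp
    regs["ESP"] -= alloc_size  # reserve local storage
```

Running the model for nesting levels 0, 1 and 2 gives reference stack images that can be held against the @nesting_zero, @nesting_one and @frame_copy_loop paths.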
UOP.000(ALIAS.014, ALIAS.014) // Synchronization point
// Read mask register (xmm_sss) into TMP registers
TMP0 := MOVEFROMXMM(REG_xmm_sss_l) // Mask bytes 0-7 (low 64 bits)
TMP1 := MOVEFROMXMM(REG_xmm_sss_h) // Mask bytes 8-15 (high 64 bits)
// Validate destination address and prepare streaming store
UOP.F4B(EDI, offset=8) // Validate DS:[EDI] and DS:[EDI+8]
// Masked store of low 64 bits (bytes 0-7)
DS:[EDI+0..7] := REG_xmm_ddd_l & TMP0 // Write masked by TMP0
// Masked store of high 64 bits (bytes 8-15)
DS:[EDI+8..15] := REG_xmm_ddd_h & TMP1 // Write masked by TMP1
It's possible the code handles MASKMOVQ as well (see OA.8 in actual code, msrom-6d8 0x3394)
See www.felixcloutier.com/x86/maskmovdqu
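The pseudocode above writes the store as "source & mask". Architecturally, MASKMOVDQU is a byte-enable store: a byte is written only when the most significant bit of the corresponding mask byte is set, and all other destination bytes stay untouched. A minimal Python sketch of that documented behaviour, with hypothetical names and a bytearray standing in for memory:

```python
def maskmovdqu(memory: bytearray, addr: int, src: bytes, mask: bytes) -> None:
    """Store src[i] to memory[addr + i] only where mask[i] has bit 7 set."""
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("src and mask must both be 16 bytes")
    for i in range(16):
        if mask[i] & 0x80:
            memory[addr + i] = src[i]
```

Bytes whose mask MSB is clear keep their previous memory contents, which is the difference from a plain AND followed by a full 16-byte store.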
Info from CPUID 652
- Step 1: Special MSR Check (0x0C61-0x0C64)
- SUB(TMP0, 0x79): if zero → patch_load handler. MSR 0x79 = IA32_BIOS_UPDT_TRIG (microcode update trigger); this MSR has a dedicated fast path for loading microcode patches.
- Step 2: MSR Metadata Lookup via PLA (0x0C68-0x0C6D)
- CR_MSRPLA = MSR Programmable Logic Array (hardware lookup table)
- Address computation:
- TMP0 = (MSR_number << 4) | 8
- Write to CR_MSRPLA_ADDR
- Read CR_MSRPLA_DATA → TMP5
- TMP5 metadata format (decoded at 0x0C76-0x0C81):
- Bit 11: Route via CRBUS (inter-core bus)
- Bit 12: Write permission (if 0 → #GP fault)
- Bit 13: (unknown purpose)
- Bit 14: Address doubling (TMP4 << 1 if set)
- Bit 15: Check msrmap bitmap for OS-level permission
- Bits 8-?: Base register address for this MSR
- PLA allows Intel to map MSR numbers to internal control registers without hardcoding every MSR in microcode
- Step 3: Permission Validation (0x0C76-0x0C99)
- A. Hardware permission (bit 12): If not set → 0x0C78 jumps to msr_throw_gpf
- B. OS permission via msrmap (if bit 15 set):
  - Call read_msrmap at 0x0C89, 0x0C98
  - msrmap is a bitmap that the OS can configure
  - Each bit allows/denies a specific MSR
  - Checked at 0x0C8A, 0x0C99: AND(value, mask)
  - If forbidden bits set → msr_throw_gpf
- Step 4: Register Address Calculation (0x0C79-0x0C80)
- TMPD = (metadata >> 8) & 0xFF // Base address
- If bit 14 set:
- TMPD += (MSR_low_bits << 1) // Address doubling for MSR ranges
- Else: TMPD += MSR_low_bits
- TMPD &= CONSTROM.00F // Mask to valid range
- Step 5: MSR-Specific Handlers (0x0CA0-0x0CED)
- Dispatcher routes to specific code based on TMPD (final register address):
- 0x0CA0: TMPD >= 0xA0 (MTRR range 0xA0-0xDF). Variable MTRRs for memory type configuration.
- 0x0CA8: TMPD == 0xFE. PAT (Page Attribute Table) or MTRR capability register.
- 0x0CAD: TMPD < 0xA0 (lower range). Bit 9 of the MSR number is tested at 0x0CAD; if set, this may be the debug/control register range.
- 0x0CB2: TMPD in 0x00-0x9F, bit 0 of MSR set. Debug register write (DR0-DR7 or similar). Checks TMP1 against the CONSTROM.0C2 mask and writes to the control register via MOVETOCREG at 0x0CBC, with special handling for bit 11 (BTS/BTR at 0x0CB8-0x0CBA).
- 0x0CC2: TMPD == CONSTROM.009 (likely 0x1D9). IA32_DEBUGCTL (Debug Control MSR). Masks the value with CONSTROM.1E7 and CONSTROM.018, then writes to CONST.0E.18A at 0x0CCC.
- 0x0CCE: TMPD == 0x76. Platform-specific control register. Writes to CONST.0E.1B3 and CONST.0E.1B1 at 0x0CD1-0x0CD5.
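Steps 2 to 4 above can be summarised in a short Python sketch. The bit positions and the `(metadata >> 8) & 0xFF` base extraction are taken literally from these notes, even though that base field overlaps the flag bits. The value of CONSTROM.00F is not known, so the final mask is a required parameter rather than a guessed constant. All function names are hypothetical.

```python
def pla_lookup_address(msr_number: int) -> int:
    """Value written to CR_MSRPLA_ADDR before reading CR_MSRPLA_DATA."""
    return (msr_number << 4) | 8


def decode_metadata(tmp5: int) -> dict:
    """Decode the PLA metadata word as described in the notes."""
    return {
        "via_crbus": bool(tmp5 & (1 << 11)),
        "writable": bool(tmp5 & (1 << 12)),
        "unknown_13": bool(tmp5 & (1 << 13)),
        "double_address": bool(tmp5 & (1 << 14)),
        "check_msrmap": bool(tmp5 & (1 << 15)),
        "base": (tmp5 >> 8) & 0xFF,  # as stated in the notes; overlaps flags
    }


def target_register(tmp5: int, msr_low_bits: int, range_mask: int) -> int:
    """Compute TMPD; raises PermissionError where the microcode throws #GP."""
    meta = decode_metadata(tmp5)
    if not meta["writable"]:
        raise PermissionError("#GP: MSR not writable (bit 12 clear)")
    offset = msr_low_bits << 1 if meta["double_address"] else msr_low_bits
    return (meta["base"] + offset) & range_mask  # range_mask: CONSTROM.00F
```

The msrmap check (bit 15) is left out here because the bitmap format is not described in enough detail to model it.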
Microcode updates are delivered through a write to MSR 0x79 (IA32_BIOS_UPDT_TRIG), which is the interface described in the Intel Software Developer Manual. Inside wrmsr_core, the MSR address in TMP0 is compared against 0x79 before the normal MSRPLA dispatch path executes. On a match, control transfers to wrmsr_patch_load rather than performing a standard MSR write.
The first instruction of wrmsr_patch_load calls UOP.203, which is observed elsewhere in the microcode as a privilege level check. If the check fails, execution transfers to generic_macro_fault_gp, which raises #GP(0) via SIGEVENT. This matches the architecturally documented behavior that WRMSR from CPL greater than 0 raises a general protection fault.
The value in TMP1 at this point carries the linear address of the update data buffer. This is consistent with the architectural convention for IA32_BIOS_UPDT_TRIG where EAX holds the address of the update data in memory, though the exact register routing from the WRMSR decode path was not fully traced and should be treated as an inference.
@wrmsr_patch_load_scan
TMP7 is set to zero, initializing the error accumulator that will be used throughout the remainder of the procedure.
TMP0 is then loaded from internal control register CR[0x1BE], whose precise architectural role is not directly recoverable from this code. BTR is applied to bit 3 of TMP0. If bit 3 was already clear, carry is not set and execution falls through immediately to @wrmsr_patch_load_derive_auth_key.
If bit 3 was set, a 64-bit word is loaded from linear memory at the address in TMP1. Bit 8 of the low 32 bits of this word is then tested. If bit 8 is set, TMP0 is forced to zero and the loop restarts from the top, where BTR on a zero value will produce no carry and redirect to @wrmsr_patch_load_derive_auth_key on the next pass. If bit 8 is clear, TMP0 is computed as CONST.14.03C + EDX and execution jumps forward to @wrmsr_patch_load_init_msram_write, bypassing the authentication key derivation entirely.
The precise semantics of CR[0x1BE] bit 3 and the role of bit 8 in the memory word cannot be determined from this code alone without additional context about what writes those values. The branch to @wrmsr_patch_load_init_msram_write that skips authentication is notable and may represent a fast path for a previously validated blob, but this is speculative.
@wrmsr_patch_load_derive_auth_key
This routine constructs a stepping-specific authentication reference value.
Two consecutive 32-bit reads are performed from the internal microcode store bus. The address written to MS_CR_ADDR is formed by concatenating the byte at internal address CONST.0E.036 with the value 0xFC. The two successive reads from MS_CR_DATA yield a 64-bit seed value whose higher word is read into TMP0.
TMP0 is then rotated left by the value in CR_STEPPING. The result has 6 added to it, producing an intermediate value in TMPC. This is then added to the low 32 bits of the first 64-bit quadword loaded from the update buffer at [TMP1], and the sum is masked with 0x9C to yield the table index (the mask 0x9C keeps only bits 2, 3, 4 and 7, so the index is not a contiguous bit field). This index is passed to FREADROM, which reads a value from a table embedded in the microcode ROM itself. The result is stored in TMP6 as the expected authentication reference.
The rotation by CR_STEPPING is the mechanism that makes authentication stepping-specific. The same input data produces a different FREADROM index on different steppings, and therefore a different expected reference value. An update blob that passes authentication on one stepping will produce a divergent expected value on a different stepping and fail.
After the reference value is established, TMPB is set to 0x54 and TMP0 is loaded from CONSTROM.03C. These values serve as the MSRAM write count and starting address respectively, used in the subsequent phase.
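The index derivation described in this routine can be written out as a few lines of Python. This is a sketch that assumes 32-bit wraparound arithmetic and a rotate count taken modulo 32; the function names are illustrative and the FREADROM table itself is not modelled.

```python
MASK32 = 0xFFFFFFFF


def rol32(value: int, count: int) -> int:
    """Rotate a 32-bit value left; the count is reduced modulo 32."""
    count &= 31
    return ((value << count) | (value >> (32 - count))) & MASK32


def auth_table_index(seed_hi: int, stepping: int, header_lo: int) -> int:
    """FREADROM index: ((ROL(seed, stepping) + 6) + header_lo) & 0x9C."""
    tmpc = (rol32(seed_hi, stepping) + 6) & MASK32
    return ((tmpc + header_lo) & MASK32) & 0x9C
```

With the same seed and header, changing only the stepping generally changes the index, which is the stepping-specific behaviour the text describes.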
@wrmsr_patch_load_init_msram_write
TMP1 is advanced by 8, moving the buffer pointer past the header quadword. Two MS_CR_ADDR-based write destination values (0x1BC and 0x1BD, corresponding to MS_CR_ADDR and MS_CR_DATA respectively) are conditionally loaded into TMP9 using a MERGE and CMOV sequence that checks both TMP7 (the error accumulator) and TMP2 (a validity flag from the authentication output). If either indicates an error state, TMP9 is set to 0x1FF instead, which is a sentinel value that will cause subsequent writes to target a null sink rather than live MSRAM. This is the mechanism by which a partially authenticated update is prevented from corrupting MSRAM state even if the write loop continues executing.
@wrmsr_patch_load_msram_write_loop, @wrmsr_patch_load_commit_lo32, @wrmsr_patch_load_commit_hi32
The write loop loads 64-bit words from the update buffer at [TMP1] in sequence. For each quadword, the high 32 bits are extracted into TMPA.
If TMP2 indicates the carry flag is set (signaling the cryptographic authenticator has flagged a problem), the load is intercepted and control transfers to sub_patch_cryptfunc before the value is committed. Otherwise, the low 32 bits are written to MSRAM via MOVETOCREG(TMP9, TMP5) in @wrmsr_patch_load_commit_lo32, TMP5 is updated to TMPA, TMPB is decremented, and @wrmsr_patch_load_commit_hi32 writes the high 32 bits and advances TMP1 by 8. The loop continues until TMPB reaches zero.
sub_patch_cryptfunc
This subroutine is called from multiple points during both the MSRAM write phase and the CRBUS update phase. It implements a 37-round (loop count 0x25) stream authenticator.
On entry, TMP4 is loaded with a 64-bit value formed by concatenating TMPC and TMP4. Each round rotates TMPC right by 1. If the shifted-out bit was set (carry set after ROR), TMPC is XORed with TMP4; otherwise it is XORed with zero. This is a standard Galois LFSR construction where TMP4 holds the feedback polynomial. After 37 rounds, the high 32 bits of TMP4 are XORed into TMPC, then TMP5 is XORed in, then TMP6 (the reference value derived in @wrmsr_patch_load_derive_auth_key) is XORed in, yielding TMP0. TMP6 is then updated to the old TMP5, and TMP5 is updated to TMP0, advancing the running authentication state. The function returns via U_JMP_INDIR to the address saved in TMP3 by the calling TRANSPORTUIP instruction.
Because TMP6 holds the stepping-keyed reference value and it is folded into the running state on every call, every authenticated block in the update blob depends on all previous blocks and on the stepping identity of the target CPU.
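A rough Python sketch of one call to the authenticator, following the description above. Several details are assumptions and are marked as such in the code: 32-bit working registers, the feedback polynomial taken from the low 32 bits of TMP4, and the carry reflecting the bit rotated out. It illustrates the chaining structure only and has not been checked against real update data.

```python
MASK32 = 0xFFFFFFFF


def ror1(value: int) -> int:
    """Rotate a 32-bit value right by one bit."""
    return ((value >> 1) | ((value & 1) << 31)) & MASK32


def patch_cryptfunc(tmpc: int, tmp4: int, tmp5: int, tmp6: int, rounds: int = 0x25):
    """Return (new_tmp5, new_tmp6) after one authenticator call."""
    poly = tmp4 & MASK32  # assumption: polynomial is the low dword of TMP4
    for _ in range(rounds):
        carry = tmpc & 1  # bit that ROR moves out of position 0
        tmpc = ror1(tmpc)
        if carry:
            tmpc ^= poly
    tmp0 = tmpc ^ ((tmp4 >> 32) & MASK32) ^ tmp5 ^ tmp6
    return tmp0, tmp5     # TMP5 := TMP0, TMP6 := old TMP5
```

Because each call returns the state for the next one, feeding blocks through in sequence reproduces the dependency chain: changing any earlier block changes every later TMP5 value.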
@wrmsr_patch_load_crbus_first_verify
After the MSRAM write loop completes, execution continues into the CRBUS update phase. The first entry is handled separately. TMPB has bit 8 cleared via BTR, and the result is used as an index into FREADROM, yielding an expected reference value. This expected value is subtracted from TMP5 (the current authenticator state), and the result is ORed into TMP7. If TMP5 matched the expected value, the subtraction yields zero and TMP7 is unchanged; any mismatch leaves a nonzero residue that will propagate through TMP7 for the remainder of the procedure.
@wrmsr_patch_load_crbus_rmw_loop, @wrmsr_patch_load_crbus_load_addr, @wrmsr_patch_load_crbus_apply_mask, @wrmsr_patch_load_crbus_or_newval, @wrmsr_patch_load_crbus_auth_commit
The CRBUS update phase applies a sequence of authenticated read-modify-write operations to internal control bus registers.
@wrmsr_patch_load_crbus_rmw_loop loads the next 64-bit word from the buffer. If sub_patch_cryptfunc signals an authentication failure the write is suppressed. Otherwise @wrmsr_patch_load_crbus_load_addr extracts the low 16 bits of the word into TMP9 as the target CRBUS address and advances to the next quadword for the mask operand.
@wrmsr_patch_load_crbus_apply_mask reads the current value of the CRBUS register at the address in TMP9. If TMP5 is nonzero a fixed sentinel address 0x16F is used instead, effectively reading from a safe location rather than the intended target. The value read is ANDed with the mask from TMP5 to produce a masked current value in TMP8.
@wrmsr_patch_load_crbus_or_newval ORs the new value bits into TMP8. TMPB is decremented and the next quadword is loaded for the authentication check.
@wrmsr_patch_load_crbus_auth_commit performs the final per-entry authentication: it reads an expected value from FREADROM using the current buffer word, subtracts TMP5, and ORs any discrepancy into TMP7. If authentication passed and TMP7 is still zero, MOVETOCREG(TMP9, TMP8) commits the masked write to the CRBUS register. The loop then continues from @wrmsr_patch_load_crbus_rmw_loop until TMPB reaches zero.
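The per-entry read-modify-write with a sticky error accumulator can be sketched as follows. This is a simplified Python model: `crbus` is a hypothetical dictionary standing in for the control-register bus, the sentinel read is gated on the error accumulator (an assumption; the notes name TMP5 for that condition), and the order of the authentication check relative to the read follows the prose only loosely.

```python
MASK32 = 0xFFFFFFFF
SINK_ADDR = 0x16F  # sentinel address mentioned in the notes


def crbus_rmw(crbus: dict, addr: int, mask: int, new_bits: int,
              auth_state: int, expected: int, err: int) -> int:
    """Apply one authenticated RMW entry; returns the updated error value."""
    source = addr if err == 0 else SINK_ADDR        # assumption: gated on TMP7
    value = (crbus.get(source, 0) & mask) | new_bits
    err |= (expected - auth_state) & MASK32          # zero only on a match
    if err == 0:
        crbus[addr] = value                          # MOVETOCREG(TMP9, TMP8)
    return err
```

Once `err` becomes nonzero it never returns to zero, so a single failed comparison suppresses every later commit, matching the role TMP7 plays in the description.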
The CRBUS writes produced by this loop include writes to MS_CR_MATCHPATCH0 (0x1B8), MS_CR_MATCHPATCH1 (0x1B9), and MS_CR_MATCHPATCH2 (0x1BA). These registers hold the microcode fetch addresses that the microcode sequencer will intercept and redirect to the newly loaded patch content. Writing these registers is therefore the act that activates the patch. Any previously installed patch whose match addresses are overwritten by this write sequence is implicitly deactivated at the same moment, since the match registers no longer point to it.
Error Recovery: @wrmsr_patch_load_invalidate_retry
If the TMP2 carry flag indicates an authentication failure at the end of the CRBUS write phase, execution reaches @wrmsr_patch_load_invalidate_retry. The value 0x3B0 (0xEC shifted left by 2) is subtracted from TMP1, rewinding the buffer pointer, and execution jumps back to @wrmsr_patch_load_derive_auth_key to attempt re-authentication from a different offset. This path also sets a bit in TMP2 via BTS to force sub_patch_cryptfunc to produce a known initial state before the retry.
If TMP7 is nonzero at the end of the CRBUS phase (meaning at least one authenticated comparison failed), a separate error exit path at 0x18F5 is taken before reaching @wrmsr_patch_load_invalidate_retry. This path writes 0x1FF to MS_CR_MATCHPATCH0, MS_CR_MATCHPATCH1, and MS_CR_MATCHPATCH2. 0x1FF is the all-ones value for the 9-bit match register field, which the sequencer treats as no-match, meaning no microcode address will be intercepted. An additional internal register at 0x1BB is also written with 0x1FF. This sequence ensures that any partial CRBUS state written during a failed update attempt is fully neutralized before the handler returns, leaving the CPU in the same functional state as if no update had been attempted.
Pentium M
IRET entry
│
├─[size check fail]───────────────────────────────→ #GP
│
├─[NT=1]──→ @macro_iret_tss_link
│           → sub_tss_save (save all regs to old TSS)
│           → @tail_tss_load_continue (clear busy bit in GDT)
│           → tail_tss_load (load all regs from new TSS)
│             ├─[CR3 changed] → sub_tlbflush_and_a20
│             │                 ├─[PAE] → sub_pae_pdpte
│             │                 └─────────────────────────┐
│             ├─[descriptor invalid]→ #GP                 │
│             └─────────────────────→ @macro_iret_exit
│
├─[pop EIP/CS/EFLAGS] + [EFLAGS merge]
│   ├─[illegal flags]───────────────────────────→ #GP
│   ├─[VM=1]────────→ @macro_iret_v86_return
│   │                 → loc_209E → loc_20A6 → @macro_iret_exit
│   └─[VM=0]────────→ @macro_iret_samepriv
│                     ├─[bad CS]────────────→ #GP
│                     ├─[priv change]
│                     │   → @macro_iret_privchange
│                     │   → loc_20A6 → @macro_iret_exit
│                     └─[same priv]
│                         → EOM: SIGEVENT 0xE7 (done)
│
└─ @macro_iret_exit: UOP.0D8 commits new EIP to front-end.
How It Works
Based on macro_iret-msrom-6d8.asm.
Subroutine names reflect labels in the microcode listing. All paths originate at UROM_1BE0 (macro_iret). Indented "→ sub" means a TRANSPORTUIP-based microcode call (callee returns via JMP_INDIR back to the encoded return address).

==========================================================
PART 1: ENTRY AND DISPATCH
==========================================================

macro_iret (UROM_1BE0)
-----------------------
1. Read an internal size/mode state via UOP.204(CR_at_0E.004). Subtract REG_OP_Size (the current operand-size attribute). If the result is zero: the frame size is inconsistent → #GP. [This checks that the instruction encoding matches the current stack-frame size expectation.]
2. Write CR_SMM_status = 4 (marks IRET-in-progress for SMM interaction).
3. Read current EFLAGS via UOP.208(TMPA). Test bit 14 (NT flag). If NT=1 → jump to @macro_iret_tss_link (hardware task return). Otherwise continue for normal stack-based IRET.

==========================================================
PART 2: NORMAL IRET (no NT, no task switch)
==========================================================

Stack frame pop (UROM_1BEA)
----------------------------
1. Compute stack-pointer stride (TMP9) for 16/32-bit operation size.
2. Speculatively load three values from SS:ESP:
   TMP7 = new EFLAGS (DSZ? = operand-size wide)
   TMP2 = new CS selector (always 16-bit)
   TMP3 = new EIP (DSZ? = operand-size wide)
   These three are the architectural IRET frame (EIP, CS, EFLAGS in stack order low→high).

EFLAGS permission merge (UROM_1BF4)
--------------------------------------
Determines which EFLAGS bits this IRET may change, based on CPL and IOPL. The logic applies the SDM rules:
- CPL=0 may change any flag.
- CPL>0 but CPL≤IOPL may change IF.
- CPL>IOPL may not change IF or IOPL.
- VM flag changes are restricted to CPL=0.

Implementation steps:
a. Load base permission mask 0x00254FD5 from ROM.
b. UOP.202: select/merge with alternate mask 0x00254DD5 (differs in the IF bit position).
c. UOP.203(CONST.14.13E): apply IOPL/CPL comparison from internal state table to further restrict the mask.
d. UOP.209(CONST.14.0BD): apply VM-mode filter.
e. AND result to 0x1FF (lower EFLAGS bits only, this pass).
f. UOP.204(CONST.14.125): final mask read/merge from state table.
g. Separately compute the VM-flag transition permission:
   TMP8 = TMP7 << 10 (shift EFLAGS copy)
   TMPB = UOP.204(TMP8, TMP7) & 0x00080000 (isolates the VM bit change: old VM XOR new VM → detect 0→1)
   BTR TMP7, bit 19 (clear VM in the to-be-committed EFLAGS copy)
   TMP7 = TMPB | TMP7 (re-insert corrected VM)
   [Note: 0x00080000 is bit 19, which is VIF in the architectural EFLAGS layout (VM is bit 17), and IF (bit 9) shifted left by 10 lands on bit 19, so this step may be VIF rather than VM handling.]
h. Final merge: writable = TMP5 (mask), frozen = ~TMP5
   TMP7 = (TMP5 AND new_flags) OR (~TMP5 AND SystemFlags)
i. Check that VIF|VIP are legal in the result; if both are set → #GP.
j. Commit: SystemFlags = TMP7.

VM check (UROM_1C0D):
UOP.201 extracts VM flag from TMP7. If VM=1 in new EFLAGS → jump to @macro_iret_v86_return. Otherwise → jump to @macro_iret_samepriv.

==========================================================
PART 3: SAME-PRIVILEGE PROTECTED-MODE RETURN
==========================================================

@macro_iret_samepriv (UROM_2E58)
----------------------------------
TMPC = 0 (clear, means same-privilege). TMP2 = new CS selector (already loaded from stack in Part 2).
1. Load the 64-bit GDT entry for TMP2 from GDTR.
2. Load the 64-bit LDT entry for TMP2 from LDTR.
3. UOP.263: select GDT or LDT entry based on TI bit in selector.
4. USEGOP4(type=0x59): validate as a code segment descriptor. If overflow (bad descriptor) → UROM_3A00 (#GP).
5. UOP.CC1: signal pending CS load to segment unit.
6. Advance ESP: ESP_20 += TMP9 (pop the IRET frame off the stack).
7. Test REG.37 bit 18 (privilege-change flag): If set → jump to @macro_iret_privchange (outer ring, need SS:ESP pop).
8. UOP.62A(0E.102, TMPB): commit new CS descriptor into descriptor cache.
9. UOP.0D8(TMP0, TMP3): redirect instruction fetch to new CS:EIP.
10. UOP.0D4(0, 0): finalize redirect (pipeline signal).
11. Update REG.31 with new CS generation counter.
12. REG.37 = 0xFF (reset internal state flags).
13. USEGOP2(CS, type 4): mark CS as accessed.
14. EOM: SIGEVENT(TMP3, 0xE7) → instruction architecturally complete.

==========================================================
PART 4: INTER-PRIVILEGE RETURN (outer ring)
==========================================================

@macro_iret_privchange (UROM_209D)
------------------------------------
Reached when the new CS selector has a higher RPL than the current CPL (returning to less-privileged code). After the normal CS pop the stack also contains SS:ESP for the outer ring.
1. UOP.62A(0E.102, 0E.102): commit pending CS state.
2. Read new CS RPL from descriptor cache → update REG.31 generation.
3. REG.37 = 0xFF (reset state).
4. USEGOP2(CS, type 4): finalize CS descriptor cache.
5. USEGOP2(SS, type 5): finalize SS descriptor cache (already loaded from the extended inter-privilege frame on the stack). [The SS:ESP values are loaded earlier in the stack-pop sequence via @macro_iret_v86_return or the priv-change branch of samepriv.]
Falls through to loc_20A6.

loc_20A6 (UROM_20A6): task-switch / inter-priv serialization
----------------------------------------------------------------
1. STRD(0,0) + UOP.134: pipeline drain and memory fence.
2. Read internal reg 0E.090, set bit 1 (mark serialization in progress).
3. Spin loop (loc_20AD): poll internal reg 0E.022 bit 5 until set (serialization acknowledged by hardware).
4. Fall through to @macro_iret_exit.

@macro_iret_exit (UROM_20B1)
------------------------------
1. Fl2.Fl3: SIGEVENT(TMP3, 0xE7) → instruction architecturally complete.
2. UOP.20A(CONST.14.109): read VIF/VIP permission mask.
3. AND with committed SystemFlags; subtract 0x00180000 (VIF|VIP mask). If zero → #GP (VIF/VIP inconsistency in final check).
4. EOM.Fl2: UOP.0D8(0, TMP3) → redirect front-end to new EIP. (This is the task-switch exit EIP commit.)
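The final EFLAGS merge from Part 2 (step h) and the VIF/VIP check in @macro_iret_exit reduce to a few bit operations. The sketch below uses the two mask constants quoted in these notes; the function names are hypothetical, and the permission logic that chooses between the masks (CPL, IOPL, VM mode) is deliberately left out.

```python
MASK_WITH_IF = 0x00254FD5  # base permission mask from ROM
MASK_NO_IF = 0x00254DD5    # alternate mask, identical except for IF (bit 9)
VIF = 1 << 19
VIP = 1 << 20


def merge_eflags(current: int, popped: int, writable: int) -> int:
    """TMP7 = (mask AND new_flags) OR (NOT mask AND SystemFlags)."""
    return (popped & writable) | (current & ~writable & 0xFFFFFFFF)


def vif_vip_conflict(flags: int) -> bool:
    """True when VIF and VIP are both set (the 0x00180000 comparison)."""
    return (flags & (VIF | VIP)) == (VIF | VIP)
```

The two masks differ only in bit 9, so choosing `MASK_NO_IF` leaves the current IF untouched no matter what the popped EFLAGS image contains.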
==========================================================
PART 5 - RETURN TO V86 MODE (VM=1 in new EFLAGS)
==========================================================
@macro_iret_v86_return (UROM_2C95)
-------------------------------------
Reached when new EFLAGS has VM=1 (returning to a Virtual-8086 task).
The interrupt frame for a V86 return is larger than the standard 3-word
frame: it additionally contains on the stack (high to low):
  GS, FS, DS, ES, SS, ESP, EFLAGS, CS, EIP

  1. Save old ESP (TMP5 = ESP).
  2. BTS REG.37 bit 17 (mark V86 transition).
  3. Advance ESP_20 by TMP9 (skip past the basic EIP/CS/EFLAGS).
  4. Pop additional frame words from SS:ESP:
       TMP4 = new ESP (operand-size wide)
       TMP8 = new SS selector (16-bit)
       TMP1 = new ES selector
       TMP0 = new DS selector
       TMP9 = new FS selector
       TMP6 = new GS selector
  5. Build CS descriptor:
       WRSEGFLD(CS, selector=TMP2)
       USEGOP4(CS, type 9) (V86 code segment)
       UOP.62A(0E.102, TMPB): commit CS.
  6. Load DS, ES, FS, GS via USEGOP3/USEGOP1 (real-mode style, no
     descriptor validation; V86 segments are base=selector*16).
  7. TMP3 = zero-extended TMP3 (EIP is 16-bit in V86).
  8. ESP_OPSZ = TMP4 (restore new outer-ring ESP).
  9. CR_CPL = 3 (V86 mode always runs at CPL=3).
  10. Load SS: WRSEGFLD(SS, TMP8); USEGOP4(SS, type 0x0A).
  11. Jump to loc_209E (continue at inter-privilege finalization).

==========================================================
PART 6 - HARDWARE TASK SWITCH (NT=1)
==========================================================
Overview:
IRET with EFLAGS.NT=1 is not a stack-pop but a full hardware task switch
back to the previous task (identified by the back-link field at offset 0
of the current TSS). Three subroutines are involved:
  sub_tss_save             : save current CPU state into current TSS
  @tail_tss_load_continue  : clear busy bit of old TSS, then load new TSS
  tail_tss_load            : load new task state from new TSS

@macro_iret_tss_link (UROM_21BD)
-----------------------------------
  1. Set up REG.37 task-switch flags.
  2. Read current TR selector.
  3. Load word at TR:0 (the back-link field = previous task's TSS selector).
  4. Load the GDT descriptor for the back-link selector.
  5. USEGOP4(type=0xA2): validate it is a TSS descriptor.
  6. TRANSPORTUIP → @tail_tss_load_continue:
     Encode the return address for after sub_tss_save+tail_tss_load_continue
     completes. This is the microcode "call" mechanism.
  7. TMP4 = EIP_30 + REG.31 (current EIP for saving into TSS).
  8. TMP9 = TMPA & 0x003F3FD7 (mask current EFLAGS for saving).
  9. CR_SCP15 = 16 (record current operand size for TSS stride calculation).
  10. Jump to sub_tss_save.

------------------------------------------------------------------------
SUBROUTINE: @tail_tss_load_continue (UROM_3100)
------------------------------------------------------------------------
Clears the busy bit (B-bit) in the old task's TSS GDT descriptor.
This marks the old task as no longer running, allowing it to be entered
again in the future.

  1. TMP2 = current TR selector value (old task).
  2. TMP7 = TMP2 & 0x1F8 (byte offset of descriptor in GDT = index × 8,
     lower bits masked off).
  3. UOP.F0B(CONST.6, TMP7, GDTR): acquire bus lock on the GDT entry.
  4. STRD(0,0) + UOP.134: drain store buffer, serialize pipeline.
  5. XLOAD.DSZ64.1(GDTR, TMP2): atomically load 64-bit old-TSS descriptor.
  6. Extract high 32 bits; BTR bit 9 (= bit 41 of the descriptor = B bit
     in the type field of a TSS descriptor, SDM §3.5).
  7. Recombine 64-bit descriptor with B-bit cleared.
  8. STRD(modified_descriptor) + STA.DSZ64.1(GDTR, TMP7): atomically write
     back the modified descriptor.
  9. TMP6 = new-task TSS base address (from USEGOP2 on the back-link selector).
  10. TMP5 = 0 (clear temporary).
  11. Jump to tail_tss_load.

------------------------------------------------------------------------
SUBROUTINE: tail_tss_load (UROM_1056)
------------------------------------------------------------------------
Loads the new task's CPU state from its TSS and performs final setup.
(Also the target of TRANSPORTUIP return from @tail_tss_load_continue.)

Setup:
  1. CR_SCP14 = EIP_30 (save current EIP for debug).
  2. Set DebugCtlMSR.BTF (bit 14): arm Branch Trap Flag so the first branch
     in the new task triggers a #DB if single-step was active.
  3. REG.27 = 0 (will accumulate new TSS base).
  4. Read new-task TSS descriptor type field (TMPC).
  5. BTEST TMPC bit 11: 0 = 286-style (16-bit) TSS, 1 = 386+ (32-bit) TSS.
  6. TMP9 = stride: 2 (16-bit TSS) or 4 (32-bit TSS).
  7. Save TSS-type flag to CR 0E.109.
  8. Spin-wait (ALIAS.065): wait for TSS read access.
  9. UOP.1CA: synchronization point.
  10. Compute TSS body start offset TMP7 (depends on 16/32-bit TSS type).
  11. If 32-bit TSS: load CR3 from TSS offset 0x1C → TMP8.
  12. Sequential load loop from new-task TSS body:
      Each field: LOAD 16-bit value from TR:TMP7; TMP7 += TMP9.
      Fields loaded in order:
        TMP3 = new EIP → SIGEVENT(TMP3, 0x5C) commits EIP_30.
        TMP4 = new EFLAGS (lower 16 bits)
        EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI
          (each: zero-fill upper half via MOVE(0x1FF), then load low word)
        SystemFlags = EFLAGS merge (BTR RF, then set from TMP5/TMP4)
        CR0 = BTS(CR0, bit 1) (set MP flag for new task)
        ES selector → load and write ES descriptor fields
        CS selector → extract RPL → write to CR_CPL → write CS descriptor
        SS selector → write SS descriptor
        DS selector → write DS descriptor
        If 32-bit TSS: FS selector, GS selector
        If 16-bit TSS (286): skip FS/GS → force FS=GS=null at loc_10CD.
        LDTR selector → write LDTR descriptor (type 7 = LDT)

CR3 / paging check: (UROM_10DD)
  Only if 32-bit TSS and CR0.PG=1:
    If new CR3 == current CR3 → skip (no TLB flush needed).
    Otherwise write new CR3, then:
      TRANSPORTUIP → loc_10F0 (encode return address for after flush)
      Check CR4 for PAE or PSE:
        If PAE/PSE active → call sub_pae_pdpte (validate PDPTEs first).
        Otherwise fall into loc_10F0 directly.

loc_10F0: (UROM_10F0)
  TRANSPORTUIP → loc_10F5 (encode second-level return address).
  Jump to sub_tlbflush_and_a20.
  Return here after TLB flush; re-read TSS type into TMPC.

Descriptor validation: (UROM_10F8)
  After loading and (if needed) flushing TLB, validate all loaded segment
  descriptors against the GDT/LDT:
  1. USEGOP2(CS), USEGOP2(SS): verify CS/SS descriptor cache consistency.
  2. Load LDTR → validate LDT descriptor (type 0xE7).
  3. Commit SystemFlags = TMP5 (new EFLAGS fully applied).
  4. Compute new effective CPL from CS.RPL and EFLAGS.IOPL; write CR_CPL.
  5. Validate SS descriptor:
       Load 64-bit from GDT+LDT; UOP.263 selects correct table.
       USEGOP4(type=0x2A): must be a writable data segment.
       USEGOP2(SS, type 5): finalize SS descriptor cache.
  6. Validate DS: load both GDT+LDT entries; UOP.263 selects.
       USEGOP3/USEGOP1(DS, type 0x10): data segment.
  7. Validate ES: same pattern.
  8. If 32-bit TSS: validate FS, GS the same way.
  9. Validate CS (loc_114A):
       Load both GDT+LDT entries; UOP.263 selects.
       USEGOP4(type=0x79): must be a code segment.
       USEGOP2(CS, type 4): mark accessed.
       UOP.62A(0E.102, TMPB): commit CS descriptor.
       UOP.CC1: signal CS load.
       Compute REG.31 (CS descriptor generation counter).

Error-code push check: (UROM_1160)
  Read CR_SCP15 bits [5:4] = saved IOPL comparison.
  If the task switch was triggered by an exception that has an error code
  (IOPL bits indicate this), push the error code onto the new stack:
    If 32-bit TSS: push 32-bit error code, ESP -= 4.
    If 16-bit TSS: push 16-bit error code, ESP -= 2.

Debug register finalization: (UROM_1175)
  1. DR7 &= 0xFFFFFEAA: clear bits 0, 2, 4, 6 and 8, i.e. all local
     breakpoint-enable bits (L0-L3) plus LE.
     New task starts with local hardware breakpoints disabled.
  2. If 32-bit TSS: load word at TR:0x64 (T-bit flag from TSS).
     If T-bit set: set DR6.BT (bit 15) → a #DB will fire on first
     instruction of the new task (TSS debug-trap feature, SDM §7.3.1).
  3. ALTDR6 = (ALTDR6 | T-bit-in-DR6.BT) & ~bit14 (clear BS bit).
  4. CR_SMM_status = 1 (mark task-switch complete for SMM).
  5. BTR DebugCtlMSR bit 9 (clear BTF: armed at entry, now cleared).
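The debug-register finalization above is plain bit arithmetic, so a small Python sketch can make the masks concrete. It uses the architectural bit positions (DR6.BS = bit 14, DR6.BT = bit 15, DR7.L0-L3 = bits 0/2/4/6, DR7.LE = bit 8); the function names are hypothetical and the ALTDR6 handling is only modelled as an ordinary 32-bit value.

```python
DR7_TASK_SWITCH_MASK = 0xFFFFFEAA   # constant from step 1
DR6_BS = 1 << 14                    # single-step status
DR6_BT = 1 << 15                    # task-switch (T-bit) status


def cleared_bits(mask, width=32):
    """List the bit positions that an AND with `mask` forces to zero."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


def finalize_debug_regs(dr7, dr6, tss_t_bit):
    """Model steps 1 to 3: drop local enables, record T-bit, clear bit 14."""
    dr7 &= DR7_TASK_SWITCH_MASK
    if tss_t_bit:
        dr6 |= DR6_BT
    dr6 &= ~DR6_BS & 0xFFFFFFFF
    return dr7, dr6


if __name__ == "__main__":
    print(cleared_bits(DR7_TASK_SWITCH_MASK))
    print([hex(v) for v in finalize_debug_regs(0x3FF, DR6_BS, True)])
```

Listing the cleared positions is a quick way to sanity-check any mask constant found in the listing against the register layout in the SDM.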
Final dispatch: (UROM_118E)
  Test REG.37 bit 16 (was this triggered by a bus-lock / serialization path?):
    If set → loc_20A6 (serialization loop).
    Otherwise → @macro_iret_exit.

==========================================================
FAULT EXIT
==========================================================
@macro_iret_fault_gp (UROM_35BA)
-----------------------------------
Reached from:
  - Entry size check failure (1BE5)
  - NT-flag stack-frame validation failure (1BF2)
  - Illegal EFLAGS bits (IOPL/VM) after merge (1C0C)
  - Final VIF/VIP consistency check failure (20B6)
  - Any descriptor validation overflow (via UROM_3A00)
  - PAE PDPTE reserved-bit violation (via loc_352C at 352C, which falls
    through to 35BA's pattern or signals independently)

  1. TMP7 = 0xC1 (exception code: #GP vector 0x0D in P6 internal format).
  2. TMP6 = UOP.120(0x0D, 0x0D): prepare exception descriptor for #GP.
  3. EOM: SIGEVENT(TMP6, TMP7): deliver #GP(0) to the exception handler.
     Microcode execution ends; hardware begins exception delivery.
sub_tss_save
From Pentium M 6D8
Saves all current CPU architectural state into the current (old) task's TSS.

  1. Reset REG.37 flags for save operation.
  2. Check new-task TSS descriptor size (16-bit or 32-bit).
     UOP.CC1: initialize sequential-write pointer into new-task TSS body.
  3. Check current (old) TR descriptor size to determine field stride:
     stride TMPA = 2 (for 286-style 16-bit TSS) or 4 (32-bit TSS).
  4. Spin-wait (ALIAS.178): wait for TSS write access to be granted.
  5. UOP.1CA: synchronization point after spin.
  6. Check REG.37 bit 21: if already saved (re-entry guard) → skip to loc_27CE.
  7. BTS REG.37 bit 21 (mark save-in-progress).
  8. Sequential store loop into current TSS body (via STRD+STA pairs):
     Each field: STRD stages the register value; STA writes it to TR:TMP7;
     then TMP7 += TMPA.
     Fields written in order (matches SDM Table 7-1):
       Offset 0x20 (stride-dependent): EIP
       Offset +: EFLAGS (masked via TMP9)
       Offset +: EAX
       Offset +: ECX
       Offset +: EDX
       Offset +: EBX
       Offset +: ESP
       Offset +: EBP
       Offset +: ESI
       Offset +: EDI
       Offset +: ES selector (RDSEGFLD → STRD → STA)
       Offset +: CS selector
       Offset +: SS selector
       Offset +: DS selector
       If 32-bit TSS:
         Offset +: FS selector
         Offset +: GS selector
  9. BTR REG.37 bit 21 (clear save-in-progress).
  10. loc_27CE: set up read access to the new task's TSS.
      Read TMP2 = base of new-task TSS, TMP5 = descriptor type.
      UOP.CC9 × 2: initialize sequential-read pointer into new-task TSS body.
  11. JMP_INDIR TMPC: return via TRANSPORTUIP to @tail_tss_load_continue.
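The stride-driven store loop in step 8 can be sketched in Python to show which TSS offset each field lands on. The 32-bit start offset 0x20 comes from the notes above; the 16-bit start offset 0x0E is taken from the architectural 286 TSS layout and is an assumption here, since it is not visible in the listing. The function and list names are hypothetical.

```python
# Field order as written by the store loop (step 8).
COMMON_FIELDS = [
    "EIP", "EFLAGS", "EAX", "ECX", "EDX", "EBX", "ESP", "EBP",
    "ESI", "EDI", "ES", "CS", "SS", "DS",
]
FIELDS_32BIT_ONLY = ["FS", "GS"]


def tss_save_offsets(is_32bit):
    """Return {field: byte offset} for the sequential TSS save.

    Models "STA to TR:TMP7; TMP7 += TMPA" with TMPA = 4 or 2.
    """
    stride = 4 if is_32bit else 2
    # 0x20 is from the notes; 0x0E is assumed from the 286 TSS layout.
    offset = 0x20 if is_32bit else 0x0E
    fields = COMMON_FIELDS + (FIELDS_32BIT_ONLY if is_32bit else [])
    offsets = {}
    for name in fields:
        offsets[name] = offset
        offset += stride
    return offsets


if __name__ == "__main__":
    for is_32bit in (True, False):
        print("32-bit TSS" if is_32bit else "16-bit TSS")
        for name, off in tss_save_offsets(is_32bit).items():
            print(f"  {off:#04x}: {name}")
```

For the 32-bit case this reproduces the familiar layout (ES at 0x48, GS at 0x5C), which is a convenient cross-check that the field order in the listing was read correctly.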
sub_tlbflush_and_a20
From Pentium M 6D8
Called via TRANSPORTUIP (return address in TMPE) when CR3 changes.

  1. STRD(0,0) + UOP.131(LINSEG): perform full TLB flush.
  2. Write internal CR 0E.154 = 0x20 (TLB-flush-complete marker).
  3. SIGEVENT(0xCC): signal TLB flush completion to the memory subsystem.
  4. Read CR_A20MASK; BTR bit 7; BTR bit 16:
     Reset A20 gate override bits, so the new task uses normal A20 state.
  5. Write back CR_A20MASK.
  6. JMP_INDIR TMPE → return to TRANSPORTUIP caller (loc_10F0).
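Steps 4 and 5 are a read-modify-write with two BTR operations. A minimal Python sketch, assuming CR_A20MASK behaves like a plain 32-bit value (its real layout is not documented here) and using hypothetical helper names:

```python
def btr(value, bit):
    """Bit-test-and-reset without the carry result: clear one bit."""
    return value & ~(1 << bit) & 0xFFFFFFFF


def reset_a20_overrides(a20_mask):
    """Clear bits 7 and 16, as in step 4 of sub_tlbflush_and_a20."""
    return btr(btr(a20_mask, 7), 16)


if __name__ == "__main__":
    # Illustrative input only; not a value taken from the microcode.
    print(hex(reset_a20_overrides(0xFFFFFFFF)))
```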
sub_pae_pdpte
From Pentium M 6D8
Called when CR4.PAE or CR4.PSE is set and CR3 changes.
Loads and validates the four Page Directory Pointer Table Entries.

  1. Set "PAE in progress" flag (internal CR 0E.054 bit 7).
  2. Combine new CR3 page-frame (from TMP2 & 0xFFFFF000) with low bits of
     current internal PAE register → write to internal CR 0E.051.
  3. TMPC = 4 (loop counter: 4 PDPTEs to validate).
  4. TMP1 = 0x162 (base index of PDPTE storage CRs).
  5. TMP0 = new CR3 & 0x00000FE0 (byte offset into PDPT structure).
  6. Validation loop (loc_2AF1):
       TMP8 = LOAD.DSZ64.0(LINSEG, TMP0) ; load one 64-bit PDPTE
       TMP0 += 8
       Check TMP8 & 0x000001E6:
         0x1E6 = bits 8:5 and 2:1 of the PDPTE = reserved bits that must be 0.
         If any set → jump to loc_352C (#GP, reserved bit violation).
       Store low 32 bits: MOVETOCREG(TMP1, TMP8); TMP1 += 1.
       Check high 32 bits & 0x1FF:
         Upper bits of PDPTE also reserved.
         If any set → jump to loc_352C (#GP).
       Store high 32 bits: MOVETOCREG(TMP1, TMP6); TMP1 += 1.
       TMPC -= 1; loop if not zero.
  7. Copy loop (loc_2B05): copy all 8 32-bit halves from CRs 0x162-0x169
     to CRs 0x063-0x06A (final MMU-visible PDPTE register file):
       8 iterations: MOVETOCREG(0x63+i, MOVEFROMCREG(0x162+i)).
  8. BTR internal CR 0E.054 bit 7 (clear PAE-in-progress).
  9. JMP_INDIR TMP7 → return to TRANSPORTUIP caller (loc_10F0).
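The validate-then-copy structure of steps 6 and 7 can be sketched in Python. This is an illustration under stated assumptions, not the microcode itself: the four PDPTEs are passed in as already-loaded 64-bit integers, the control-register file is modelled as a dict, and the exception class and function name are hypothetical. The masks and CR indices are the ones quoted in the steps above.

```python
LOW_RESERVED = 0x000001E6    # bits 8:5 and 2:1 of the low dword
HIGH_RESERVED = 0x000001FF   # mask applied to the high dword in the notes
STAGING_CR = 0x162           # staging CRs 0x162..0x169
FINAL_CR = 0x063             # MMU-visible CRs 0x063..0x06A


class ReservedBitFault(Exception):
    """Stands in for the jump to loc_352C (#GP)."""


def load_pdptes(pdptes, cregs):
    """Validate four 64-bit PDPTEs into staging CRs, then copy them out."""
    if len(pdptes) != 4:
        raise ValueError("expected exactly four PDPTEs")
    index = STAGING_CR
    for entry in pdptes:
        low = entry & 0xFFFFFFFF
        high = (entry >> 32) & 0xFFFFFFFF
        if low & LOW_RESERVED:
            raise ReservedBitFault(f"low dword {low:#010x}")
        cregs[index] = low
        index += 1
        if high & HIGH_RESERVED:
            raise ReservedBitFault(f"high dword {high:#010x}")
        cregs[index] = high
        index += 1
    # Copy loop: only reached when all four entries passed validation.
    for i in range(8):
        cregs[FINAL_CR + i] = cregs[STAGING_CR + i]


if __name__ == "__main__":
    regs = {}
    load_pdptes([0x1001, 0x2001, 0x3001, 0x4001], regs)
    print({hex(k): hex(v) for k, v in sorted(regs.items())})
```

Staging the entries first means a fault on any one of them leaves the final register file (0x063 onward) untouched, which is presumably why the microcode uses two separate loops.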
The author is not affiliated with, endorsed by, or sponsored by Intel Corporation or its affiliates. All trademarks, including but not limited to Intel, Pentium, and any other registered or unregistered marks mentioned herein, are the property of their respective owners. Their use in this context is solely for descriptive and informational purposes and constitutes nominative fair use under applicable trademark laws.