Ray Of Hope

Braindump + Segmentation and Paging And Xen

32-bit PV guest: the guest kernel executes in ring 1 and guest user space executes in ring 3.
64-bit PV guest: the guest kernel executes in ring 3 and guest user space also executes in ring 3.

For a 32-bit PV guest, Xen used segmentation-based protection to keep the guest from accessing Xen memory: Xen and the guest had their own segments. In 64-bit mode segment limits are no longer enforced, so protection based on segmentation is not possible. The guest kernel was therefore moved to ring 3 as well, and paging is used to protect Xen memory from the guest.

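When execution switches from guest user space to the guest kernel in 32-bit PV mode, no TLB flush is needed, because Xen is kept apart from the guest by segments and the guest kernel and guest user space run in different rings of the same address space. In 64-bit PV mode the same switch needs a TLB flush. Why, if the pages are still protected by the paging structures? The likely reason is that the guest kernel and guest user space both run in ring 3, so the user/supervisor bit in the page tables cannot tell them apart. Xen keeps separate page tables for the two and switches between them on each transition, and switching page tables flushes the TLB.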

It is totally wrong to say that segmentation has been disabled and is not used in 64-bit mode. The segmentation hardware unit is still present and still takes part in translating virtual addresses to linear addresses.
Yes, it is now a flat memory model: as documented, the base address of the segments selected by CS, DS, ES and SS is treated as 0, and segment limits are not checked. The FS and GS registers have been retained in 64-bit mode and still supply a non-zero base. So effectively these segment registers (leaving aside FS and GS) can't be used to partition memory, because the hardware does no limit checking on them, and relying on them for protection would be a potential security threat (memory violation). Some segment attributes are still used, such as the code-segment bit that selects between 64-bit mode and compatibility mode. LDTs and GDTs are also still used, largely for backward compatibility. During translation of a virtual address to a physical address, the address still goes through the segmentation unit to produce a linear address, but on 64-bit systems that step has effectively been made a no-op.

On a 32-bit system, every address goes through the segmentation unit to be converted to a linear address. If paging is enabled, the linear address is then translated to a physical address; if paging is not used, there is a 1:1 correspondence between linear and physical addresses.
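To make the difference concrete, here is a minimal sketch in C of the segmentation step in both modes. It is a toy model, not hardware-accurate: the `struct segment` layout, the function names `linear_32` and `linear_64`, and any addresses used with them are assumptions made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified segment descriptor: only base and limit. */
struct segment {
    uint64_t base;   /* linear address where the segment starts */
    uint32_t limit;  /* highest valid offset within the segment */
};

/*
 * 32-bit protected mode: the offset is checked against the segment
 * limit, then linear = base + offset. Returns false on a limit
 * violation (real hardware would raise a fault instead).
 */
bool linear_32(const struct segment *seg, uint32_t offset, uint32_t *linear)
{
    if (offset > seg->limit)
        return false;
    *linear = (uint32_t)(seg->base + offset);
    return true;
}

/*
 * 64-bit mode: no limit check at all. The base is treated as 0 unless
 * the access goes through FS or GS, which still contribute their base.
 */
uint64_t linear_64(const struct segment *seg, bool is_fs_or_gs,
                   uint64_t offset)
{
    return (is_fs_or_gs ? seg->base : 0) + offset;
}
```

With a guest data segment whose limit stops below a region reserved for the hypervisor, `linear_32` rejects an offset that lands in that region, while `linear_64` happily returns the address. That is why, in this model, the 64-bit case has to rely on paging for protection.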

A 64-bit kernel can also request the creation of LDTs while setting up the address space of a new process, for compatibility purposes.

The problem I was resolving today was that a 64-bit SLES 12 SP1 kernel was trying to use a page that has a writable mapping as an LDT. Xen doesn't allow this: LDTs can contain call gates, and if a writable page is used as an LDT, the guest can rewrite its descriptors and use the call gates in the LDT for potentially malicious purposes.

The extract below from http://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management explains why the above operation was blocked by Xen.

“In order to ensure that the guest cannot subvert the system Xen requires that certain invariants are maintained and therefore that all updates to the page tables are vetted by Xen. Xen achieves this by requiring that page tables are always updated using a hypercall.
Xen defines a number of page types and maintains the invariant that any given page has exactly one type at any given time. The type of a page is reference counted and can only be changed when the type count is zero.

The basic types are:

None: No special uses.
LN Page table page: Pages used as a page table at level N. There are separate types for each of the 4 levels on 64-bit and 3 levels on 32-bit PAE guests.
Segment descriptor page: Pages used as part of the Global or Local Descriptor tables (GDT/LDT).
Writeable: Page is writable.

Xen enforces the invariant that only pages with the writable type have a writable mapping in the page tables. Likewise it ensures that no writable mapping exists of a page with any other type. It also enforces other invariants such as requiring that no page table page can make a non-privileged mapping of the hypervisor’s virtual address space etc. By doing this it can ensure that the guest OS is not able to directly modify any critical data structures and therefore subvert the safety of the system, for example to map machine addresses which do not belong to it.

Whenever a set of page-tables is loaded into the hardware page-table base register (cr3) the hypervisor must take an appropriate type reference with the root page-table type (that is, an L4 type reference on 64-bit or an L3 type reference on 32-bit). If the page is not already of the required type then in order to take the initial reference it must first have a type count of zero (remember, a page’s type can only be changed while the type count is zero) and must be validated to ensure that it respects the invariants. For a page with a page table type to be valid it is required that any pages referenced by a present page table entry in the page have the type of the next level down. So any page referenced by a page with type L4 Page Table must itself have type L3 Page Table. This invariant is applied recursively down to the L1 page table layer. At L1 the invariant is that any data page mapped by a writable page table entry must have the Writeable type.
By applying these invariants Xen ensures that the set of page tables as a whole are safe to load into the page table base register.

Similar requirements are placed on other special page types which must likewise be validated and have a type count of the appropriate type taken before they can be passed to the hardware.”
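The "exactly one type, changed only when the type count is zero" rule from the extract can be sketched as a small toy model in C. This is not Xen's code: the enum values, `struct page`, and the helper names `toy_get_type` and `toy_put_type` are hypothetical and exist only to show why a page that still has a writable mapping cannot be turned into an LDT page.

```c
#include <stdbool.h>

/* Page types from the list in the extract (names are illustrative). */
enum page_type {
    TYPE_NONE,
    TYPE_L1, TYPE_L2, TYPE_L3, TYPE_L4,
    TYPE_SEG_DESC,   /* page used as part of a GDT or LDT */
    TYPE_WRITABLE,   /* page has writable mappings */
};

struct page {
    enum page_type type;
    unsigned int type_count;  /* references held on the current type */
};

/* Take a type reference; the type may change only when the count is zero. */
bool toy_get_type(struct page *pg, enum page_type want)
{
    if (pg->type != want) {
        if (pg->type_count != 0)
            return false;     /* still in use under another type */
        pg->type = want;      /* a real hypervisor would validate here */
    }
    pg->type_count++;
    return true;
}

/* Drop a type reference previously taken with toy_get_type(). */
void toy_put_type(struct page *pg)
{
    if (pg->type_count > 0)
        pg->type_count--;
}
```

In this model, a page that holds a `TYPE_WRITABLE` reference fails `toy_get_type(&pg, TYPE_SEG_DESC)` until the writable reference has been dropped with `toy_put_type`, which mirrors the rejection of the writable LDT page described earlier.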

Page type checking (mm.h)
#define BYTES_PER_LONG (1 << LONG_BYTEORDER)
#define BITS_PER_LONG (BYTES_PER_LONG << 3)
#define BITS_PER_BYTE 8
#define PG_shift(idx) (BITS_PER_LONG - (idx))
#define PG_mask(x, idx) (x ## UL << PG_shift(idx))
#define PGT_type_mask PG_mask(15, 4) /* Bits 28-31 or 60-63. */

A page's type is checked with ((x & PGT_type_mask) == PGT_l2_page_table), where x is the page type.

Thanks Anshul Makkar
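As a worked example of the mask arithmetic, here is a small standalone sketch in C. It makes several assumptions: a 64-bit build (so `BITS_PER_LONG` is fixed at 64), a `uint64_t` cast in place of the `UL` suffix pasting so the sketch behaves the same on every platform, and a type value of 2 for an L2 page table, which is chosen for illustration and named `EXAMPLE_l2_page_table` to make clear it is not taken from mm.h. The helper `page_is_l2` is likewise hypothetical.

```c
#include <stdint.h>

/* Assumption: 64-bit build, so the type field sits in bits 60-63. */
#define BITS_PER_LONG   64
#define PG_shift(idx)   (BITS_PER_LONG - (idx))
/* Cast instead of `x ## UL` so the result is 64 bits wide everywhere. */
#define PG_mask(x, idx) ((uint64_t)(x) << PG_shift(idx))

/* 15 << 60 == 0xF000000000000000: the top four bits hold the type. */
#define PGT_type_mask   PG_mask(15, 4)

/* Illustrative type value for an L2 page table (assumed, not from mm.h). */
#define EXAMPLE_l2_page_table PG_mask(2, 4)

/* Returns non-zero when the type field of type_info matches the L2 value. */
int page_is_l2(uint64_t type_info)
{
    return (type_info & PGT_type_mask) == EXAMPLE_l2_page_table;
}
```

Masking first means that whatever is stored in the low bits of the same word does not affect the comparison: `page_is_l2(PG_mask(2, 4) | 5)` is true, while a word carrying a different value in the top four bits is rejected.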
