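Lab 3 is about exploring and manipulating page tables. Working through it finally gave me a much deeper understanding of how page tables work.
One reflection: completing a lab really involves five steps: watching the lecture, reading the textbook, reading the Xv6 source, writing the lab code, and writing up notes. Writing the lab code takes a fairly small share of the total time, though it is the most enjoyable and rewarding part of the whole process.
Xv6 address space and page tables
Here are a few important, classic figures that matter for this lab and for understanding Xv6 as a whole. I won't repeat them in the Xv6 book notes later on.
Kernel address space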
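This figure has a bug: it does not show TRAPFRAME at the top of the address space. That can make KSTACK(p) confusing later on: why is it (TRAMPOLINE - (p)*2*PGSIZE - 3*PGSIZE)? The 3 is there because the macro has to skip the page at the TRAPFRAME address, a guard page, and kstack 0 itself.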
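To make the arithmetic concrete, here is a small host-side sketch (not kernel code). The constants mirror xv6's riscv.h and memlayout.h as I understand them, and KSTACK is copied from the formula above; kstack_va and gap_above_va are hypothetical helper names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE     4096ULL
#define MAXVA      (1ULL << (9 + 9 + 9 + 12 - 1))  // xv6 uses only 38 of Sv39's 39 bits
#define TRAMPOLINE (MAXVA - PGSIZE)                 // highest page
#define TRAPFRAME  (TRAMPOLINE - PGSIZE)            // page right below TRAMPOLINE
#define KSTACK(p)  (TRAMPOLINE - (p) * 2 * PGSIZE - 3 * PGSIZE)

// Lowest address of the kernel stack page for process slot p.
uint64_t kstack_va(uint64_t p) {
  assert(p < 64);  // assumption: NPROC is 64, as in xv6's param.h
  return KSTACK(p);
}

// Address of the unmapped page directly above the stack of slot p.
// For p == 0 this is the guard page between TRAPFRAME and kstack 0.
uint64_t gap_above_va(uint64_t p) {
  return kstack_va(p) + PGSIZE;
}
```

Counting down from TRAMPOLINE, the three skipped pages are the one at TRAPFRAME, the guard page at gap_above_va(0), and the stack page itself at kstack_va(0). Every further slot moves down by two pages (one stack plus one guard).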
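🌟 In Xv6, TRAMPOLINE is mapped at the same virtual address in the kernel page table and in every user page table, and TRAPFRAME is mapped just below it in each user page table. Neither mapping sets PTE_U, so code running in user mode cannot access them.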
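User address space
A user process's code and data live in a range of low addresses starting at 0.
Page tables and page table entries
The low 10 bits of a page table entry are flag bits. The R, W and X bits are only set in leaf entries, where they give the access permissions of the mapped page. The Valid bit says whether the entry is a legitimate page table entry; you can think of it as whether the entry has been created (exists).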
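As a quick illustration of these flag bits, the sketch below decodes two PTE values taken from the vmprint output later in this post. The macros follow the definitions in xv6's riscv.h; pte_is_valid, pte_is_leaf and pte_pa are hypothetical helpers I made up for this example, and the code runs on the host, not in the kernel.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pte_t;

#define PTE_V (1ULL << 0)  // valid
#define PTE_R (1ULL << 1)  // readable
#define PTE_W (1ULL << 2)  // writable
#define PTE_X (1ULL << 3)  // executable
#define PTE_U (1ULL << 4)  // user can access

#define PTE2PA(pte)    (((pte) >> 10) << 12)  // drop 10 flag bits, restore page offset
#define PTE_FLAGS(pte) ((pte) & 0x3FF)        // low 10 bits

int pte_is_valid(pte_t pte) {
  return (pte & PTE_V) != 0;
}

// A valid entry with any of R/W/X set maps a page; with none of them
// set it points to the next-level page table.
int pte_is_leaf(pte_t pte) {
  return pte_is_valid(pte) && (pte & (PTE_R | PTE_W | PTE_X)) != 0;
}

// Physical address stored in a valid entry.
uint64_t pte_pa(pte_t pte) {
  assert(pte_is_valid(pte));
  return PTE2PA(pte);
}
```

With these helpers, 0x21fd9801 has only V set, so it is an interior entry pointing at the table at 0x87f66000, while 0x21fd9c1b has V, R, X and U set, so it is a leaf mapping a user-accessible, executable page.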
Lab 3
Speed up system calls
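The tricky part of this task is deciding where the page-mapping code belongs. The lab text says "When each process is created", but when exactly is that? In fork, or in allocproc? Both sound reasonable, but putting the code directly in either one turns out to be wrong.
At first I put the code in allocproc, because every process creation goes through allocproc, even the first process /init, so it covers all user processes.
After I finished the code, the kernel panicked during boot: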
$ make qemu
riscv64-linux-gnu-gcc -Wall -Werror -O -fno-omit-frame-pointer -ggdb -gdwarf-2 -DSOL_PGTBL -DLAB_PGTBL -MD -mcmodel=medany -ffreestanding -fno-common -nostdlib -mno-relax -I. -fno-stack-protector -fno-pie -no-pie -c -o kernel/proc.o kernel/proc.c
riscv64-linux-gnu-ld -z max-page-size=4096 -T kernel/kernel.ld -o kernel/kernel kernel/entry.o kernel/kalloc.o kernel/string.o kernel/main.o kernel/vm.o kernel/proc.o kernel/swtch.o kernel/trampoline.o kernel/trap.o kernel/syscall.o kernel/sysproc.o kernel/bio.o kernel/fs.o kernel/log.o kernel/sleeplock.o kernel/file.o kernel/pipe.o kernel/exec.o kernel/sysfile.o kernel/kernelvec.o kernel/plic.o kernel/virtio_disk.o kernel/start.o kernel/console.o kernel/printf.o kernel/uart.o kernel/spinlock.o
riscv64-linux-gnu-objdump -S kernel/kernel > kernel/kernel.asm
riscv64-linux-gnu-objdump -t kernel/kernel | sed '1,/SYMBOL TABLE/d; s/ .* / /; /^$/d' > kernel/kernel.sym
qemu-system-riscv64 -machine virt -bios none -kernel kernel/kernel -m 128M -smp 3 -nographic -global virtio-mmio.force-legacy=false -drive file=fs.img,if=none,format=raw,id=x0 -device virtio-blk-device,drive=x0,bus=virtio-mmio-bus.0
xv6 kernel is booting
hart 1 starting
hart 2 starting
panic: freewalk: leaf
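A bt in the debugger shows that it died in proc_freepagetable -> uvmfree -> freewalk (I had no debugging screenshot at hand when writing these notes and didn't bother to take a new one). The reason is that freewalk requires that "All leaf mappings must already have been removed."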
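This goes back to the point made earlier: a user process's code and data live in low addresses starting at 0, whereas TRAMPOLINE, TRAPFRAME and the newly mapped USYSCALL all sit at the top of the address space. uvmfree only unmaps the range from 0 up to sz, so these high mappings have to be removed by hand in proc_freepagetable.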
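The failure mode can be reproduced on the host with a toy model. This is only a sketch under my own assumptions: tables are allocated with aligned_alloc instead of kalloc, the function returns -1 where the kernel would call panic("freewalk: leaf"), and toy_freewalk, new_table and build_one_mapping are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PGSIZE 4096
#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_X (1ULL << 3)
#define PA2PTE(pa)  ((((uint64_t)(uintptr_t)(pa)) >> 12) << 10)
#define PTE2PA(pte) (((pte) >> 10) << 12)

typedef uint64_t pte_t;
typedef uint64_t *pagetable_t;  // 512 PTEs

pagetable_t new_table(void) {
  pagetable_t t = aligned_alloc(PGSIZE, PGSIZE);  // page-aligned, like kalloc
  assert(t != NULL);
  memset(t, 0, PGSIZE);
  return t;
}

// Frees all page-table pages. Returns 0 on success, or -1 if a leaf
// mapping was still present. Unlike the kernel, which panics on the
// first leaf, this toy keeps going so every table still gets freed.
int toy_freewalk(pagetable_t pt) {
  int rc = 0;
  for (int i = 0; i < 512; i++) {
    pte_t pte = pt[i];
    if ((pte & PTE_V) && (pte & (PTE_R | PTE_W | PTE_X)) == 0) {
      // Interior entry: recurse into the lower-level table.
      if (toy_freewalk((pagetable_t)(uintptr_t)PTE2PA(pte)) < 0)
        rc = -1;
      pt[i] = 0;
    } else if (pte & PTE_V) {
      rc = -1;  // leaf still mapped
    }
  }
  free(pt);
  return rc;
}

// Builds root -> mid -> leaf table with one leaf mapping in slot 0.
// *leaf_pte is set to point at that leaf entry.
pagetable_t build_one_mapping(pte_t **leaf_pte) {
  pagetable_t root = new_table(), mid = new_table(), leaf = new_table();
  root[0] = PA2PTE(mid) | PTE_V;
  mid[0] = PA2PTE(leaf) | PTE_V;
  leaf[0] = PA2PTE(0x1000) | PTE_V | PTE_R;  // fake data page, never dereferenced
  *leaf_pte = &leaf[0];
  return root;
}
```

Calling toy_freewalk on a table built by build_one_mapping reports the leftover leaf; clearing the leaf entry first (the job uvmunmap does in the kernel) lets the walk succeed.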
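Along the way, the code in proc_pagetable and proc_freepagetable also taught me the right way to map TRAMPOLINE and TRAPFRAME and how long those mappings live. So the code that maps USYSCALL belongs in these two functions.
This also cleared up something that had puzzled me in the lab text: "For inspiration, understand the trapframe handling in kernel/proc.c". I kept thinking this was a mistake; shouldn't trapframe handling be in kernel/trap.c?
The code: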
diff --git a/kernel/proc.c b/kernel/proc.c
index 58a8a0b..146f4e2 100644
--- a/kernel/proc.c
+++ b/kernel/proc.c
@@ -177,6 +177,7 @@ pagetable_t
proc_pagetable(struct proc *p)
{
pagetable_t pagetable;
+ char *mem;
// An empty page table.
pagetable = uvmcreate();
@@ -202,6 +203,26 @@ proc_pagetable(struct proc *p)
return 0;
}
+ // Map one read-only page at USYSCALL.
+ // At the start of this page, store a struct usyscall,
+ // and initialize it to store the PID of the current process.
+ mem = kalloc();
+ if (mem == 0) {
+ uvmunmap(pagetable, TRAMPOLINE, 1, 0);
+ uvmunmap(pagetable, TRAPFRAME, 1, 0);
+ uvmfree(pagetable, 0);
+ return 0;
+ }
+ if (mappages(pagetable, USYSCALL, PGSIZE,
+ (uint64)mem, PTE_R | PTE_U) < 0) {
+ kfree(mem);
+ uvmunmap(pagetable, TRAMPOLINE, 1, 0);
+ uvmunmap(pagetable, TRAPFRAME, 1, 0);
+ uvmfree(pagetable, 0);
+ return 0;
+ }
+ *(struct usyscall*)mem = (struct usyscall){ .pid = p->pid };
+
return pagetable;
}
@@ -212,6 +233,7 @@ proc_freepagetable(pagetable_t pagetable, uint64 sz)
{
uvmunmap(pagetable, TRAMPOLINE, 1, 0);
uvmunmap(pagetable, TRAPFRAME, 1, 0);
+ uvmunmap(pagetable, USYSCALL, 1, 1);
uvmfree(pagetable, sz);
}
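For intuition about what the USYSCALL page buys a user program, here is a rough POSIX analogue that runs on an ordinary host rather than on xv6. It swaps in mmap and mprotect for the kernel-side mapping, so it is not how xv6 does it; publish_pid and fast_getpid are hypothetical names, and the struct usyscall layout (a single int pid) is my assumption about the lab's definition.

```c
#define _DEFAULT_SOURCE  // for MAP_ANONYMOUS under strict -std= modes
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_BYTES 4096

struct usyscall {
  int pid;  // assumed layout: just the process ID
};

// Plays the role of the kernel: fill in one page, then make it read-only.
struct usyscall *publish_pid(void) {
  void *mem = mmap(NULL, PAGE_BYTES, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED)
    return NULL;
  struct usyscall *u = mem;
  u->pid = (int)getpid();
  if (mprotect(mem, PAGE_BYTES, PROT_READ) != 0) {
    munmap(mem, PAGE_BYTES);
    return NULL;
  }
  return u;
}

// Plays the role of the user-side fast path: a plain memory read,
// no trap into the kernel.
int fast_getpid(const struct usyscall *u) {
  assert(u != NULL);
  return u->pid;
}
```

The point of the design is the same in both settings: the value is written once by the privileged side, and the reader only ever needs a load from a read-only page.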
Print a page table
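I first considered an iterative, non-recursive version, but I couldn't sketch one that looked clean, so for generality and elegance I dropped the idea. Because the first line of output has to be printf("page table %p\n", pagetable);, a single function can't do the recursion on its own; a helper, vmprint_raw, does the actual recursive traversal and printing. I also pass along a depth so the indentation can be printed.
The output differs slightly from the one in the lab text: the PTE values and physical addresses are not exactly the same.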
xv6 kernel is booting
hart 1 starting
hart 2 starting
page table 0x0000000087f6b000
..0: pte 0x0000000021fd9801 pa 0x0000000087f66000
.. ..0: pte 0x0000000021fd9401 pa 0x0000000087f65000
.. .. ..0: pte 0x0000000021fd9c1b pa 0x0000000087f67000
.. .. ..1: pte 0x0000000021fd9017 pa 0x0000000087f64000
.. .. ..2: pte 0x0000000021fd8c07 pa 0x0000000087f63000
.. .. ..3: pte 0x0000000021fd8817 pa 0x0000000087f62000
..255: pte 0x0000000021fda801 pa 0x0000000087f6a000
.. ..511: pte 0x0000000021fda401 pa 0x0000000087f69000
.. .. ..509: pte 0x0000000021fda013 pa 0x0000000087f68000
.. .. ..510: pte 0x0000000021fdd007 pa 0x0000000087f74000
.. .. ..511: pte 0x0000000020001c0b pa 0x0000000080007000
init: starting sh
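The indices in the output above can be cross-checked by splitting a virtual address into its three 9-bit page-table indices. The sketch below is host-side C; PX and PXSHIFT follow xv6's riscv.h, USYSCALL is assumed to be TRAPFRAME - PGSIZE as in the lab's memlayout.h, and struct vpn and split_va are hypothetical names for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE  4096ULL
#define PGSHIFT 12
#define PXMASK  0x1FFULL  // 9 bits per level
#define PXSHIFT(level) (PGSHIFT + 9 * (level))
#define PX(level, va)  ((((uint64_t)(va)) >> PXSHIFT(level)) & PXMASK)

#define MAXVA      (1ULL << (9 + 9 + 9 + 12 - 1))
#define TRAMPOLINE (MAXVA - PGSIZE)
#define TRAPFRAME  (TRAMPOLINE - PGSIZE)
#define USYSCALL   (TRAPFRAME - PGSIZE)  // assumption: lab's memlayout.h

struct vpn {
  int l2, l1, l0;  // index into the root, middle and leaf table
};

struct vpn split_va(uint64_t va) {
  assert(va < MAXVA);
  struct vpn v = { (int)PX(2, va), (int)PX(1, va), (int)PX(0, va) };
  return v;
}
```

TRAMPOLINE splits into 255/511/511, TRAPFRAME into 255/511/510 and USYSCALL into 255/511/509, which matches the last three leaf lines of the dump. Their flag bits agree as well: 0x0b is V|R|X for TRAMPOLINE, 0x07 is V|R|W for TRAPFRAME, and 0x13 is V|R|U for the read-only, user-accessible USYSCALL page.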
The code:
diff --git a/kernel/defs.h b/kernel/defs.h
index a3c962b..e1730f9 100644
--- a/kernel/defs.h
+++ b/kernel/defs.h
@@ -170,6 +170,7 @@ void uvmunmap(pagetable_t, uint64, uint64, int);
void uvmclear(pagetable_t, uint64);
pte_t * walk(pagetable_t, uint64, int);
uint64 walkaddr(pagetable_t, uint64);
+void vmprint(pagetable_t);
int copyout(pagetable_t, uint64, char *, uint64);
int copyin(pagetable_t, char *, uint64, uint64);
int copyinstr(pagetable_t, char *, uint64, uint64);
diff --git a/kernel/exec.c b/kernel/exec.c
index e18bbb6..463d383 100644
--- a/kernel/exec.c
+++ b/kernel/exec.c
@@ -128,6 +128,10 @@ exec(char *path, char **argv)
p->trapframe->sp = sp; // initial stack pointer
proc_freepagetable(oldpagetable, oldsz);
+ if (p->pid == 1) {
+ vmprint(p->pagetable);
+ }
+
return argc; // this ends up in a0, the first argument to main(argc, argv)
bad:
diff --git a/kernel/vm.c b/kernel/vm.c
index 5c31e87..26a9f59 100644
--- a/kernel/vm.c
+++ b/kernel/vm.c
@@ -293,6 +293,32 @@ freewalk(pagetable_t pagetable)
kfree((void*)pagetable);
}
+// Recursively walk a page table
+// and print the contents of a page table.
+static void
+vmprint_raw(pagetable_t pagetable, int depth) {
+ for (int i = 0; i < 512; ++i) {
+ pte_t pte = pagetable[i];
+ if (pte & PTE_V) {
+ for (int j = 0; j < depth; ++j) {
+ printf(" ..");
+ }
+ printf("%d: pte %p pa %p\n", i, pte, PTE2PA(pte));
+ if (depth <= 2) { // if current is not a leaf page table
+ uint64 child = PTE2PA(pte);
+ vmprint_raw((pagetable_t)child, depth + 1);
+ }
+ }
+ }
+}
+
+// Print the contents of a page table.
+void
+vmprint(pagetable_t pagetable) {
+ printf("page table %p\n", pagetable);
+ vmprint_raw(pagetable, 1);
+}
+
// Free user memory pages,
// then free page-table pages.
void
Detect which pages have been accessed
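Although this task is rated Hard, it really isn't that hard, and there isn't much code to write.
The definitions of the A and D bits can be found in the RISC-V Privileged manual; I had also come across them when studying the Linux kernel.
The figure below shows the layout of a RISC-V Sv39 page table entry. Sv39 means virtual addresses are 39 bits wide, and it is the scheme Xv6 uses. satp is the register that holds the address of the root page table.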
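As a small worked example of how satp encodes that address, the sketch below follows the MAKE_SATP macro in xv6's riscv.h: the mode (8 for Sv39) goes in the top 4 bits and the root table's physical page number in the low 44 bits. The function names make_satp, satp_mode and satp_root_pa are hypothetical, and the code runs on the host.

```c
#include <assert.h>
#include <stdint.h>

#define SATP_SV39 (8ULL << 60)  // MODE field = 8 selects Sv39

// Compose a satp value from the physical address of the root page table.
uint64_t make_satp(uint64_t root_pa) {
  assert((root_pa & 0xFFF) == 0);  // page tables are page-aligned
  return SATP_SV39 | (root_pa >> 12);
}

// MODE lives in bits 63..60.
uint64_t satp_mode(uint64_t satp) {
  return satp >> 60;
}

// PPN lives in bits 43..0; shift it back to get the physical address.
uint64_t satp_root_pa(uint64_t satp) {
  return (satp & ((1ULL << 44) - 1)) << 12;
}
```

For the root page table at 0x87f6b000 printed by vmprint earlier in this post, make_satp gives 0x8000000000087f6b.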
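My implementation caps the number of pages that one call may check at 4096 (numerically the same as the page size). With that cap the result bitmap needs at most 512 bytes, so it always fits in the single page obtained from kalloc.
The code: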
diff --git a/kernel/defs.h b/kernel/defs.h
index e1730f9..2f86786 100644
--- a/kernel/defs.h
+++ b/kernel/defs.h
@@ -106,6 +106,7 @@ void yield(void);
int either_copyout(int user_dst, uint64 dst, void *src, uint64 len);
int either_copyin(void *dst, int user_src, uint64 src, uint64 len);
void procdump(void);
+int pgaccess(uint64 base, int len, uint64 mask);
// swtch.S
void swtch(struct context*, struct context*);
diff --git a/kernel/proc.c b/kernel/proc.c
index 146f4e2..25ce69f 100644
--- a/kernel/proc.c
+++ b/kernel/proc.c
@@ -708,3 +708,43 @@ procdump(void)
printf("\n");
}
}
+
+// Reports which pages have been accessed
+// and clear PTE_A after checking.
+int
+pgaccess(uint64 base, int len, uint64 mask) {
+ struct proc *p = myproc();
+ uint64 start, end, va;
+ int bit, rc;
+ uint8 *buf, *state;
+ pte_t *pte;
+
+ state = buf = kalloc();
+ if (buf == 0) {
+ return -1;
+ }
+
+ bit = *state = 0;
+ start = PGROUNDDOWN(base);
+ end = start + len * PGSIZE;
+ for (va = start; va < end; va += PGSIZE) { // uint64, so large addresses are not truncated
+ pte = walk(p->pagetable, va, 0);
+ if (pte == 0) {
+ kfree(buf);
+ return -1;
+ }
+ if (*pte & PTE_A) {
+ *state |= 1 << bit;
+ *pte &= ~PTE_A;
+ }
+ if (++bit == 8) {
+ ++state;
+ bit = *state = 0;
+ }
+ }
+
+ rc = copyout(p->pagetable, mask, (char *)buf, (len + 8 - 1) / 8);
+ kfree(buf);
+
+ return rc;
+}
\ No newline at end of file
diff --git a/kernel/riscv.h b/kernel/riscv.h
index 20a01db..e71c193 100644
--- a/kernel/riscv.h
+++ b/kernel/riscv.h
@@ -343,6 +343,7 @@ typedef uint64 *pagetable_t; // 512 PTEs
#define PTE_W (1L << 2)
#define PTE_X (1L << 3)
#define PTE_U (1L << 4) // user can access
+#define PTE_A (1L << 6) // accessed
// shift a physical address to the right place for a PTE.
#define PA2PTE(pa) ((((uint64)pa) >> 12) << 10)
diff --git a/kernel/sysproc.c b/kernel/sysproc.c
index 88644b2..6a2b5e3 100644
--- a/kernel/sysproc.c
+++ b/kernel/sysproc.c
@@ -71,11 +71,21 @@ sys_sleep(void)
#ifdef LAB_PGTBL
+#define MAXLEN 4096
+
int
-sys_pgaccess(void)
-{
- // lab pgtbl: your code here.
- return 0;
+sys_pgaccess(void) {
+ uint64 base, mask;
+ int len;
+
+ argaddr(0, &base);
+ argint(1, &len);
+ argaddr(2, &mask);
+
+ if (len <= 0 || len > MAXLEN)
+ return -1;
+
+ return pgaccess(base, len, mask);
}
#endif
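The bit-packing step of pgaccess above can be exercised outside the kernel. The following is a minimal host-side sketch, assuming the PTEs of the scanned range are already available as an array; mask_bytes and collect_accessed are hypothetical names, and the sketch indexes the bitmap with i / 8 and i % 8 instead of the running state and bit variables used in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t pte_t;

#define PTE_A (1ULL << 6)  // accessed

// Bytes needed to hold one bit per page, rounded up.
size_t mask_bytes(int npages) {
  assert(npages > 0);
  return ((size_t)npages + 7) / 8;
}

// Sets bit i of mask if page i was accessed, then clears PTE_A so the
// next call only reports accesses that happened in between.
void collect_accessed(pte_t *ptes, int npages, uint8_t *mask) {
  memset(mask, 0, mask_bytes(npages));
  for (int i = 0; i < npages; i++) {
    if (ptes[i] & PTE_A) {
      mask[i / 8] |= (uint8_t)(1u << (i % 8));
      ptes[i] &= ~PTE_A;
    }
  }
}
```

Bit 0 of byte 0 corresponds to the first page, so on a little-endian machine a user program can read the first bytes of the mask back as one integer with page i at bit i. With the 4096-page cap, mask_bytes never exceeds 512.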
Testing
(No idea why usertests runs so slowly.)
Xv6 Book Chapter 3
Notable quotes from the Xv6 book. I only record the parts that made sense to me, seemed useful, or that I wasn't very familiar with:
The designers of RISC-V chose these numbers based on technology predictions. $2^{39}$ bytes is 512 GB, which should be enough address space for applications running on RISC-V computers.
If any of the three PTEs required to translate an address is not present, the paging hardware raises a page-fault exception, leaving it up to the kernel to handle the exception (see Chapter 4).
To tell a CPU to use a page table, the kernel must write the physical address of the root pagetable page into the satp register.
QEMU exposes the device interfaces to software as memory-mapped control registers that sit below 0x80000000 in the physical address space. The kernel can interact with the devices by reading/writing these special physical addresses; such reads and writes communicate with the device hardware rather than with RAM.
The kernel gets at RAM and memory-mapped device registers using “direct mapping;” that is, mapping the resources at virtual addresses that are equal to the physical address.
A physical page (holding the trampoline code) is mapped twice in the virtual address space of the kernel: once at top of the virtual address space and once with a direct mapping.
The central functions are walk, which finds the PTE for a virtual address, and mappages, which installs PTEs for new mappings.
The translations include the kernel’s instructions and data, physical memory up to PHYSTOP, and memory ranges which are actually devices. (Note: once paging is enabled, the addresses of instructions are translated as well.)
When xv6 changes a page table, it must tell the CPU to invalidate corresponding cached TLB entries.
It is also necessary to issue sfence.vma before changing satp, in order to wait for completion of all outstanding loads and stores.
Xv6 ought to determine how much physical memory is available by parsing configuration information provided by the hardware. Instead xv6 assumes that the machine has 128 megabytes of RAM.
Xv6 leaves PTE_V clear in unused PTEs.
Xv6 binaries are formatted in the widely-used ELF format, defined in (kernel/elf.h).
A program section header’s filesz may be less than the memsz, indicating that the gap between them should be filled with zeroes (for C global variables) rather than read from the file.
exec must wait to free the old image until it is sure that the system call will succeed: if the old image is gone, the system call cannot return -1 to it.
More serious kernel designs exploit the page table to turn arbitrary hardware physical memory layouts into predictable kernel virtual address layouts.