Building a RISC-V simulator in Rust - Part 1
As a way to gain more experience with Rust I decided to implement a RISC-V instruction set simulator (ISS) in Rust, which I have imaginatively called rrs, the Rust RISC-V Simulator. So far I’ve implemented the RV32IM instruction set. (For those unfamiliar with RISC-V, this is all the basic instructions: ALU operations, branching, load/store, multiplication/division etc, without any of the OS/system level features like exceptions or memory management and none of the more complex/interesting extensions like atomics.)
The full code for rrs can be found on GitHub; this blog post was written against the c2bb07e commit.
If you’re not familiar with Rust, it’s a compiled language initially developed by Mozilla as a potential replacement for C or C++. It provides various memory and thread safety guarantees to eliminate whole classes of bugs (e.g. buffer overruns) whilst still providing high performance. I won’t discuss Rust at length here, just a few aspects that are relevant to rrs. The Rust Book is a great way to learn the language if you want to find out more. Learn Rust With Entirely Too Many Linked Lists is another good read.
My initial aims with rrs were to produce something simple and modular without a big focus on performance (though I didn’t want it to be too slow). Plus as this was a project to learn the language I wanted to use a reasonable set of Rust features to implement it.
To aid modularity it is split into two separate parts (crates in Rust terms), rrs-cli and rrs-lib. rrs-lib is a collection of components that can be used to implement an ISS but offer flexibility in how they’re used. The idea being that someone who wants ISS components, rather than a monolithic ISS, could use rrs-lib to implement whatever it is they’re doing. For example a cycle accurate performance model of a CPU core would need to emulate RISC-V instructions but needs much more besides. It could use rrs-lib to implement RISC-V emulation and build the model on top of it.
Another possible use would be co-simulation for validating a CPU design. The simulation of the CPU (e.g. written in a hardware description language like Verilog) runs alongside a RISC-V simulation (using rrs-lib), being continuously cross-checked to ensure instructions executed on the CPU do what the RISC-V simulation says they should.
The second half is rrs-cli. This is a command line application that uses rrs-lib to execute compiled RISC-V binaries. At the time of writing it has a very basic feature set. It can set up a simulated memory of any requested size at a given address, load a raw binary into it and then run it. It provides a couple of ‘devices’ that allow the program to output to stdout and terminate the simulation.
Processing Instructions
There are various strategies for implementing an instruction set simulator. A typical way is to use a giant switch case statement at its core. An instruction is fetched from a simulated memory, the opcode extracted, then the switch case used to execute the instruction based upon its opcode. In pseudo-C:
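A minimal sketch of such a core, assuming a simple register file and only a couple of RV32I opcodes (the structure is what matters; the names and the cut-down opcode handling are illustrative, not taken from rrs):

```c
/* Sketch of the "giant switch" ISS core. Opcode values come from the
 * RV32I base encoding; only two opcodes are handled and funct3/funct7
 * are ignored for brevity. */
#include <stdint.h>

/* Execute one instruction against a register file.
 * Returns 1 if the instruction was recognised, 0 otherwise. */
int execute_insn(uint32_t regs[32], uint32_t insn) {
    uint32_t opcode = insn & 0x7f;         /* opcode lives in bits [6:0] */
    uint32_t rd = (insn >> 7) & 0x1f;
    uint32_t rs1 = (insn >> 15) & 0x1f;
    uint32_t rs2 = (insn >> 20) & 0x1f;

    switch (opcode) {
        case 0x33: /* OP: register/register ALU (add shown only) */
            if (rd != 0)
                regs[rd] = regs[rs1] + regs[rs2];
            return 1;
        case 0x13: /* OP-IMM: addi shown only; immediate is in bits
                    * [31:20] (arithmetic shift assumed for sign extension) */
            if (rd != 0)
                regs[rd] = regs[rs1] + (uint32_t)(((int32_t)insn) >> 20);
            return 1;
        default:   /* invalid or unimplemented instruction */
            return 0;
    }
}
```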
This can be implemented using a structure called a ‘jump table’: instead of executing many comparisons to determine which opcode you have, a table of code addresses is constructed. The opcode is extracted from the instruction, used to look up an entry in this table, and execution jumps to the address found there. Where you have many opcodes this can provide optimal performance.
This switch case statement structure is what I’ll use in rrs-lib, though with slight modifications to aid modularity.
Rust doesn’t actually have a switch statement, instead it has match, which can do much the same thing but is far more powerful, as it allows pattern matching using Rust’s advanced type system. For the instruction processing this pattern matching isn’t used, but I’ll have an example using pattern matching later.
To give the modularity I wanted I created a trait, which is a Rust concept similar to an interface in other languages such as Java or C#, or a pure virtual class in C++. I called the trait InstructionProcessor and it contains one function per RISC-V instruction implemented. Here’s part of the definition. I’ve removed much of it for compactness but you should get the idea:
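A cut-down sketch of such a trait, with stand-in decoded-instruction structs (the exact rrs-lib definition differs and covers every implemented RV32IM instruction):

```rust
// Hedged sketch of the InstructionProcessor trait, not the exact
// rrs-lib definition. The decoded instruction formats are minimal
// stand-ins here.
#[derive(Clone, Copy)]
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }
#[derive(Clone, Copy)]
pub struct IType { pub rd: usize, pub rs1: usize, pub imm: i32 }

pub trait InstructionProcessor {
    type InstructionResult;

    // Register/register ALU instructions take a decoded RType...
    fn process_add(&mut self, dec_insn: RType) -> Self::InstructionResult;
    fn process_sub(&mut self, dec_insn: RType) -> Self::InstructionResult;
    // ...register/immediate instructions take an IType, and so on,
    // one function per implemented instruction (most elided here).
    fn process_addi(&mut self, dec_insn: IType) -> Self::InstructionResult;
}
```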
A trait is implemented on a struct, which is Rust’s way of describing structured data. You can implement functions for that struct (similar to classes in C++/C#/Java, but Rust is not an object-oriented language so they’re not identical concepts). Each function in the InstructionProcessor trait takes a reference to the struct it is implemented on and a decoded version of the instruction to be processed, and returns some result type (Self::InstructionResult). RISC-V has a variety of standard instruction formats (RType, IType etc) and I created a Rust struct for each of them. Each struct includes a function to create itself from an encoded instruction. Here’s the RType struct definition:
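A sketch of the decoder, reconstructed from the standard RISC-V R-type encoding; the field names are my approximation of rrs-lib's rather than a verbatim copy:

```rust
// R-type layout from the RISC-V base encoding:
// | funct7 | rs2 | rs1 | funct3 | rd | opcode |
// |31    25|24 20|19 15|14    12|11 7|6      0|
pub struct RType {
    pub funct7: u32,
    pub rs2: usize,
    pub rs1: usize,
    pub funct3: u32,
    pub rd: usize,
}

impl RType {
    // Decode the fixed R-type fields out of an encoded instruction.
    pub fn new(insn: u32) -> RType {
        RType {
            funct7: (insn >> 25) & 0x7f,
            rs2: ((insn >> 20) & 0x1f) as usize,
            rs1: ((insn >> 15) & 0x1f) as usize,
            funct3: (insn >> 12) & 0x7,
            rd: ((insn >> 7) & 0x1f) as usize,
        }
    }
}
```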
This provides modularity, as you can implement many different kinds of InstructionProcessor that do different things with decoded instructions. Instruction execution is one thing we want to do but there are others. Disassembly is one example, which we can implement by building an InstructionProcessor that produces a string describing the instruction. Another example would be an instruction profiler, something that counts the number of times a particular instruction is used.
A process_instruction function takes an encoded instruction and an InstructionProcessor and calls the appropriate process_X function. This is where we implement our big match statement, like the pseudo-C example above. As RISC-V decoding isn’t as simple as masking out a single set of bits to identify what instruction to execute, this is implemented over multiple functions. Here’s process_instruction in full:
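A simplified standalone sketch of this two-level structure (returning mnemonics instead of being generic over an InstructionProcessor, as the real function is, and with most opcodes and ALU operations elided):

```rust
// Simplified sketch of process_instruction's shape. The real function
// is generic over an InstructionProcessor and returns
// Option<T::InstructionResult>; this version just names what it found.
const OPCODE_OP: u32 = 0x33;
const OPCODE_OP_IMM: u32 = 0x13;
const OPCODE_LOAD: u32 = 0x03;

fn process_instruction(insn: u32) -> Option<&'static str> {
    // The bottom 7 bits select the major opcode; each opcode group
    // gets its own helper that examines funct3/funct7 as needed.
    match insn & 0x7f {
        OPCODE_OP => process_opcode_op(insn),
        OPCODE_OP_IMM => Some("some OP-IMM instruction"),
        OPCODE_LOAD => Some("some LOAD instruction"),
        _ => None, // undecodable: invalid or unimplemented extension
    }
}

// Second decode level for the OP opcode: funct3/funct7 pick the
// ALU operation (only add/sub shown).
fn process_opcode_op(insn: u32) -> Option<&'static str> {
    let funct3 = (insn >> 12) & 0x7;
    let funct7 = insn >> 25;
    match (funct3, funct7) {
        (0b000, 0b000_0000) => Some("add"),
        (0b000, 0b010_0000) => Some("sub"),
        _ => None, // remaining ALU operations elided
    }
}
```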
At the top-level it reads an opcode and, depending on which opcode it is, calls into another function that deals with that opcode; process_opcode_op, for example, deals with the basic ALU instructions (e.g. add x1, x2, x3).
process_opcode_op does another level of matching to work out what instruction we have, decodes the instruction into an RType structure and calls the relevant function in the supplied InstructionProcessor.
process_instruction (and in turn the process_X functions it calls) returns an Option type. This can be Some or None. It returns None where the instruction couldn’t be decoded (it’s an invalid instruction or from an extension rrs doesn’t yet implement). When it returns Some this includes the InstructionResult from the InstructionProcessor.
Notably I’ve used Rust generics here. Had we implemented something like this in C++ we could use an InstructionProcessor class with all virtual functions. We’d pass a pointer to the class to process_instruction and it would call the relevant one. This would work but forces a virtual function call for each instruction processed, which is more expensive than a normal function call and cannot be inlined.
We could do something similar in Rust, but by using generics process_instruction knows what kind of InstructionProcessor it is being used with and can do a normal function call instead. This also gives the compiler the opportunity to inline the function calls, so our nice nested structure with the modular InstructionProcessor could turn into a single monolithic function in machine code and remove all function call overhead. Whether it actually does this (and whether that would actually be the most optimal thing to do) is anyone’s guess, but it offers the compiler good opportunities for optimisation. I may delve into what code is actually being generated at a later date. (Note you could do the same thing in C++ with templates.)
Executing Instructions
With process_instruction in place we can implement an InstructionProcessor that executes instructions. First we need a struct to represent our CPU state. RISC-V uses the term ‘hart’ (meaning hardware thread) for a set of state we execute instructions against, so that struct is called HartState.
In order to implement the zero register (which ignores writes and always reads 0) two functions are implemented on this struct for reading and writing registers:
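A sketch of the state struct and its zero-register-aware accessors; the field names are assumptions rather than copied from rrs-lib (the real struct carries some extra bookkeeping):

```rust
// Hedged sketch of HartState plus the register accessors that
// implement x0 semantics: writes are dropped, reads always give 0.
pub struct HartState {
    pub registers: [u32; 32],
    pub pc: u32,
}

impl HartState {
    pub fn new() -> Self {
        HartState { registers: [0; 32], pc: 0 }
    }

    // Writes to x0 are silently ignored...
    pub fn write_register(&mut self, reg_index: usize, data: u32) {
        if reg_index != 0 {
            self.registers[reg_index] = data;
        }
    }

    // ...and reads of x0 always return 0.
    pub fn read_register(&self, reg_index: usize) -> u32 {
        if reg_index == 0 { 0 } else { self.registers[reg_index] }
    }
}
```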
A separate struct, InstructionExecutor, will implement InstructionProcessor. We could implement this directly on HartState but the separate struct keeps things decoupled.
We’ll discuss the mem field (that uses the Memory trait) below.
Before we start implementing the various process_X functions we need to decide what return type will be used. We’ll use the following:
type InstructionResult = Result<bool, InstructionException>;
The Result can be an Ok or an Err. Where it is Ok a bool is provided. This is true when the instruction updates the PC (e.g. because it is a jump or a taken branch) and false otherwise. When the instruction execution results in an exception the result is an Err providing an InstructionException which describes the exception.
This done we can start implementing the process_X functions, which is straightforward but very repetitive; for example here’s an add implementation:
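Stripped of the trait plumbing, the body of such an implementation looks roughly like this; a free function over stand-in types here, where the real version is an InstructionProcessor method on InstructionExecutor using the register accessor functions:

```rust
// Simplified sketch of an add implementation. The real version
// returns Result<bool, InstructionException>; the bool is false
// because add never modifies the PC itself.
pub struct HartState {
    pub registers: [u32; 32],
    pub pc: u32,
}

pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

fn process_add(hart: &mut HartState, dec_insn: RType) -> Result<bool, ()> {
    let a = hart.registers[dec_insn.rs1];
    let b = hart.registers[dec_insn.rs2];
    let result = a.wrapping_add(b);
    if dec_insn.rd != 0 {
        hart.registers[dec_insn.rd] = result; // x0 writes are dropped
    }
    Ok(false) // PC not updated by this instruction
}
```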
This works fine but many other instructions will look almost identical, with just the let result line using a different operator. The addi instruction will also be very similar, this time using the same operation but taking b from an immediate rather than a register read. Can we minimise this repetition?
We’ll do this in two steps. The first is implementing functions on InstructionExecutor to do some of the repetitive work, such as:
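A standalone sketch of the helper's shape, together with the one-line process_add it enables (the real method lives on InstructionExecutor and also handles the Result/PC plumbing):

```rust
// Sketch of the repetition-removing helper: the ALU operation is a
// closure argument, everything else (register reads, x0-safe write)
// is shared. Generic over F so the compiler can inline the closure.
pub struct HartState { pub registers: [u32; 32] }

pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

fn execute_reg_reg_op<F>(hart: &mut HartState, dec_insn: RType, op: F)
where
    F: Fn(u32, u32) -> u32,
{
    let a = hart.registers[dec_insn.rs1];
    let b = hart.registers[dec_insn.rs2];
    let result = op(a, b);
    if dec_insn.rd != 0 {
        hart.registers[dec_insn.rd] = result;
    }
}

// process_add then collapses to a single call:
fn process_add(hart: &mut HartState, dec_insn: RType) {
    execute_reg_reg_op(hart, dec_insn, |a, b| a.wrapping_add(b))
}
```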
This does the work of functions like process_add above but it takes the operation to perform as an argument (you provide it with a ‘closure’, which effectively allows you to pass a function as an argument). process_add can then be implemented as a one-liner that calls execute_reg_reg_op with the add operation.
As execute_reg_reg_op uses generics, the use of closures here shouldn’t cause a performance issue. The compiler can inline the implementation rather than performing an extra function call to compute the operation.
It’d be nice to cut down the boilerplate further; to do so we’ll turn to macros. In C and other C-like languages macros are purely text based: arguments passed to a macro are textually inserted within it and whatever results is fed to the compiler. In Rust macros work a bit differently [1]. They work on things called ‘token trees’, which are part of the Rust compiler’s parsing process. When defining a macro we specify the kinds of token tree its arguments must match against (e.g. an expression or an identifier) and then an expansion for that match. We can define a macro that produces functions like process_add for us:
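A sketch of such a macro. One caveat: the real rrs-lib macro is invoked with just the instruction name (add) and derives the process_add name from it, which needs identifier concatenation (e.g. via the paste crate); to stay dependency-free this sketch takes the full function name instead, and generates free functions rather than trait methods:

```rust
// Sketch of a function-generating macro over simplified state.
pub struct HartState { pub registers: [u32; 32] }
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

macro_rules! make_alu_op_reg_fn {
    // $fn_name must be an identifier, $op any expression (here, a
    // closure computing the ALU result from the two source registers).
    ($fn_name:ident, $op:expr) => {
        fn $fn_name(hart: &mut HartState, dec_insn: RType) {
            let a = hart.registers[dec_insn.rs1];
            let b = hart.registers[dec_insn.rs2];
            let result = $op(a, b);
            if dec_insn.rd != 0 {
                hart.registers[dec_insn.rd] = result;
            }
        }
    };
}

// One line per instruction, with the operation as a closure:
make_alu_op_reg_fn! {process_add, |a: u32, b: u32| a.wrapping_add(b)}
make_alu_op_reg_fn! {process_and, |a: u32, b: u32| a & b}
```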
To implement add we’d use it like this:
make_alu_op_reg_fn! {add, |a, b| a + b}
Much more compact, with a nice clear relation between the instruction name and the operation it performs. We can go one step further to remove the repetition between instructions like add and addi. Both implement the same operation, and given the name of one we can work out the name of the other. So we define another macro that implements the register/register version and the register/immediate version together.
The make_alu_op_imm_fn macro is much like the make_alu_op_reg_fn above, just implementing the register/immediate version of the operation and adding an ‘i’ to the function name. With this all done we get a nice compact implementation of the various register/register and register/immediate operations.
Note the use of wrapping_add and wrapping_sub. This is due to Rust’s approach to safety. It considers an overflow in an addition or a subtraction an error, which it checks for in debug builds, causing a panic (which kills the program with an error message) if it detects one. wrapping_add and wrapping_sub tell Rust we don’t consider overflows to be an error and want to wrap around when one happens, which matches the semantics of the RISC-V add and sub instructions.
I’ll skip the explanations for the other instructions so check out the source if you want to see the details for those.
Disassembling Instructions
To implement instruction disassembly we implement a new InstructionProcessor that has an InstructionResult type of String. We’ll implement InstructionProcessor on a new struct, InstructionStringOutputter. As with InstructionExecutor the implementation of an individual instruction processing function is straightforward and repetitive:
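An individual disassembly function mostly boils down to a format! call; sketched standalone here with a stand-in RType (the real version is a method on InstructionStringOutputter):

```rust
// Sketch of a disassembly processing function: it just formats the
// mnemonic and the decoded register indices into a String.
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

fn process_add(dec_insn: RType) -> String {
    format!("add x{}, x{}, x{}", dec_insn.rd, dec_insn.rs1, dec_insn.rs2)
}
```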
Once again we turn to macros to cut down on the repetition. This time we’ve got groups of functions where the only thing changing is the function name and the mnemonic (add, sub, or etc). We can use another macro feature here: variable arguments.
The string_out_for_alu_reg_op macro implements our disassembly function for a particular named instruction. string_out_for_alu_ops takes multiple instruction names and applies string_out_for_alu_reg_op and string_out_for_alu_imm_op to each of them in turn (string_out_for_alu_imm_op is the immediate processing version of string_out_for_alu_reg_op). This gives us a nice compact implementation of InstructionProcessor.
This gives us something that, given a 32-bit RISC-V instruction, can output its disassembly. It can be used by the ISS to output disassembled instructions to a log file as they are executed.
Again I’ve skipped over implementations for the other instructions, check out the source to see how they work.
Memories
We need a way to implement a simulated memory. On top of ‘normal’ RAM, devices are also represented as memories. For instance you could have a memory that, when written to, outputs the byte written as a character to standard out. So we’ll have a new trait, Memory, so we can implement different kinds of memory:
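A simplified sketch of the trait along with a trivial word-addressed RAM implementation. Note the real rrs-lib trait also takes an access-size parameter (byte/halfword/word) and has the Downcast supertrait discussed below; both are omitted here:

```rust
// Simplified Memory trait sketch: one read, one write.
pub trait Memory {
    // Returns None when the address isn't valid for this memory.
    fn read_mem(&mut self, addr: u32) -> Option<u32>;
    // Returns false when the write failed (e.g. address out of range).
    fn write_mem(&mut self, addr: u32, store_data: u32) -> bool;
}

// A trivial word-addressed RAM to show the trait in use.
pub struct VecMemory { pub mem: Vec<u32> }

impl Memory for VecMemory {
    fn read_mem(&mut self, addr: u32) -> Option<u32> {
        self.mem.get((addr >> 2) as usize).copied()
    }
    fn write_mem(&mut self, addr: u32, store_data: u32) -> bool {
        match self.mem.get_mut((addr >> 2) as usize) {
            Some(word) => { *word = store_data; true }
            None => false,
        }
    }
}
```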
A straightforward interface with one function to read and another to write is all we need, though it does have one peculiarity that’s needed due to how Rust works, and that’s the Downcast bit.
Before explaining, let’s take a look at a Memory implementation, MemorySpace. The HartState structure has a single reference to a Memory that is used for all load and store instructions, but we may want multiple memories (e.g. a ‘normal’ RAM along with some devices). MemorySpace allows us to do this by holding multiple memories, along with a base address and size for each. When it gets a read_mem or write_mem call it works out which memory the address belongs to and forwards the call on. It uses the following function to add memories:
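A hedged sketch of how MemorySpace and add_memory can fit together; the signatures and error handling here are guesses at the shape rather than the actual rrs-lib code (overlap checking, for one, is elided):

```rust
// MemorySpace owns boxed Memory trait objects, each with a base and
// size, and forwards accesses to whichever region contains the address.
pub trait Memory {
    fn read_mem(&mut self, addr: u32) -> Option<u32>;
    fn write_mem(&mut self, addr: u32, store_data: u32) -> bool;
}

struct MemoryRegion { base: u32, size: u32, memory: Box<dyn Memory> }

pub struct MemorySpace { memories: Vec<MemoryRegion> }

impl MemorySpace {
    pub fn new() -> Self { MemorySpace { memories: Vec::new() } }

    // Takes ownership of the memory and returns an index that can be
    // used to get a reference to it back later.
    pub fn add_memory(&mut self, base: u32, size: u32, memory: Box<dyn Memory>)
        -> Result<usize, ()> {
        self.memories.push(MemoryRegion { base, size, memory });
        Ok(self.memories.len() - 1)
    }
}

impl Memory for MemorySpace {
    fn read_mem(&mut self, addr: u32) -> Option<u32> {
        let region = self.memories.iter_mut()
            .find(|r| addr >= r.base && addr - r.base < r.size)?;
        region.memory.read_mem(addr - region.base)
    }
    fn write_mem(&mut self, addr: u32, store_data: u32) -> bool {
        match self.memories.iter_mut()
            .find(|r| addr >= r.base && addr - r.base < r.size) {
            Some(region) => region.memory.write_mem(addr - region.base, store_data),
            None => false,
        }
    }
}
```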
The Box<dyn Memory> is a dynamic reference to a Memory trait. Unlike when you’re using generics, add_memory and MemorySpace itself don’t know the actual type being used, so a dynamic dispatch mechanism is used when calling read_mem and write_mem. This is useful here as it allows MemorySpace to hold many different kinds of memory.
Users of rrs-lib may want to retain a reference to any memory they add to a MemorySpace (if it’s a device there may be some other way it needs to be interacted with, or they may want ‘backdoor’ access to the memory rather than using the read_mem/write_mem functions). In many other languages this would be simple, just hold a copy of the reference elsewhere; this is harder in Rust due to its safety properties.
Rust provides automatic memory management, without needing a garbage collector, as well as providing various safety properties. It does this with some strict rules around what can hold a reference to something, and by ensuring any allocated memory has a single owner.
A reference can be handed out to a structure provided that reference doesn’t outlive the owner of the structure. Then, when the owner goes out of scope, Rust knows it can safely deallocate the memory. Multiple references can exist to a structure, but only one mutable reference (which allows you to alter the structure) can exist at a time. Something called the ‘borrow checker’ enforces these rules, ensuring reference lifetimes are strictly shorter than or the same as owner lifetimes. The borrow checker runs at compile time. This gives automatic memory management without run-time overhead but does restrict what you’re able to do.
The add_memory function takes ownership of whatever memory is added to it, and the borrow checker won’t allow whatever calls add_memory to retain a reference (mutable or otherwise) to the memory, as the borrow checker can’t ensure at compile time that the MemorySpace outlives the reference.
The way around this is to add functionality to MemorySpace which allows you to get a reference or a mutable reference to a particular memory. add_memory returns an index (the usize in the Result) which can be used to get a reference to the memory again at a later point in the program.
We’re not quite done though; this is where the Downcast is important [2]. As you might expect, Rust doesn’t just let you convert between types as you wish. In C++, for instance, if we had a Memory* we could convert it to some other SpecificMemory*. In Rust we cannot just convert our Box<dyn Memory> into a Box<SpecificMemory> or a &SpecificMemory. The Downcast bit in our Memory trait allows us to do this though, in a safe, checked way. Here’s the implementation of get_memory_ref that gives you a reference to a previously added memory:
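The real implementation leans on downcast-rs; this dependency-free sketch shows the same checked-downcast idea using std::any::Any through an as_any escape hatch on the trait (a different mechanism from the one rrs-lib uses, but the observable behaviour matches the description here):

```rust
// Checked downcast from a trait object back to a concrete memory type.
use std::any::Any;

pub trait Memory {
    fn read_mem(&mut self, addr: u32) -> Option<u32>;
    // Escape hatch so trait objects can be downcast; downcast-rs
    // generates the equivalent of this for you.
    fn as_any(&self) -> &dyn Any;
}

pub struct MemorySpace { pub memories: Vec<Box<dyn Memory>> }

impl MemorySpace {
    // Generic over the concrete type the caller expects: Some(&T) if
    // the memory at index is a T, None if it isn't (or index is bad).
    pub fn get_memory_ref<T: Memory + 'static>(&self, index: usize) -> Option<&T> {
        self.memories.get(index)?.as_any().downcast_ref::<T>()
    }
}
```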
This is a generic function: a caller specifies the type they expect a memory to be and it returns an Option, which will be Some if the memory is of that type (providing a reference to it) and None if the memory is of a different type or the index doesn’t exist.
Bringing it all together
We’ve now got all the bits and pieces we need to execute a RISC-V binary, we just need to bring them together. First let’s look at the step function of InstructionExecutor. This fetches an instruction from memory, executes it, then increments the PC to point to the next instruction as needed:
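A standalone sketch of the step logic, with stub types and a stub process_instruction standing in for the rrs-lib versions (the exception type and its payload are assumptions for illustration):

```rust
// Sketch of step: process an instruction, then a single match over the
// nested Option<Result<bool, _>> handles PC update and errors.
#[derive(Debug, PartialEq)]
enum InstructionException { IllegalInstruction(u32, u32) }

struct Hart { pc: u32 }

// Stand-in for process_instruction: only recognises add x1, x2, x3,
// which doesn't update the PC itself (Ok(false)).
fn process_instruction(insn: u32) -> Option<Result<bool, InstructionException>> {
    if insn == 0x003100b3 { Some(Ok(false)) } else { None }
}

fn step(hart: &mut Hart, insn: u32) -> Result<(), InstructionException> {
    match process_instruction(insn) {
        // Executed, PC untouched: advance to the next instruction
        Some(Ok(false)) => { hart.pc = hart.pc.wrapping_add(4); Ok(()) }
        // Jump or taken branch: the instruction already set the PC
        Some(Ok(true)) => Ok(()),
        // Executed but raised an exception
        Some(Err(e)) => Err(e),
        // Couldn't decode the instruction at all
        None => Err(InstructionException::IllegalInstruction(hart.pc, insn)),
    }
}
```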
This uses the pattern matching with the match statement I referenced above. process_instruction returns a nested data type, Option<Result<bool, InstructionException>>. The outer Option is None if the instruction can’t be decoded. If it is Some we have either an Ok or an Err in the inner Result. We can use a single match to cover all possibilities by using pattern matching on the inner structure.
That’s it for functionality in rrs-lib. I won’t go over all the details of how rrs-cli works; let’s just take a look at the function that runs the simulation:
(See the run_sim function in the rrs-cli source for the full listing.)
It sets up an InstructionExecutor with a supplied HartState and MemorySpace, which are created by rrs-cli before it calls run_sim. It enters a loop and steps the executor, optionally writing disassembly (using InstructionStringOutputter) and register writes out to a log. There is a special SimulationCtrlDevice memory that we can write to to stop the simulation. MemorySpace::get_memory_ref is used to access this and see if a stop has been requested, terminating the simulation if so.
Performance and Testing
Finally, a word on the performance of rrs and the testing methodology.
As a way to test performance and check the implementation I ran the CoreMark benchmark on rrs. The actual benchmark result is meaningless, but what is interesting is how many instructions per second rrs can execute. A run of 1000 CoreMark iterations on my Thinkpad 490s (with an Intel i7-8565U) executes 275,063,040 instructions in 5.28 s, giving 52.1 million instructions per second. My CPU has a turbo frequency of 4.6 GHz and 5-way decode, so could execute at most 23 billion instructions per second, giving a ratio of ~440 real CPU instructions per emulated RISC-V instruction. In reality we won’t be near that theoretical maximum, so the ratio may be more like 200 or lower. A pretty decent result given I wasn’t trying hard to achieve high performance. There’s room for improvement though; at a later date I may take a look at how to improve performance.
I ran the same CoreMark binary on Spike, using time to measure execution time. This gave a wall clock time of 8.852 s vs 5.376 s for rrs-cli (this is total run time including initialisation), so rrs is ~60% faster. Spike is a far more complete and featureful simulator so it’s not an entirely fair comparison, but rrs still stacks up well.
CoreMark also helps test rrs. It computes a checksum during the benchmark and checks against an expected value. This would fail if the benchmark executed incorrectly. CoreMark should give a reasonable workout of the RV32IM instruction set so this is a decent test of functionality.
On top of running CoreMark I’ve got a small set of tests in Rust. So far these test:
- The instruction format structures (e.g. instruction_formats::RType) decode instructions correctly.
- InstructionStringOutputter produces a correct string for all implemented instructions. This checks that both InstructionStringOutputter and process_instruction are doing the right thing.
- InstructionExecutor correctly executes a small program that uses each ‘kind’ of instruction (a branch and a load are different kinds, an add and a sub are not).
- The various Memory implementations function correctly.
In particular I don’t have any Rust tests exhaustively testing the execution of all instructions. Instead I plan to use tests like CoreMark and the RISC-V Architectural Test Framework. I’d want to run these tests regardless of other tests (as demonstrating successful execution of binaries is clearly important), and separate per-instruction tests in Rust would mostly duplicate this, so I decided to spend effort elsewhere.
What’s Next
There are a few things I’d like to work on next, in the rough order I’ll tackle them:
- An example RISC-V software build flow
- Run the RISC-V Architectural Test Framework
- Implement exception handling
- Implement page tables
- Implement a device tree for rrs-cli, either the DTS format used by Linux or something custom
- Get xv6-riscv running
- Implement debug support
- Get Linux running?
In doing some of these I’ll likely need to implement some more extensions (such as F for floating point and A for atomics).
[1] The Little Book of Rust Macros is a good guide to Rust macros if you want to learn more ↩︎
[2] This isn’t a native language feature; it’s an external crate, downcast-rs ↩︎