Building a RISC-V simulator in Rust - Part 1
As a way to gain some more experience using Rust I decided to implement a RISC-V instruction set simulator (ISS) in Rust that I have imaginatively called rrs, the Rust RISC-V Simulator. So far I’ve implemented the RV32IM instruction set (for those unfamiliar with RISC-V, this is all the basic instructions: ALU operations, branching, load/store, multiplication/division etc, without any of the OS/system level features like exceptions or memory management, and none of the more complex/interesting extensions like atomics).
The full code for rrs can be found on GitHub; this blog post was written against the c2bb07e commit.
If you’re not familiar with Rust, it’s a compiled language initially developed by Mozilla as a potential replacement for C or C++ that provides various memory and thread safety guarantees to eliminate whole classes of bugs (e.g. buffer overruns). It achieves this whilst still providing high performance. I won’t discuss Rust at length here, just a few aspects that are relevant to rrs. The Rust Book is a great way to learn the language if you want to find out more. Learn Rust With Entirely Too Many Linked Lists is another good read.
My initial aims with rrs were to produce something simple and modular without a big focus on performance (though I didn’t want it to be too slow). Plus, as this was a project to learn the language, I wanted to use a reasonable set of Rust features to implement it.
To aid modularity it is split into two separate parts (crates in Rust terms),
rrs-cli and rrs-lib. rrs-lib is a collection of components that can be
used to implement the ISS but offer flexibility in how they’re used. The idea
being that someone who wants ISS components but not just a monolithic ISS could
use rrs-lib to implement whatever it is they’re doing. For example a cycle
accurate performance model of a CPU core would need to emulate RISC-V
instructions but needs much more besides. It could use rrs-lib to implement
RISC-V emulation and build the model on top of it.
Another possible use would be co-simulation for validating a CPU design. The
simulation of the CPU (e.g. written in a hardware description language like
Verilog) runs alongside a RISC-V simulation (using rrs-lib) being continuously
cross-checked ensuring instructions executed on the CPU do what the RISC-V
simulation says they should.
The second half is rrs-cli. This is a command line application that uses
rrs-lib to execute compiled RISC-V binaries. At time of writing it has a very
basic feature set. It can set up a simulated memory of any requested size at a
given address, load a raw binary into it and then run it. It provides a couple
of ‘devices’ that allow the program to output to stdout and terminate the
simulation.
Processing Instructions
There are various strategies for implementing an instruction set simulator. A
typical approach is to use a giant switch case statement at its core. An
instruction is fetched from a simulated memory, the opcode extracted, then the
switch case used to execute the instruction based upon its opcode. In
pseudo-C:
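The original post had a pseudo-C snippet here. A minimal sketch of the idea might look like the following (the execution bodies are stubbed out; only the dispatch shape matters, and the opcode values are the real RV32I ones):

```c
#include <stdint.h>

/* Opcode values from the RV32I base encoding */
#define OPCODE_OP   0x33 /* register/register ALU operations */
#define OPCODE_LOAD 0x03 /* load instructions */

/* Returns 1 if the instruction was recognised and 'executed', 0 otherwise */
uint32_t execute_instruction(uint32_t insn) {
    uint32_t opcode = insn & 0x7f; /* the opcode is the bottom 7 bits */

    switch (opcode) {
        case OPCODE_OP:
            /* decode further, then execute the ALU operation */
            return 1;
        case OPCODE_LOAD:
            /* decode further, then perform the load */
            return 1;
        default:
            /* unknown/unimplemented instruction */
            return 0;
    }
}
```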
This can be implemented using a structure called a ‘jump table’: instead of executing many comparisons to determine which opcode you have, a table of code addresses is constructed. The opcode is extracted from the instruction and used to look up a location in this table, which you then jump to. Where you have many opcodes this can provide optimal performance.
This switch case statement structure is what I’ll use in rrs-lib, though with
slight modifications to aid modularity.
Rust doesn’t actually have a switch statement, instead it has match which
can do much the same thing but is far more powerful as it allows pattern
matching using Rust’s advanced type system. For the instruction processing this
pattern matching isn’t used but I’ll have an example using pattern matching
later.
To give the modularity I wanted, I created a trait, which is a Rust concept
similar to an interface in other languages such as Java or C#, or a pure virtual
class in C++. I called the trait InstructionProcessor and it contains one
function per RISC-V instruction implemented. Here’s part of the definition; I’ve
removed much of it for compactness but you should get the idea:
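The trait definition from the post was lost here; a heavily abridged sketch might look like this (field names and layout are illustrative — the real trait in rrs-lib has one process_X function per RV32IM instruction):

```rust
// Minimal stand-ins for the decoded instruction format structs
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }
pub struct IType { pub rd: usize, pub rs1: usize, pub imm: i32 }

pub trait InstructionProcessor {
    type InstructionResult;

    fn process_add(&mut self, dec_insn: RType) -> Self::InstructionResult;
    fn process_sub(&mut self, dec_insn: RType) -> Self::InstructionResult;
    fn process_addi(&mut self, dec_insn: IType) -> Self::InstructionResult;
    // ...and so on for every other implemented RV32IM instruction
}
```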
A trait is implemented on a struct, which is Rust’s way of describing
structured data. You can implement functions for that struct (similar to
classes in C++/C#/Java, though Rust is not an object oriented language so they’re
not identical concepts). Each function in the InstructionProcessor trait takes
a reference to the struct it is implemented on and a decoded version of the
instruction to be processed, and returns some result type
(Self::InstructionResult). RISC-V has a variety of standard instruction
formats (RType, IType etc) and I created a Rust struct for each of them. Each
struct includes a function to create itself from an encoded instruction. Here’s
the RType struct definition:
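A sketch of what the RType struct might look like. The bit positions follow the RV32I R-type encoding (rd in bits 11:7, rs1 in 19:15, rs2 in 24:20); the exact field types in rrs-lib may differ:

```rust
pub struct RType {
    pub rd: usize,
    pub rs1: usize,
    pub rs2: usize,
}

impl RType {
    // Decode an RType from a raw 32-bit encoded instruction
    pub fn new(insn: u32) -> RType {
        RType {
            rd: ((insn >> 7) & 0x1f) as usize,
            rs1: ((insn >> 15) & 0x1f) as usize,
            rs2: ((insn >> 20) & 0x1f) as usize,
        }
    }
}
```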
This provides modularity as you can implement many different kinds of
InstructionProcessor that operate on decoded instructions. Instruction
execution is one thing we want to do but there are others. Disassembly is one
example, which we can implement by building an InstructionProcessor that
produces a string describing the instruction. Another example would be an
instruction profiler, something that counts the number of times a particular
instruction is used.
A process_instruction function takes an encoded instruction and an
InstructionProcessor and calls the appropriate process_X function. This is
where we implement our big match statement like the pseudo-C example above.
As RISC-V decoding isn’t as simple as masking out a single set of bits to
identify which instruction to execute, this is implemented over multiple
functions. Here’s process_instruction in full:
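An abridged sketch of the shape of process_instruction: match on the bottom 7 opcode bits and hand off to a per-opcode helper, with None for anything unrecognised. The trait and helper are stubbed down to a single instruction here so the snippet stands alone; the real function covers every RV32IM opcode:

```rust
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

pub trait InstructionProcessor {
    type InstructionResult;
    fn process_add(&mut self, dec_insn: RType) -> Self::InstructionResult;
}

const OPCODE_OP: u32 = 0x33; // register/register ALU instructions

fn process_opcode_op<T: InstructionProcessor>(
    processor: &mut T,
    insn: u32,
) -> Option<T::InstructionResult> {
    // Decode funct7 (bits 31:25) and funct3 (bits 14:12); only add shown
    match ((insn >> 25) & 0x7f, (insn >> 12) & 0x7) {
        (0b000_0000, 0b000) => Some(processor.process_add(RType {
            rd: ((insn >> 7) & 0x1f) as usize,
            rs1: ((insn >> 15) & 0x1f) as usize,
            rs2: ((insn >> 20) & 0x1f) as usize,
        })),
        _ => None,
    }
}

pub fn process_instruction<T: InstructionProcessor>(
    processor: &mut T,
    insn: u32,
) -> Option<T::InstructionResult> {
    match insn & 0x7f {
        OPCODE_OP => process_opcode_op(processor, insn),
        // ...arms for the other opcodes (OP-IMM, LOAD, STORE, BRANCH...) elided
        _ => None,
    }
}
```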
At the top-level it reads an opcode and depending on which opcode it is we call
into another function that deals with that opcode, e.g. here’s part of
process_opcode_op (which deals with the basic ALU instructions e.g. add x1, x2, x3):
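A sketch of part of process_opcode_op, with the trait pared down to three instructions so the snippet is self-contained. The real function matches funct3/funct7 combinations for all the OP-opcode instructions:

```rust
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

impl RType {
    fn new(insn: u32) -> RType {
        RType {
            rd: ((insn >> 7) & 0x1f) as usize,
            rs1: ((insn >> 15) & 0x1f) as usize,
            rs2: ((insn >> 20) & 0x1f) as usize,
        }
    }
}

pub trait InstructionProcessor {
    type InstructionResult;
    fn process_add(&mut self, dec_insn: RType) -> Self::InstructionResult;
    fn process_sub(&mut self, dec_insn: RType) -> Self::InstructionResult;
    fn process_sll(&mut self, dec_insn: RType) -> Self::InstructionResult;
}

fn process_opcode_op<T: InstructionProcessor>(
    processor: &mut T,
    insn: u32,
) -> Option<T::InstructionResult> {
    let funct3 = (insn >> 12) & 0x7;
    let funct7 = (insn >> 25) & 0x7f;

    match (funct7, funct3) {
        (0b000_0000, 0b000) => Some(processor.process_add(RType::new(insn))),
        (0b010_0000, 0b000) => Some(processor.process_sub(RType::new(insn))),
        (0b000_0000, 0b001) => Some(processor.process_sll(RType::new(insn))),
        // ...remaining ALU instructions elided
        _ => None,
    }
}
```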
process_opcode_op does another level of matching to work out what instruction
we have, decodes the instruction into an RType structure and calls the
relevant function in the supplied InstructionProcessor.
process_instruction (and in turn the process_X functions it calls) returns
an Option type. This can be Some or None. It returns None where the
instruction couldn’t be decoded (it’s an invalid instruction or from an
extension rrs doesn’t yet implement). When it returns Some this includes the
InstructionResult from the InstructionProcessor.
Notably I’ve used Rust generics here. Say we implemented something like this in
C++: we could use an InstructionProcessor class with all virtual functions.
We’d pass a pointer to the class to process_instruction and it would call the
relevant one. This works but forces a virtual function call for each
instruction processed, which is more expensive than a normal function call and
cannot be inlined.
We could do something similar in Rust but by using generics
process_instruction knows what kind of InstructionProcessor it is being used
with and can do a normal function call instead. This also gives the compiler the
opportunity to inline the function calls so our nice nested structure with the
modular InstructionProcessor could turn into a single monolithic function in
machine code and remove all function call overhead.  Whether it actually does
this (and whether that would actually be the most optimal thing to do) is
anyone’s guess but it offers the compiler good opportunities for optimisation. I
may delve into what code is actually being generated at a later date.
(Note you could do the same thing in C++ with templates)
Executing Instructions
With process_instruction in place we can implement an InstructionProcessor
that executes instructions. First we need a struct to represent our CPU state.
RISC-V uses the term ‘hart’ (meaning hardware thread) for a set of state we
execute instructions against, so that struct is called HartState:
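A minimal sketch of HartState — just the register file and the PC (the real struct in rrs-lib carries a little more, e.g. state to support the register-write logging mentioned later):

```rust
pub struct HartState {
    /// x0–x31 registers (x0 is hardwired to zero)
    pub registers: [u32; 32],
    /// Program counter
    pub pc: u32,
}

impl HartState {
    pub fn new() -> HartState {
        HartState { registers: [0; 32], pc: 0 }
    }
}
```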
In order to implement the zero register (which ignores writes and always reads 0) two functions are implemented on this struct for reading and writing registers:
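A sketch of those accessors (function names follow the prose; the real signatures may differ slightly):

```rust
pub struct HartState {
    pub registers: [u32; 32],
    pub pc: u32,
}

impl HartState {
    /// Write a register, ignoring writes to x0
    pub fn write_register(&mut self, reg_index: usize, data: u32) {
        if reg_index == 0 {
            return;
        }
        self.registers[reg_index] = data;
    }

    /// Read a register, with x0 always reading as 0
    pub fn read_register(&self, reg_index: usize) -> u32 {
        if reg_index == 0 {
            0
        } else {
            self.registers[reg_index]
        }
    }
}
```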
A separate struct InstructionExecutor will implement InstructionProcessor.
We could implement this directly on HartState but the separate struct keeps
things decoupled.
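A sketch of the InstructionExecutor struct: it holds mutable borrows of the hart state it updates and a mem field implementing a Memory trait (stubbed down here to a read/write pair):

```rust
pub trait Memory {
    fn read_mem(&mut self, addr: u32) -> Option<u32>;
    fn write_mem(&mut self, addr: u32, store_data: u32) -> bool;
}

pub struct HartState {
    pub registers: [u32; 32],
    pub pc: u32,
}

pub struct InstructionExecutor<'a, M: Memory> {
    pub mem: &'a mut M,
    pub hart_state: &'a mut HartState,
}
```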
We’ll discuss the mem field (that uses the Memory trait) below.
Before we start implementing the various process_X functions we need to decide
what return type will be used. We’ll use the following:
type InstructionResult = Result<bool, InstructionException>;
The Result can be an Ok or an Err. Where it is Ok a bool is provided.
This is true when the instruction updates the PC (e.g. because it is a jump or
a taken branch) and false otherwise. When instruction execution results in
an exception the result is an Err providing an InstructionException which
describes the exception.
With this done we can start implementing the process_X functions, which is
straightforward but very repetitive; for example here’s an add
implementation:
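A sketch of what a process_add implementation looks like — read rs1 and rs2, add with wrapping semantics, write rd. It's written here as a free function over a stubbed HartState so the snippet stands alone; in rrs-lib it's a method on InstructionExecutor:

```rust
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

#[derive(Debug, PartialEq)]
pub struct InstructionException;

pub struct HartState {
    pub registers: [u32; 32],
}

impl HartState {
    fn read_register(&self, i: usize) -> u32 {
        if i == 0 { 0 } else { self.registers[i] }
    }
    fn write_register(&mut self, i: usize, data: u32) {
        if i != 0 { self.registers[i] = data }
    }
}

fn process_add(
    hart: &mut HartState,
    dec_insn: RType,
) -> Result<bool, InstructionException> {
    let a = hart.read_register(dec_insn.rs1);
    let b = hart.read_register(dec_insn.rs2);
    let result = a.wrapping_add(b);
    hart.write_register(dec_insn.rd, result);
    Ok(false) // add never updates the PC itself
}
```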
This works fine but many other instructions will look almost identical, with
just the let result line using a different operator. The addi instruction
will also be very similar, this time using the same operation but taking b
from an immediate rather than a register read. Can we minimise this repetition?
We’ll do this in two steps, the first is implementing functions on
InstructionExecutor to do some of the repetitive work, such as:
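A sketch of such a helper: the repeated read/compute/write dance is factored out, with the actual ALU operation passed in as a closure (the surrounding types are stubbed so the snippet compiles on its own):

```rust
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

pub struct HartState {
    pub registers: [u32; 32],
}

fn execute_reg_reg_op<F>(hart: &mut HartState, dec_insn: RType, op: F)
where
    F: Fn(u32, u32) -> u32,
{
    let a = hart.registers[dec_insn.rs1];
    let b = hart.registers[dec_insn.rs2];
    let result = op(a, b);
    if dec_insn.rd != 0 {
        hart.registers[dec_insn.rd] = result; // x0 writes are dropped
    }
}
```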
This does the work of functions like process_add above but takes the
operation to perform as an argument (you provide it with a ‘closure’, which
effectively allows you to pass a function as an argument). process_add can
then be implemented as:
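A sketch of process_add built on the helper (the helper is repeated here, in condensed form, so the snippet stands alone):

```rust
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }
pub struct HartState { pub registers: [u32; 32] }

// Helper repeated so the snippet compiles on its own
fn execute_reg_reg_op<F: Fn(u32, u32) -> u32>(hart: &mut HartState, dec_insn: RType, op: F) {
    let result = op(hart.registers[dec_insn.rs1], hart.registers[dec_insn.rs2]);
    if dec_insn.rd != 0 {
        hart.registers[dec_insn.rd] = result;
    }
}

fn process_add(hart: &mut HartState, dec_insn: RType) -> bool {
    execute_reg_reg_op(hart, dec_insn, |a, b| a.wrapping_add(b));
    false // add doesn't update the PC
}
```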
As execute_reg_reg_op uses generics the use of closures here shouldn’t cause a
performance issue. The compiler can inline the implementation rather than
performing an extra function call to compute the operation.
It’d be nice to cut down the boilerplate further; to do so we’ll turn to macros.
In C and other C-like languages macros are purely text based: arguments passed
to a macro are textually inserted within it and whatever results is fed to the
compiler. In Rust macros work a bit differently [1]. They work on things called
‘token trees’ which are part of the Rust compiler’s parsing process. When
defining a macro we specify the kinds of token tree its arguments must match against
(e.g. an expression or an identifier) and then an expansion for that match. We
can define a macro that produces functions like process_add for us:
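A hedged sketch of such a macro. One caveat: plain macro_rules! can't concatenate identifiers, so this sketch takes the full function name as an argument; the real macro in rrs derives process_add etc from the bare instruction name (identifier pasting of this kind typically needs something like the paste crate):

```rust
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }
pub struct HartState { pub registers: [u32; 32] }

macro_rules! make_alu_op_reg_fn {
    ($fn_name:ident, $op:expr) => {
        fn $fn_name(hart: &mut HartState, dec_insn: RType) {
            let a = hart.registers[dec_insn.rs1];
            let b = hart.registers[dec_insn.rs2];
            if dec_insn.rd != 0 {
                hart.registers[dec_insn.rd] = ($op)(a, b);
            }
        }
    };
}

make_alu_op_reg_fn! {process_add, |a: u32, b: u32| a.wrapping_add(b)}
make_alu_op_reg_fn! {process_sub, |a: u32, b: u32| a.wrapping_sub(b)}
```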
To implement add we’d use it like this:
make_alu_op_reg_fn! {add, |a, b| a + b}
Much more compact, with a nice clear relation between the instruction name and the
operation it performs. We can go one step further and remove the repetition
between instructions like add and addi. Both implement the same operation,
and given the name of one we can work out the name of the other. So we define
another macro that implements the register/register version and the
register/immediate version together:
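A sketch of the combined macro. As above, identifier concatenation is sidestepped by passing both function names explicitly; the real macro derives the immediate name by appending ‘i’ to the instruction name. In rrs the various ALU operations then reduce to a list of one-line macro invocations:

```rust
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }
pub struct IType { pub rd: usize, pub rs1: usize, pub imm: i32 }
pub struct HartState { pub registers: [u32; 32] }

macro_rules! make_alu_op_fns {
    ($reg_fn:ident, $imm_fn:ident, $op:expr) => {
        // Register/register version: b comes from a register read
        fn $reg_fn(hart: &mut HartState, dec_insn: RType) {
            let result = ($op)(hart.registers[dec_insn.rs1], hart.registers[dec_insn.rs2]);
            if dec_insn.rd != 0 {
                hart.registers[dec_insn.rd] = result;
            }
        }

        // Register/immediate version: b comes from the decoded immediate
        fn $imm_fn(hart: &mut HartState, dec_insn: IType) {
            let result = ($op)(hart.registers[dec_insn.rs1], dec_insn.imm as u32);
            if dec_insn.rd != 0 {
                hart.registers[dec_insn.rd] = result;
            }
        }
    };
}

make_alu_op_fns! {process_add, process_addi, |a: u32, b: u32| a.wrapping_add(b)}
```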
The make_alu_op_imm_fn is much like the make_alu_op_reg_fn above, just
implementing the register/immediate version of the operation and adding an ‘i’
to the function name. With this all done we get a nice compact implementation of
the various register/register and register/immediate operations.
Note the use of wrapping_add and wrapping_sub. This is due to Rust’s
approach to safety. It considers an overflow in an addition or a subtraction an
error, which it checks for in debug builds, causing a panic (which kills the
program with an error message) if it detects one. wrapping_add and
wrapping_sub tell Rust we don’t consider overflows to be an error and want to
wrap around when one happens, which matches the semantics of the RISC-V add and
sub instructions.
I’ll skip the explanations for the other instructions so check out the source if you want to see the details for those.
Disassembling Instructions
To implement instruction disassembly we implement a new InstructionProcessor
that has an InstructionResult type of String. We’ll implement
InstructionProcessor on a new struct InstructionStringOutputter. As with
InstructionExecutor, the implementation of an individual instruction processing
function is straightforward and repetitive:
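A sketch of one such function — it just formats the mnemonic and the decoded register indices into a String (the method is shown here on a bare struct rather than via the full trait, to keep the snippet small):

```rust
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

pub struct InstructionStringOutputter;

impl InstructionStringOutputter {
    fn process_add(&mut self, dec_insn: RType) -> String {
        format!("add x{}, x{}, x{}", dec_insn.rd, dec_insn.rs1, dec_insn.rs2)
    }
}
```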
Once again we turn to macros to cut down on the repetition. This time we’ve got
groups of functions where the only thing changing is the function name and the
mnemonic (add, sub, or etc). We can use another macro feature here,
variadic arguments:
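A hedged sketch of the variadic pattern: `$($name:ident),*` matches a comma-separated list of mnemonics and emits one disassembly function per mnemonic, using stringify! to turn the identifier into the printed mnemonic. (The real macros generate process_X methods on InstructionStringOutputter, plus the immediate variants; plain free functions are used here to stay self-contained.)

```rust
pub struct RType { pub rd: usize, pub rs1: usize, pub rs2: usize }

macro_rules! string_out_for_alu_reg_op {
    ($name:ident) => {
        fn $name(dec_insn: RType) -> String {
            format!(
                "{} x{}, x{}, x{}",
                stringify!($name),
                dec_insn.rd,
                dec_insn.rs1,
                dec_insn.rs2
            )
        }
    };
}

macro_rules! string_out_for_alu_ops {
    ($($name:ident),*) => {
        // Expand the single-instruction macro once per listed mnemonic
        $(string_out_for_alu_reg_op! {$name})*
    };
}

string_out_for_alu_ops! {add, sub, sll}
```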
The string_out_for_alu_reg_op implements our disassembly function for a
particular named instruction. string_out_for_alu_ops takes multiple
instruction names and applies string_out_for_alu_reg_op and
string_out_for_alu_imm_op to each of them in turn (string_out_for_alu_imm_op
is the immediate processing version of string_out_for_alu_reg_op). This gives
us a nice compact implementation of InstructionProcessor:
This gives us something that, given a 32-bit RISC-V instruction, can output the disassembly. This can be used by the ISS to output disassembled instructions to a log file as they are executed like so:
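A hypothetical sketch of the logging step: produce the disassembly for the fetched instruction and format a log line with the PC and raw encoding. The outputter is stubbed down to a single function here; in rrs it would go through process_instruction:

```rust
pub struct InstructionStringOutputter {
    pub insn_pc: u32,
}

impl InstructionStringOutputter {
    // Stand-in for process_instruction(&mut outputter, insn)
    fn disassemble(&mut self, insn: u32) -> Option<String> {
        if insn & 0x7f == 0x33 {
            Some(String::from("add x1, x2, x3"))
        } else {
            None
        }
    }
}

fn log_insn(outputter: &mut InstructionStringOutputter, insn: u32) -> String {
    let dis = outputter
        .disassemble(insn)
        .unwrap_or_else(|| String::from("<unknown>"));
    format!("{:08x} {:08x} {}", outputter.insn_pc, insn, dis)
}
```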
Again I’ve skipped over implementations for the other instructions, check out the source to see how they work.
Memories
We need a way to implement a simulated memory. On top of ‘normal’ RAM, devices
are also represented as memories. For instance you could have a memory that,
when written to, outputs the byte written as a character to standard out. So we’ll
have a new trait, Memory, so we can implement different kinds of memory:
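A sketch of the trait plus a simple Vec-backed RAM implementation. The real trait in rrs-lib also extends Downcast (from the downcast-rs crate, discussed below); that part is left out here so the snippet is dependency-free, and the access-size details are simplified to 32-bit words:

```rust
pub trait Memory {
    /// Read 32 bits from the given address; None if the access fails
    fn read_mem(&mut self, addr: u32) -> Option<u32>;
    /// Write 32 bits to the given address, returning false if the access fails
    fn write_mem(&mut self, addr: u32, store_data: u32) -> bool;
}

/// A simple RAM backed by a Vec, addressed in 32-bit words
pub struct VecMemory {
    pub mem: Vec<u32>,
}

impl Memory for VecMemory {
    fn read_mem(&mut self, addr: u32) -> Option<u32> {
        self.mem.get((addr >> 2) as usize).copied()
    }

    fn write_mem(&mut self, addr: u32, store_data: u32) -> bool {
        match self.mem.get_mut((addr >> 2) as usize) {
            Some(word) => {
                *word = store_data;
                true
            }
            None => false,
        }
    }
}
```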
A straightforward interface with one function to read and another to write is
all we need, though it does have one peculiarity that’s needed due to how Rust
works, and that’s the Downcast bit.
Before explaining let’s take a look at a Memory implementation, MemorySpace.
The HartState structure has a single reference to a Memory that is used for
all load and store instructions but we may want multiple memories (e.g. a
‘normal’ RAM along with some devices). MemorySpace allows us to do this by
holding multiple memories along with a base address and size for each. When it gets a
read_mem or write_mem call it works out which memory the address belongs to
and forwards the call on. It uses the following function to add memories:
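A sketch of MemorySpace::add_memory: it takes ownership of a boxed Memory trait object along with its base address and size, rejects overlapping regions, and hands back the index later used to retrieve the memory. The error type is simplified to () here:

```rust
pub trait Memory {
    fn read_mem(&mut self, addr: u32) -> Option<u32>;
    fn write_mem(&mut self, addr: u32, store_data: u32) -> bool;
}

struct MemoryRegion {
    base: u32,
    size: u32,
    memory: Box<dyn Memory>,
}

pub struct MemorySpace {
    memory_regions: Vec<MemoryRegion>,
}

impl MemorySpace {
    pub fn new() -> MemorySpace {
        MemorySpace { memory_regions: Vec::new() }
    }

    pub fn add_memory(
        &mut self,
        base: u32,
        size: u32,
        memory: Box<dyn Memory>,
    ) -> Result<usize, ()> {
        // Reject regions that overlap an existing one
        for r in &self.memory_regions {
            if base < r.base + r.size && r.base < base + size {
                return Err(());
            }
        }
        self.memory_regions.push(MemoryRegion { base, size, memory });
        Ok(self.memory_regions.len() - 1)
    }
}
```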
The Box<dyn Memory> is a boxed trait object. Unlike when
you’re using generics, add_memory and MemorySpace itself don’t know the
actual type being used, so a dynamic dispatch mechanism is used when calling
read_mem and write_mem. This is useful here as it allows MemorySpace to
hold many different kinds of memory.
Users of rrs-lib may want to retain a reference to any memory they add to a
MemorySpace (if it’s a device there may be some other way it needs to be
interacted with, or they may want ‘backdoor’ access to the memory rather than
using the read_mem/write_mem functions). In most other languages this would be
simple: just hold a copy of the reference elsewhere. This is harder in Rust due
to its safety properties.
Rust provides automatic memory management without needing a garbage collector as well as providing various safety properties. It does this with some strict rules around what can have a reference to something and ensuring any allocated memory has a single owner.
A reference can be handed out to a structure provided that reference doesn’t outlive the owner of the structure. Then when the owner goes out of scope Rust knows it can safely deallocate the memory. Multiple references can exist to a structure but only one mutable reference can exist (which allows you to alter the structure). Something called the ‘borrow checker’ enforces these rules, ensuring reference lifetimes are strictly shorter than or the same as owner lifetimes. The borrow checker runs at compile time. This gives automatic memory management without run-time overhead but does restrict what you’re able to do.
The add_memory function takes ownership of whatever memory is added to it and
the borrow checker won’t allow whatever calls add_memory to retain a reference
(mutable or otherwise) to the memory as the borrow checker can’t ensure the
MemorySpace outlives the reference at compile time.
The way around this is to add functionality to MemorySpace which allows you to
get a reference or a mutable reference to a particular memory. add_memory
returns an index (the usize in the Result) which can be used to get a
reference to the memory again at a later point in the program.
We’re not quite done though; this is where the Downcast is important [2]. As you
might expect, Rust doesn’t just let you convert between types as you wish. In C++,
for instance, if we had a Memory* we could convert it to some other
SpecificMemory*. In Rust we cannot just convert our Box<dyn Memory> into a
Box<SpecificMemory> or a &SpecificMemory. The Downcast bit in our
Memory trait allows us to do this though, in a safe, checked way. Here’s the
implementation of get_memory_ref which gives you a reference to a previously
added memory:
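A sketch of get_memory_ref. The real implementation uses the downcast-rs crate; this sketch gets the same effect with std::any::Any (which is what downcast-rs builds on), via an explicit as_any method on the trait:

```rust
use std::any::Any;

pub trait Memory: Any {
    fn read_mem(&mut self, addr: u32) -> Option<u32>;
    fn write_mem(&mut self, addr: u32, store_data: u32) -> bool;
    /// Allows a &dyn Memory to be recovered as its concrete type
    fn as_any(&self) -> &dyn Any;
}

pub struct MemorySpace {
    pub memories: Vec<Box<dyn Memory>>,
}

impl MemorySpace {
    /// Returns Some(&T) only if the memory at `index` exists and really is a T
    pub fn get_memory_ref<T: Memory>(&self, index: usize) -> Option<&T> {
        self.memories.get(index)?.as_any().downcast_ref::<T>()
    }
}
```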
This is a generic function, a caller specifies the type they expect a memory to
be and it returns an Option which will be Some if the memory is of that
type, providing a reference to it and None if the memory is of a different
type or the index doesn’t exist.
Bringing it all together
We’ve now got all the bits and pieces we need to execute a RISC-V binary; we just
need to bring them together. First let’s look at the step function of
InstructionExecutor. This fetches an instruction from memory, executes it,
then increments the PC to point to the next instruction as needed:
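An abridged sketch of a step function, with the actual instruction execution stubbed out: fetch from memory at the PC, execute, then either leave the PC alone (a jump or taken branch already set it) or advance it by 4. The match arms mirror the nested Option<Result<..>> returned by process_instruction:

```rust
#[derive(Debug, PartialEq)]
pub enum InstructionException {
    IllegalInstruction(u32, u32),
    FetchError(u32),
}

pub struct Hart {
    pub pc: u32,
    pub mem: Vec<u32>, // word-addressed stand-in for a Memory implementation
}

impl Hart {
    fn fetch(&self) -> Option<u32> {
        self.mem.get((self.pc >> 2) as usize).copied()
    }

    /// Stand-in for process_instruction(self, insn)
    fn execute(&mut self, insn: u32) -> Option<Result<bool, InstructionException>> {
        match insn & 0x7f {
            0x33 => Some(Ok(false)),                                      // ALU op: PC not updated
            0x6f => { self.pc = self.pc.wrapping_add(8); Some(Ok(true)) } // fake jump: sets PC itself
            _ => None,                                                    // can't decode
        }
    }

    pub fn step(&mut self) -> Result<(), InstructionException> {
        let insn = match self.fetch() {
            Some(i) => i,
            None => return Err(InstructionException::FetchError(self.pc)),
        };

        match self.execute(insn) {
            // Instruction updated the PC itself (jump or taken branch)
            Some(Ok(true)) => Ok(()),
            // Otherwise move to the next sequential instruction
            Some(Ok(false)) => {
                self.pc = self.pc.wrapping_add(4);
                Ok(())
            }
            // Execution raised an exception
            Some(Err(e)) => Err(e),
            // Couldn't decode the instruction at all
            None => Err(InstructionException::IllegalInstruction(self.pc, insn)),
        }
    }
}
```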
This uses the pattern matching with the match statement I referenced above.
process_instruction returns a nested data type: Option<Result<bool, InstructionException>>. The outer Option is None if the instruction can’t be
decoded. If it is Some we have either an Ok or an Err in the inner Result.
We can use a single match to cover all possibilities by using pattern matching
on the inner structure.
That’s it for functionality in rrs-lib. I won’t go over all the details of how
rrs-cli works; let’s just take a look at the function that runs the simulation:
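A hypothetical sketch of an rrs-cli-style run loop: step until a control device asks for the simulation to stop. The executor and SimulationCtrlDevice are heavily stubbed (the names follow the prose but all the details here are illustrative):

```rust
pub struct SimulationCtrlDevice {
    pub stop_requested: bool,
}

pub struct Executor {
    pub pc: u32,
    pub steps_until_stop: u32,
    pub ctrl: SimulationCtrlDevice,
}

impl Executor {
    // Stand-in for InstructionExecutor::step
    fn step(&mut self) -> Result<(), ()> {
        self.pc = self.pc.wrapping_add(4);
        self.steps_until_stop -= 1;
        if self.steps_until_stop == 0 {
            self.ctrl.stop_requested = true; // simulated write to the device
        }
        Ok(())
    }
}

pub fn run_sim(executor: &mut Executor, log: bool) -> u64 {
    let mut insn_count: u64 = 0;

    loop {
        if executor.step().is_err() {
            break; // an exception terminates this simple loop
        }
        insn_count += 1;

        if log {
            // rrs-cli writes disassembly and register writes to a log here
            println!("pc: {:08x}", executor.pc);
        }

        // In rrs-cli this check reaches the SimulationCtrlDevice inside the
        // memory space via MemorySpace::get_memory_ref
        if executor.ctrl.stop_requested {
            break;
        }
    }

    insn_count
}
```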
It sets up an InstructionExecutor with a supplied HartState and
MemorySpace, which are created by rrs-cli before it calls run_sim. It
enters a loop and steps the executor, optionally writing disassembly (using
InstructionStringOutputter) and register writes out to a log. There is a
special SimulationCtrlDevice memory that the program can write to to stop the
simulation. MemorySpace::get_memory_ref is used to access this and see if a
stop has been requested, terminating the simulation if so.
Performance and Testing
Finally, a word on the performance of rrs and the testing methodology.
As a way to test both performance and the implementation I ran the CoreMark benchmark on rrs. The actual benchmark result is meaningless, but what is interesting is how many instructions per second rrs can execute. A run of 1000 CoreMark iterations on my Thinkpad 490s (with an Intel i7-8565U) executes 275063040 instructions in 5.28 s, giving 52.1 million instructions per second. My CPU has a turbo frequency of 4.6 GHz and 5-way decode, so it could execute at most 23 billion instructions per second, giving a ratio of ~440 real CPU instructions per emulated RISC-V instruction. In reality we won’t be near that theoretical maximum, so the ratio may be more like 200 or lower. A pretty decent result given I wasn’t trying hard to achieve high performance. There’s room for improvement though; at a later date I may take a look at how to improve performance.
I ran the same CoreMark binary on Spike using time to measure execution time.
This gave a wall clock time of 8.852 s for Spike vs 5.376 s for rrs-cli (total
run time including initialisation), so rrs is ~60% faster. Spike is a far more
complete and featureful simulator so it’s not an entirely fair comparison, but
rrs still stacks up well.
CoreMark also helps test rrs. It computes a checksum during the benchmark and checks against an expected value. This would fail if the benchmark executed incorrectly. CoreMark should give a reasonable workout of the RV32IM instruction set so this is a decent test of functionality.
On top of running CoreMark I’ve got a small set of tests in Rust. So far these test:
- The instruction format structures (e.g. instruction_formats::RType) decode instructions correctly.
- InstructionStringOutputter produces a correct string for all implemented instructions. This checks that both InstructionStringOutputter and process_instruction are doing the right thing.
- InstructionExecutor correctly executes a small program that uses each ‘kind’ of instruction (a branch and a load are different kinds, an add and a sub are not).
- The various Memory implementations function correctly.
In particular I don’t have any Rust tests exhaustively testing the execution of all instructions. Instead I plan to use tests like CoreMark and the RISC-V Architectural Test Framework. I’d want to run these tests regardless of other tests (as demonstrating successful execution of binaries is clearly important) and separate per-instruction tests in Rust would mostly duplicate this, so I decided to spend effort elsewhere.
What’s Next
There are a few things I’d like to work on next, in the rough order I’ll tackle them:
- An example RISC-V software build flow
- Run the RISC-V Architectural Test Framework
- Implement exception handling
- Implement page tables
- Implement a device tree for rrs-cli, either the DTS format used by Linux or something custom
- Get xv6-riscv running
- Implement debug support
- Get Linux running?

In doing some of these I’ll likely need to implement some more extensions (such as F for floating point and A for atomics).
[1] The Little Book of Rust Macros is a good guide to Rust macros if you want to learn more ↩︎
[2] This isn’t a native language feature; it’s an external crate, downcast-rs ↩︎