Playing with the Pico Part 5 - Producing VGA Video

Lets talk about VGA signalling. VGA uses 5 wires, 3 are analog signals and represent red, green and blue, the higher the voltage the brighter the colour. The other 2 are sync and are digital signals, they tell the monitor how the colour signals relate to the pixels.

One of the sync signals is the horizontal sync (HSYNC), the other is the vertical (VSYNC). They both work in very similar ways, let’s take a look at the horizontal sync first. Below is a timing diagram that illustrates a single horizontal line of video, known as a scanline.

Timing diagram showing hsync pulse and visible line

The HSYNC is just a pulse between scanlines. It is ordinarily held high and goes low for the sync pulse. The critical thing is the timing, driven by the pixel clock. The pixel clock measures how quickly the RGB signals change, each tick of the pixel clock means a new RGB value and a new pixel. The timing of the sync pulse is measured in terms of the pixel clock. Along with the sync pulse we have the back porch and the front porch, no pixels are output during all three. This is known as the horizontal blanking period. On CRT monitors this was required to allow the beam time to get back to the left side of the screen. I’ve marked the front and back porch and sync pulse along with the visible line on the diagram with letters. Here’s the length of each in pixel clocks at 640x480:

The VSYNC does exactly the same thing, only the timing is done in scanlines. Again there are front and back porches around the sync pulse and a visible region. This means again there’s a point where nothing is displayed known as the vertical blanking interval. This is used by CRT monitors to give time for the beam to go back to the top of the screen. Here’s a timing diagram showing horizontal and vertical sync.

Timing diagram showing vsync pulse and visible frame

The PIO program described in the last blog will be the basis of a PIO program to produce the horizontal and vertical sync signals. For each line (both those in the visible portion and the blanking portion) we’ll have 4 command words, one for each of the sync pulse, back and front porch and the visible portion. The delay part of the command word gives the delay and the pin settings control the vertical and horizontal sync outputs.

The program requires one extra feature, a way to signal when we are at various parts of the line or frame. In particular we want something to sync up another PIO program, one that will output our pixel data, with the beginning of a visible line.

We’ll use a special destination of the out instruction to do this. out can write to an exec destination. What this does is shift data from the OSR and immediately execute it as a PIO instruction. This will be used to add a PIO instruction to the command word. The sync PIO program will sit in a loop, pulling a 32-bit command word from the TX FIFO doing three things:

  1. Set HSYNC and VSYNC based on the bits 0 and 1
  2. Execute the PIO instruction from the command (bits 17 - 2)
  3. Enter a delay loop, delaying the number of cycles from the command (bits 31 - 18)

This is just my previous test program with the addition of the instruction execution at 2. The instruction will either be nop doing nothing or irq raising an interrupt that will signal the beginning of a line. To produce a full frame of sync we just need to send the correct command words to the sync SM. Here’s the PIO code:

.program sync_out
    pull ; get next command word
    out pins, 2 ; first 2 bits are hsync/vsync
    out x, 14 ; next 14 bits are delay
    out exec, 16 ; final 16 bits are instruction to execute
wait_loop:
    jmp x--, wait_loop ; wait for the given delay value
.wrap

To go along with this we need something to produce the pixel data. We’ll use a separate SM I’ll call the line SM that just pulls pixel data out of a FIFO and writes it out to the pins. We’ll use a 15-bit format of 5 bits red, 5 bits green and 5 bits blue with an external digital to analog converter (DAC) to produce the varying voltages for the VGA signal (16 bit with an extra bit for green is more usual but I only implemented 5 bits per channel on my DAC).

The line SM will sync itself up with the sync SM using the wait instruction to wait for a start of line IRQ from the appropriate sync SM command word. When the wait is done it will stream the pixel data from the FIFO out to the pins. Initially I’m going to use the logic analyzer setup from the last blog to test things (no monitor connected to the pins) so to keep it simple we’ll have 10 pixels per line and only 6 bits per pixel (giving 1 byte per logic analyzer sample, 6 bits of pixel and 2 bits of sync). Here’s the line PIO program:

.program line_out
    wait 1 irq  4 ; wait til triggered by sync SM
    set x, 9 ; Setup x as pixel counter (10 pixels in a line)
line_loop:
    pull ; get next pixel
    out pins, 6 [7] ; output to the pins and delay (to give 10 cycles per loop)
    jmp x--, line_loop ; output next pixel if we have pixels left
end:
    set x, 0 ; output a 0 pixel after the visible line
    mov pins, x
.wrap

The pixel data needs to change at the pixel clock rate. For 640x480 that is 25.175 MHz. You can get away with not being too precise so the pixel clock will be 25 MHz, a nice multiple away from the Pico’s default 125 MHz (one pixel per 5 cycles). I’m actually going to produce graphics at 320x240 just to minimise the number of pixels to produce per frame but do this by outputting the same pixel twice in each line rather than producing a native 320x240 video signal. Overall this means the line PIO needs to output a new pixel value once every 10 cycles.

Here’s the command words we’re feeding the sync PIO program, delays don’t yet correspond to actual VGA timings, we’re just using a few low numbers to see what it produces. The VSYNC line is kept low and the HSYNC line switches with each new command word, rather than producing proper sync pulses. This will allow us to analyse how the delays relate to our output. We’ll look at everything in reference to when the out pins, 2 instruction executes on the sync SM (I’ll refer to it as the sync pin set). By switching the HSYNC line every new command word the positive and negative edges of the HSYNC line show us when the out pins, 2 instruction is being executed.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
// vsync always 0 for this test
// Execute nop, delay 10, hsync = 1
sync_timing_buffer[0] =
    (pio_encode_nop() << 16) | (10 << 2) | 1;
// Execute nop, delay 15, hsync = 0
sync_timing_buffer[1] =
    (pio_encode_nop() << 16) | (15 << 2) | 0;
// Execute nop, delay 15, hsync = 1
sync_timing_buffer[2] =
    (pio_encode_nop() << 16) | (15 << 2) | 1;
// Execute `irq 4` to start pixel output, delay 100, hsync = 1
// The visible line has 10 pixels, with a new pixel every 10 cycles
// needing 10 * 10 = 100 cycles. Set our delay to that for now so we'll
// roughly hit the right timings.
sync_timing_buffer[3] =
    (pio_encode_irq_set(false, 4) << 16) | (100 << 2) | 0;

For the pixel data we’ll just use numbers 1-40. The line PIO outputs 0 when we’re outside of the visible region so 1-40 is used so we can see when we’re in the pixel output loop part of the line PIO. Here’s the result showing the first line:

Timing diagram showing test hsync and vsync along with 10 pixels of visible line

Full code can be found on github

How does the delay value in the command word map to the cycles seen for each low/high period of the HSYNC? The sync PIO program has 5 instructions so with a delay of 0 you’d expect the HSYNC to be held high or low for 5 cycles. But it’s actually 6. The out exec, 16 is hiding an extra cycle, one cycle to shift an instruction from the OSR, another cycle to execute it. If we measure the high and low periods of the HSYNC line we find they each take the corresponding delay value + 6 as expected. E.g. look at the bit highlighted in blue where HSYNC is held high. This corresponds to sync_timing_buffer[2], that has a delay value of 15 and HSYNC is held high for 21 cycles (15 + 6).

The trickier part is how the timing of the command words relates to the line SM which outputs the pixel data. Look back at the sync program. After fetching a new command word it immediate sets the pins, reads the delay value and executes an instruction via out exec and delays after all of this is done. The irq instruction is used to trigger the pixel output from the line SM. The line SM will execute its set x, 9 instruction two cycles after the irq instruction is executed. One cycle after the irq is executed the wait 1 irq 4 instruction sees it’s been raised, so the cycle after than set x, 9 is executed. The two PIO programs line up like this:

|      Sync SM                        |                       Line SM                                     |
-----------------------------------------------------------------------------------------------------------
| out pins, 2                         | wait 1 irq  4 ; line SM waiting for irq 4                         |
| out x, 14                           | wait 1 irq  4                                                     |
| out exec, 16                        | wait 1 irq  4                                                     |
| [irq 4]      ;irq from command word | wait 1 irq  4                                                     |
| jmp x--, wait_loop                  | wait 1 irq  4 ; irq 4 set last cycle, so final cycle of wait here |
| jmp x--, wait_loop                  | set x, 9                                                          |
| jmp x--, wait_loop                  | pull                                                              |
| jmp x--, wait_loop                  | out pins, 6 [7]                                                   |

Looking at this you can see there’s 7 instructions between pins being set in the sync SM based on the command word and the first pixel being output in the line SM. This results in a delay of 7 cycles between the two. We can measure the gap on the logic analyzer output to confirm:

Timing diagram showing zoom of start of line

How about at the end of the line? What delay do we need in the sync SM command word for the visible part of the line so that the pin set of the next sync SM command word lines up with 0 pixel output at the end of the line? The Line SM outputs a new pixel every 10 cycles, so the time taken from the first pixel output to the 0 pixel output is 100 cycles. In total we have 107 cycles between the sync SM setting the pins and the 0 pixel output. Between pin settings in the sync SM we have 6 cycles + the delay value. To get the pin set to match with the 0 pixel output we want a total of 107 cycles between pin settings giving us a delay value of 101 (107 = 6 + 101). Our test command words used a delay value of 100 for the visible line, 1 less than we need. If we look at the logic analyzer the pin setting for the front porch command word happens 1 cycle before we see the 0 pixel, confirming our analysis.

Timing diagram showing zoom of end of line

Finally before we can produce a proper video signal we need to tweak our line PIO program. It sets x to 9 as a pixel counter, we’ll have 320 pixels each line and set only has 5-bits of immediate (so 31 is the biggest value we could set x to). We work around this using the input shift register to make 319 out of individual 5-bit pieces, we’ll also want to output 16 bits per pixel (the top bit gets ignored but then a pixel is 2 bytes making the image encoding simpler).

.program line_out
    wait 1 irq  4 ; wait til triggered by sync SM
    set x, 31 ; Setup x as a pixel counter (x = 319)
    in x, 5
    set x, 9
    in x, 27
    mov x, isr
line_loop:
    pull ifempty ; only pull if we've shifted out two pixels (32 bits)
    out pins, 16 [7] ; output to the pins and delay (to give 10 cycles per loop)
    jmp x--, line_loop ; output next pixel if we have pixels left
end:
    set x, 0 ; output a 0 pixel after the visible line
    mov pins, x
.wrap

The new pixel counter setup adds some extra delay between the sync SM pin set and the first pixel output. It’s 4 extra instructions so that delay goes from 7 to 11 cycles.

With that analysis done we can work out the actual delays we need for proper VGA 640 * 480 timings. Remember our pixel clock is 25 MHz and the pico is running at 125 MHz so the VGA timing values above need multiplying by 5. Here’s the final delays we’ll use

Note the pixel output delay applied to the back porch and visible line timings. The back porch command delays for fewer cycles than the actual back porch as there’s 11 cycles between the visible line doing its sync pin set and the first pixel appearing and the actual back porch must be timed to end with the first pixel appearing. Similarly the visible line gains those cycles so the front porch sync pin set lines up with the 0 pixel being output. Luckily as the HSYNC and VSYNC don’t actually have to change between the back porch, visible line and front porch the timing different between the sync pin set happening and the pixel output doesn’t matter. Here’s our final set of command words to give horizontal sync timing for the visible line.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
// Hsync pulse Execute nop, delay 474, vsync = 1, hsync = 0
visible_line_timing_buffer[0] =
    (pio_encode_nop() << 16) | (474 << 2) | 2;
// Back porch, execute nop, delay 223, vsync = 1, hsync = 1
visible_line_timing_buffer[1] =
    (pio_encode_nop() << 16) | (223 << 2) | 3;
// Visible line, execute irq 4, delay 3205, vsync = 1, hsync = 1
visible_line_timing_buffer[2] =
    (pio_encode_irq_set(false, 4) << 16) | (3205 << 2) | 3;
// Front porch, execute nop, delay 74, vsync = 1, hsync = 1
visible_line_timing_buffer[3] =
    (pio_encode_nop() << 16) | (74 << 2) | 3;

Next up is producing a full frame’s worth of sync. Luckily getting the horizontal sync and pixel output was the hard part. Vertical sync is timed in terms of lines so all we have to do is feed the same commands words we have for the visible line buffer with some slight tweaks. Producing the vsync pulse in the appropriate lines and only signalling the interrupt that starts line output in the visible lines. This gives us two extra command buffers:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
// Non visible lines use same timings for uniformity, always execute nop as
// no pixels are output.
// Sync command words for vsync front and back porch
vblank_porch_buffer[0] =
    (pio_encode_nop() << 16) | (474 << 2) | 2;
vblank_porch_buffer[1] =
    (pio_encode_nop() << 16) | (223 << 2) | 3;
vblank_porch_buffer[2] =
    (pio_encode_nop() << 16) | (3205 << 2) | 3;
vblank_porch_buffer[3] =
    (pio_encode_nop() << 16) | (74 << 2) | 3;

// Sync command words for vsync sync pulse
vblank_sync_buffer[0] =
    (pio_encode_nop() << 16) | (474 << 2) | 0;
vblank_sync_buffer[1] =
    (pio_encode_nop() << 16) | (223 << 2) | 1;
vblank_sync_buffer[2] =
    (pio_encode_nop() << 16) | (3205 << 2) | 1;
vblank_sync_buffer[3] =
    (pio_encode_nop() << 16) | (74 << 2) | 1;

There’s just one ‘vblank porch’ buffer for both front and back porch as they’re identical in terms of what happens each line.

To tie this all together we’ll use two DMA channels, one will feed the sync SM with command words and the other will feed the line SM with pixel data. We can use a DMA transfer complete interrupt handler to coordinate it all. For both we need to keep track of what line is next for each SM, noting it may not always be the same. The sync SM FIFO can hold 8 command words, 2 lines worth which the sync DMA will swiftly fill. Meanwhile we have 320 pixels, 160 32-bit words, for the line DMA to transfer. Thus the sync DMA channel can be a couple of lines a head of the line DMA channel. Here’s the interrupt handler:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
int current_timing_line = 0;
int current_display_line = 0;

void __not_in_flash_func(dma_irh)() {
    if (dma_hw->ints0 & (1u << sync_dma_chan)) {
        dma_hw->ints0 = 1u << sync_dma_chan;

        if (current_timing_line < 524) {
            current_timing_line++;
        } else {
            current_timing_line = 0;
        }

        // `current_timing_line` is the line we're about to stream out sync
        // command words for
        if (current_timing_line == 0 || (current_timing_line == 1)) {
            // VSync pulse for lines 0 and 1
            dma_channel_set_read_addr(sync_dma_chan, vblank_sync_buffer, true);
        } else if (current_timing_line < 35) {
            // VGA back porch following VSync pulse (lines 2 - 34)
            dma_channel_set_read_addr(sync_dma_chan, vblank_porch_buffer, true);
        } else if (current_timing_line < 515) {
            // Visible lines following back porch (lines 35 - 514)
            dma_channel_set_read_addr(sync_dma_chan, visible_line_timing_buffer, true);
        } else {
            // Front porch following visible lines (lines 515 - 524)
            dma_channel_set_read_addr(sync_dma_chan, vblank_porch_buffer, true);
        }
    }

    if (dma_hw->ints0 & (1u << line_dma_chan)) {
        dma_hw->ints0 = 1u << line_dma_chan;

        if (current_display_line < 479) {
            ++current_display_line;
        } else {
            current_display_line = 0;
        }

        dma_channel_set_read_addr(line_dma_chan, line_data_buffer, true);
    }
}

At this point it’d be great if we had sufficient memory to capture a full frame’s worth of pixel output and sync with the logic analyzer SM but we don’t. Instead to test things I temporarily dropped to 2 bits per pixel, maintained the same horizontal line timings but only output a small number of vertical lines, 8 visible lines, 1 VSYNC line, 3 front porch lines and 2 back proch lines. For the 2-bit pixel data I used an alternating 01, 10 pattern. This lets you see the pixels change as well as providing distinct visible and blank portions on the pixel data (only time we should see 00 is when we’re not outputting pixel data). With this restricted frame we can fit everything in memory, here’s the result from the logic analyzer:

Timing diagram showing full frame with a reduced number of lines

I zoomed around on this, measuring the various intervals, for example here’s a zoom of the beginning of a line measuring the back porch (the time between the sync pulse and the beginning of the visible line). It is 240 cycles long, so 240 / 5 = 48 pixel clocks, exactly the timing we need. I checked all of the timings and everything looked good. We are ready to produce video!

Timing diagram showing zoom of back porch

Well almost, how exactly are we generating the pixels to show? The observant will note the interrupt handler just has a buffer for a single line line_data_buffer, there is no frame buffer. This is for memory reasons. The Pico does just have enough memory for a 320x240 framebuffer at 16 bits per pixel (320 * 240 * 2 = 150 KiB) but I plan to generate pixel data on the fly rather than using a framebuffer. I’ll get into the details in the next blog, for now we’ll make do with a test pattern, one pixel of red, one of blue, one of green and one of black repeated over and over. I’ll put a white pixel at the beginning and end as well as make a modification to the interrupt handler to use a different buffer of entirely white pixels for the first and last lines. This will give a white border to the test pattern and allow us to check the edges of the screen end up being where we think they are. Here’s the modified part of the interrupt handler:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
if (dma_hw->ints0 & (1u << line_dma_chan)) {
    dma_hw->ints0 = 1u << line_dma_chan;

    if (current_display_line < 479) {
        ++current_display_line;
    } else {
        current_display_line = 0;
    }

    // `current_display_line` is the line we're about to stream pixel data
    // out for
    if ((current_display_line <= 1) || (current_display_line >= 478)) {
         // Output white border line for lines 0, 1, 478, 479 (two lines as
         // ultimately producing 320x240 resolution so each line is repeated
         // twice).
        dma_channel_set_read_addr(line_dma_chan, white_line_data_buffer, true);
    } else {
        // Otherwise use a pattern line
        dma_channel_set_read_addr(line_dma_chan, pattern_line_data_buffer, true);
    }
}

Full code can be found on github

Finally how do we actually connect the VGA cable up to the pico? One answer is the Pimironi Pico VGA Demo Base which I haven’t yet bought (indeed it wasn’t available when I started working on this) but it’s easy enough to rig it up yourself. The sync lines can connect straight through via current limiting resistor. For the R,G,B lines you can create a primitive but effective digital to analog convertor using just some resistors, see Hardware design with RP2040 for the gory details (Chapter 3 which describes a board very similar to the Pimoroni one). Here’s a photo of my breadboard setup wired into a VGA connector breakout (which you can obtain from ebay or amazon).

Breadboard with resistor based VGA DAC

With everything now in place let’s connect up the monitor, run our test program and see what happens…

Running the test pattern program with the breadboard DAC

Success! Here’s a screen close up so you can see the pattern better:

Zoom of the test pattern on the screen

Note that it’s important not to setup or try to use the USB UART (don’t call stdio_init_all, printf etc). It seems to interfere with the video signal generation and cause some visual artifacts, probably because of interrupts dealing with USB. The plain serial UART may work fine but I haven’t tried it out.

Next up we want to produce something more interesting than a test pattern which I’ll discuss in the next blog.