Enabling Low Impact, Rapid Debug for Highly Utilized FPGA Designs

Robert Hale
robert.hale@byu.edu

Brad Hutchings
brad_hutchings@byu.edu

Abstract—Inserting soft logic analyzers into FPGA circuits is a common way to provide signal visibility at run-time, helping users locate bugs in their designs. However, this can become infeasible for highly (70-90+% utilized) designs, which leave few logic resources or block RAMs available for internal logic analyzers. This paper presents a fast, low-impact method of enabling signal visibility in these situations using LUT-based distributed memory. Trace-buffers are inserted post-PAR allowing users to quickly change the set of observed nets. Results from routing-based experiments are presented which demonstrate that, even in highly utilized designs, many design signals can be observed with this technique.

I. INTRODUCTION

It is widely understood that debugging FPGA circuits in-system is extremely challenging because observing the behavior of signals typically requires the insertion of invasive debug circuitry. In-system test and debug is necessary because simulation is often unable to test the user circuit with rich data sets in a timely fashion. In the simplest case, inserted debug circuitry may consist solely of wires that route internal signals to external pads where they can be captured and observed with a logic analyzer. More commonly, complex debug circuitry that resembles the data recorder of a logic analyzer (referred to as an Integrated Logic Analyzer, or ILA) is inserted into the original user circuit and then placed and routed with it.

Inserting an ILA into an industrial design as it is finalized can be difficult or impossible, because engineers typically attempt to use as much of an FPGA device as possible in order to achieve lower cost. For example, once a circuit utilizes 90% of an FPGA device, or once a design consumes all available block RAMs (BRAMs), there may be too few resources left to insert and use an ILA.

The goal of this project is to provide an alternate debugging tool that can be used when ILA insertion is not possible or feasible. This debugging tool is implemented post-place/route to minimize impact on the user’s design and reduce implementation time. BRAMs are not used for trace memory, leaving these resources fully available to the user design.

This tool achieves observability by scavenging for unused shift-register LUTs (SRLs) in a design and using them, each paired with a 2-to-1 MUX, to capture and upload user-signal values at run-time. In Xilinx devices, SRLs are 16- or 32-bit shift registers that can be implemented on 50% of the device’s LUTs. Thus, unused SRLs are commonly available even in highly utilized devices. For each signal the user wishes to probe, the tool finds the closest available SRL, routes it to its respective signal, and then connects all SRLs used in this way into a single shift register. During circuit operation, user signal values are captured in these SRLs. Once the defined trigger signal occurs, signal capture stops, and the captured values are shifted out to the JTAG port via inserted BSCAN primitives. Though this SRL-based approach may not provide as deep a trace of user signals as an ILA might, it does provide observability when ILAs are infeasible or when it would take too long to place and route them into the user design.

This SRL-based debugging approach uses the recently introduced RapidWright [13] from Xilinx to add SRL primitives after place and route. RapidWright is an open source platform that provides an interface to Xilinx’s Vivado back-end implementation tools. This SRL-based approach queries the Vivado circuit database via RapidWright to do the following: (1) determine the location of available SRL primitives, and (2) place the SRL primitives near their corresponding user signal. Routing of the circuit to the SRLs is completed via the Vivado back-end routing tool.

II. RELATED WORK

A. Commercial Debug Tools

FPGA vendors provide internal debug tools to work with their hardware, such as Xilinx’s Integrated Logic Analyzer [3] and Altera’s SignalTap [2]. These tools offer a powerful method of viewing design signals in real time. However, they also require substantial resources for their implementation. In addition, these tools typically cannot be added to a user circuit post-map and require the entire user circuit to be re-placed and re-routed. If the engineer wants to change the set of nets being probed, this entire compilation process needs to be repeated. For complex and highly utilized designs, such iteration can consume an unacceptable amount of time. Some commercially available design tools include incremental functionality that attempts to reuse placed or routed elements to reduce compilation time; however, our tests of these features showed only trivial improvement. Finally, these tools require unused BRAMs, which are relatively limited on FPGA devices.

B. Academic Tools

Several research projects have attempted to address the issues with FPGA debug visibility. Ideas have included guessing what nets should be probed beforehand [5] [17] or attempting to probe nearly every net in the design [1]. Other techniques involve scanning the entirety of the FPGA’s current status [16] [15], referred to as readback. This gives full visibility and requires no added logic or use of FPGA resources, but requires stopping the DUT and offers no signal history.

One research effort has sought to imitate software debugging as closely as possible for FPGAs [11], including full breakpoints and live variable visibility. This comes at the expense of high FPGA fabric overhead. One similarity this method shares with our tool is avoidance of BRAM use, favoring distributed LUT memory.

Other researchers have proposed ways to reduce the time it takes to insert debug circuitry by instrumenting the circuit post place/route, or by partitioning the debug circuitry into a small area on the chip [7] [6] [4] [8]. Overlays have also been proposed to enhance debugging of FPGAs [9].

The most similar work to that presented here is that of Keeley and Hutchings [10]. Their debug tool is similarly instrumented post-implementation with RapidSmith [12]. However, the produced trace buffers are more similar to other research methods in their use of BRAM trace buffers.

Our SRL-based debug trace buffer tool differs from previous work because it focuses on designs that highly utilize the FPGA device (70% or higher). Alongside the benefit of reducing debug time with incremental insertion, our tool finds trace amounts of unused logic resources and utilizes them to enable debugging. Where other internal debug methods would fail due to high resource needs or exhausted BRAMs, our SRL trace buffers can squeeze into crowded designs and provide at least some amount of signal visibility.

III. SRL-BASED DEBUGGING

Instrumentation of our debug tool is executed using a software suite of Xilinx FPGA design editing tools called RapidWright [13]. RapidWright allows post-PAR modification of design files. RapidWright is similar to its parent tool, RapidSmith [12]. The instrumentation steps are listed below, starting with the user design.

a) **User Design:** The user creates a design and completes the Xilinx Vivado design implementation process. Nets to be debugged are marked for debug in the typical Vivado flow and the design is exported to RapidWright.

b) **Insert Probes:** SRL-based trace buffers, consisting of a 16-bit SRL coupled with a 2-to-1 MUX, are inserted into the design. One is created for each net requested for probing. The source tile of the debug net is identified, and the trace buffer is placed as close as possible to minimize timing issues. However, if needed, the trace buffer can extend to any location on the chip with unused LUTs. The trace buffer is then linked to the chain of other trace buffers in the design (see Figure 1).
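The placement step above can be pictured with a small sketch. It is not the RapidWright implementation: the tile coordinates, the Manhattan-distance metric, the greedy order, and every name in it (`Site`, `assign_trace_buffers`, the example nets) are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical (column, row) tile coordinate; not a RapidWright type.
Site = Tuple[int, int]


def manhattan(a: Site, b: Site) -> int:
    """Distance metric assumed here as a stand-in for routing distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def assign_trace_buffers(probes: Dict[str, Site],
                         free_sites: List[Site]) -> Dict[str, Site]:
    """Greedily give each probed net the closest still-unused SRL-capable site.

    `probes` maps a net name to the tile of its source; `free_sites` lists
    tiles that still have an unused SRL-capable LUT.
    """
    remaining = list(free_sites)
    placement: Dict[str, Site] = {}
    for net, source in probes.items():
        if not remaining:
            raise ValueError(f"no free SRL site left for net {net!r}")
        best = min(remaining, key=lambda site: manhattan(site, source))
        remaining.remove(best)  # each site hosts at most one trace buffer
        placement[net] = best
    return placement


# Made-up example data: three probed nets and four free sites.
probes = {"pc[0]": (10, 4), "ir[3]": (11, 4), "state[1]": (40, 90)}
free_sites = [(12, 4), (10, 5), (38, 88), (0, 0)]
placement = assign_trace_buffers(probes, free_sites)

# The order in which buffers are chained is recorded for later decoding.
chain_order = list(placement)
```

A greedy pass like this can leave a later net with a distant site, which matches the observation above that a trace buffer may end up anywhere on the chip that has unused LUTs.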

SRL-based trace buffers operate in two modes, operation mode and debug mode. In operation mode the MUX passes the value on the debugged net into the SRL. The SRL connects to the same clock that drives that net, recording each subsequent value. Data is stored in first-in, first-out fashion.

Debug mode is used after triggering has occurred and debug data is requested. In this mode, the MUXes of all MUX-SRL pairs chain the SRLs together, so that each pair passes its data to the next MUX-SRL pair. The last pair passes data to a BSCAN primitive that interfaces with the host. Because the number of MUX-SRL pairs instrumented into the design is known beforehand, the host knows how many data bits to collect.
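The two modes can be illustrated with a small behavioral model. This is a sketch under stated assumptions rather than the actual hardware: the class name `MuxSrlChain`, the zero fed into the first pair during debug mode, and the bit ordering (oldest sample leaves each SRL first) are all choices made here for illustration.

```python
from collections import deque
from typing import List

DEPTH = 16  # bits per SRL, matching the 16-bit trace buffers described above


class MuxSrlChain:
    """Behavioral model of a chain of MUX-SRL pairs (illustrative only)."""

    def __init__(self, n_pairs: int) -> None:
        # Index 0 of each deque is the oldest bit, i.e. the SRL output.
        self.srls = [deque([0] * DEPTH, maxlen=DEPTH) for _ in range(n_pairs)]

    def capture(self, net_values: List[int]) -> None:
        """Operation mode: each MUX selects its probed net for one clock."""
        for srl, bit in zip(self.srls, net_values):
            srl.append(bit)  # maxlen discards the oldest bit automatically

    def shift_out(self) -> int:
        """Debug mode: each MUX selects the previous pair's output."""
        outs = [srl[0] for srl in self.srls]  # sample outputs before shifting
        for i, srl in enumerate(self.srls):
            srl.append(outs[i - 1] if i > 0 else 0)  # assumed: first pair gets 0
        return outs[-1]  # the last pair feeds the BSCAN primitive

    def read_all(self) -> List[int]:
        """Collect exactly DEPTH bits per instrumented pair."""
        return [self.shift_out() for _ in range(DEPTH * len(self.srls))]


# Probe the two low bits of a counter for 20 clocks, then read the chain.
chain = MuxSrlChain(2)
for t in range(20):
    chain.capture([t & 1, (t >> 1) & 1])
bits = chain.read_all()
```

In this model the host must clock out `DEPTH * n_pairs` bits, and the data of the pair nearest the BSCAN primitive arrives first.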

c) **Route, BitGen:** After instrumentation, RapidWright produces a new, modified design checkpoint. Xilinx tools are then used to ensure the design is without error, route the design, and generate a bitstream. Route is performed incrementally, meaning that the original user design will be left undisturbed where possible and run-time is minimal. Xilinx bitgen then generates the bitstream.

d) **Debug:** The bitstream is downloaded to the FPGA and the hardware target is closed from program mode and reopened in JTAG mode. Once the debug system has been triggered, the engineer can send appropriate commands over JTAG to request data from the MUX-SRL chain. This data becomes visible at the user terminal. Scripting is then used to format this data and view it as a waveform. Since the order of the chained SRLs is known, data can be identified by net name automatically. The entire tool-chain was tested and found to correctly capture and display data from a simple counter. For the larger benchmark used in this paper, only routing experiments were performed.
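Because the chain order is known, splitting the uploaded bit stream back into per-net traces is mechanical. The sketch below assumes 16 bits per pair and that the pair nearest the BSCAN primitive is read first; the function names and the text rendering are hypothetical and are not the scripts used by the authors.

```python
from typing import Dict, List


def split_by_net(bits: List[int], chain_order: List[str],
                 depth: int = 16) -> Dict[str, List[int]]:
    """Cut the serial stream into one `depth`-bit trace per probed net."""
    if len(bits) != depth * len(chain_order):
        raise ValueError("unexpected number of bits for this chain")
    traces: Dict[str, List[int]] = {}
    # Assumed ordering: the last pair in the chain is shifted out first.
    for k, net in enumerate(reversed(chain_order)):
        traces[net] = bits[k * depth:(k + 1) * depth]
    return traces


def render(traces: Dict[str, List[int]]) -> str:
    """Draw each trace as a crude text waveform ('#' = 1, '_' = 0)."""
    return "\n".join(
        f"{net:<10}" + "".join("#" if b else "_" for b in samples)
        for net, samples in traces.items()
    )


# Made-up stream for a two-pair chain: 16 ones, then a toggling bit.
stream = [1] * 16 + [0, 1] * 8
traces = split_by_net(stream, ["cnt[0]", "cnt[1]"])
print(render(traces))
```

The length check mirrors the point made above that the host must know in advance how many MUX-SRL pairs were instrumented.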

IV. ROUTING EXPERIMENTS

The goal of this work is to show that the SRL-based debug approach can successfully route to a high percentage of user signals even when the user circuit nearly exhausts all resources on the FPGA device. Successful routing of user signals will depend on the availability of SRL primitives near the desired signal, routing congestion, etc. The approach to the experiments is as follows. For example, assume that you want to determine the likelihood of routing a set of 10 signals in a circuit that utilizes 90% of an FPGA.

1) Randomly select 10 signals from the user net-list.
2) Connect the selected signals to available SRLs.
3) Attempt to route the design.

Repeat this process N times (selecting a different random set of signals each time). If, for example, N/2 of the attempts route successfully, you may predict that, on average, 10 signals can be successfully routed in a highly utilized device about 50% of the time. For these experiments, N is equal to 200. The experiments are repeated for varying numbers of signals (from tens to thousands of signals) and are applied to three benchmark circuits (70%, 80%, and 90% LUT utilization of a Kintex UltraScale XCKU025 containing 145,440 CLB LUTs). Each benchmark circuit is created by implementing an array of LC-3 [14] soft processors. The success rate for the SRL-based approach is compared against the success rate for Vivado’s ILA by routing the ILA with the same sets of signals.
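The experimental procedure amounts to a Monte Carlo estimate of routing success. The sketch below shows its shape only: `try_route` is a hypothetical callback standing in for instrumentation plus a routing run, and the binomial standard error is an addition here to indicate how much an estimate from 200 trials can vary, not a statistic reported in the paper.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def run_trials(netlist: Sequence[str], n_signals: int, n_trials: int,
               try_route: Callable[[List[str]], bool],
               seed: int = 0) -> int:
    """Count how many random probe sets of size `n_signals` route successfully."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        chosen = rng.sample(list(netlist), n_signals)  # step 1: random selection
        if try_route(chosen):                          # steps 2-3: connect, route
            successes += 1
    return successes


def success_rate(successes: int, trials: int) -> Tuple[float, float]:
    """Return the estimated success probability and its binomial standard error."""
    p = successes / trials
    stderr = math.sqrt(p * (1.0 - p) / trials)
    return p, stderr


# Toy stand-in for a routing run: always succeeds (placeholder behavior only).
nets = [f"net_{i}" for i in range(1000)]
always_ok = run_trials(nets, n_signals=10, n_trials=200,
                       try_route=lambda chosen: True)

# The N/2 case from the text: 100 successes out of N = 200 attempts.
p, stderr = success_rate(100, 200)
```

For the N/2 example, the estimate is 0.5 with a standard error of roughly 0.035, so differences of a few percentage points between configurations should be read with some caution.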

During the instrumentation phase, checkpoint files containing the benchmark designs are edited using RapidWright tools. These tools insert and place MUX-SRL pairs in the benchmark design for each net marked for debug. Next, Vivado route is run to connect the pins between probed nets and the MUX-SRL pairs. Placement, however, is completed entirely during the RapidWright stage and user design logic
place and route. We chose a standard level of optimization in an attempt to give both tools a roughly equal chance at success, as well as keep runtime moderate. For our tool, incremental route is allowed to rip-up and replace nets only where needed. All of the above steps are completed in an automated fashion on a remote supercomputer. Each combination of design and probe count was attempted 200 times to show trends. Scripts record any errors produced during the steps as well as completion time in the case of success. A total of 8,400 experiments were conducted.

V. EXPERIMENTAL RESULTS

Results from the three benchmark designs tested are summarized in Figures 2, 3, and 4. As shown, regardless of design size, the SRL-based debug tool was able to probe a significantly higher number of design nets than Xilinx’s ILA tool. In addition to enabling debug at high design densities, insertion of the SRL-based debug logic was far faster than ILA insertion (see Table I). No results are shown for instrumentation time of the ILA in a 90% utilized design, since none of those experiments were successful.

For each successful experiment, regardless of design utilization, the SRL-based approach was able to place and route probes in the design without disturbing the placement of the user design in a fast, incremental fashion. In addition, no BRAMs were consumed. If these dense designs required the use of all BRAMs available on the chip our SRL-based debug tool could still be used to capture and view signal values.

VI. FUTURE WORK

Future work may include: 1) studying timing impacts caused by the instrumentation process, 2) designing improved triggering methods, 3) applying this tool to additional benchmark circuits, and 4) providing deeper, variable-length SRL-based trace buffers.

VII. CONCLUSION

We have presented an SRL-based FPGA debug tool that is capable of leveraging very small amounts of leftover logic resources in highly utilized designs. This paper presents a basic proof-of-concept of its ability to act as a fully triggered debugging tool, as well as the feasibility of probing a number of nets even in designs that utilize up to 90% of the FPGA. Though the SRL-based trace buffers are small relative to an ILA, they are able to provide observability when ILA-based techniques are infeasible.

REFERENCES


<table>
<thead>
<tr>
<th>Design LUT density</th>
<th>Average Time Consumed (minutes)</th>
</tr>
</thead>
<tbody>
<tr>
<td>70%</td>
<td>12</td>
</tr>
<tr>
<td>80%</td>
<td>15</td>
</tr>
<tr>
<td>90%</td>
<td>14.5</td>
</tr>
</tbody>
</table>

TABLE I

IMPLEMENTATION TIME FOR SRL AND ILA