Publications

This page lists the research publications which have been carried out in the context of the HACC program, or papers that may be of interest to the HACC community.

Contribute

If you would like to contribute to this page by adding a reference to your publication, please follow the contribution guidelines.

Search publication by year:

2025, 2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016

2025

Title & Abstract	Author(s)	Institution	Link
A Hardware/Software Co-Design Approach for Versal-Based K-means Acceleration Abstract "K-means is a pivotal clustering algorithm widely accelerated through specific architectures to mitigate the computational load. Nevertheless, existing solutions lack a comprehensive co-design approach for optimization. By combining AI engines with the reconfigurability of FPGAs, this research focuses on Versal systems and proposes a co-design process that addresses both software algorithm optimization and hardware analysis. The proposed accelerator achieves a speedup of up to 21.41x compared to state-of-the-art solutions."	Eleonora Cabai et al.	Politecnico di Milano	Paper GitHub
A Unified Framework for Automated Code Transformation and Pragma Insertion Abstract "High-Level Synthesis compilers and Design Space Exploration tools have greatly advanced the automation of hardware design, improving development time and performance. However, achieving a good Quality of Results still requires extensive manual code transformations, pragma insertion, and tile size selection, which are typically handled separately. The design space is too large to be fully explored by this fragmented approach. It is too difficult to navigate this way, limits the exploration of potential optimizations, and complicates the design generation process. To tackle this obstacle, we propose Sisyphus, a unified framework that automates code transformation, pragma insertion, and tile size selection within a common optimization framework. By leveraging Nonlinear Programming, our approach efficiently explores the vast design space of regular loop-based kernels, automatically selecting loop transformations and pragmas that minimize latency. Evaluation against state-of-the-art frameworks, including AutoDSE, NLP-DSE, and ScaleHLS, shows that Sisyphus achieves superior Quality of Results, outperforming alternatives across multiple benchmarks. By integrating code transformation and pragma insertion into a unified model, Sisyphus significantly reduces design generation complexity and improves performance for FPGA-based systems."	Stéphane Pouget et al.	UCLA	Paper
Accelerated Phylogenetics on the AMD® Versal™ Adaptive SoC Abstract "Phylogenetics study the evolutionary history of organisms using an iterative procedure of creating and evaluating phylogenetic trees. This procedure is highly compute-intensive; constructing a large phylogenetic tree requires hundreds to thousands of CPU hours. Most phylogenetic analyses today rely either on maximum likelihood (ML) or Bayesian inference (BI) methods for inferring phylogenetic trees; the phylogenetic likelihood function (PLF) is employed in both ML and BI approaches as the tree-evaluation function, accounting for up to 95% of the overall analysis time. In this work, we explore the AMD® Versal™ Adaptive SoC architecture for accelerating the PLF that heavily relies on matrix multiplication operations. We find that the tight integration of domain-specific processors (AI Engines) with programmable logic (PL) in the Versal architecture is highly suitable for the needs of the PLF: we map the core operation of the PLF (matrix multiplication) to the AI Engines and deploy special-purpose units in the PL for PLF-specific data caching and input/output management, as well as numerical scaling that is a prerequisite for yielding numerically stable solutions for large-scale phylogenetic studies. We conducted a thorough performance analysis to leverage the platform capabilities to guide matrix-multiplication acceleration that requires close PL-AIE cooperation. We observed between 23.8x and 47.0x higher computational power of the Versal SoC than one x86 CPU core (both AMD® and Intel®) using AVX2 intrinsics, and between 3.7x and 5.9x higher performance than eight cores. For the full system, we observe comparable performance (±30%) with 8 CPU cores due to device memory access and PCIe® limitations, showing the potential of the Versal architecture in accelerating a distinct application from DSP/AI, its primary design focus."	Geert Roks et al.	University of Twente	Paper
AuroraFlow, an Easy-to-Use, Low-Latency FPGA Communication Solution Demonstrated on Multi-FPGA Neural Network Inference Abstract "Communication between FPGAs has become increasingly important in recent years, with a multitude of implementations available. In this work, we focus specifically on direct FPGA-to-FPGA connections with a serial protocol. We present AuroraFlow−which is built on top of the official IP for the Aurora protocol from AMD−and use it to implement multi-FPGA neural network inference. AuroraFlow makes it easy to integrate low-latency communication into existing FPGA applications using High-Level Synthesis (HLS). It achieves full throughput, has integrated monitoring to detect errors at run-time, and implements flow control, which prevents data loss caused by receiver overruns. In a circuit-switched optical network, we are able to achieve an average latency of down to 0.51 µs and an average throughput of up to 95.03 Gbit/s for a message size of 1 MiB. To demonstrate AuroraFlow beyond microbenchmarks, we execute distributed neuronal network inference on multiple data center FPGAs. For this, we use FINN, a framework for creating dataflow hardware designs for neural network inference. We extend FINN by integrating AuroraFlow to work with multiple FPGAs. We showcase the commonly referenced MobileNet network architecture as an example where both tools can be combined: While FINN takes care of creating the inference design itself, AuroraFlow manages the communication between the FPGAs. Using AuroraFlow gives us the possibility to keep the benefit of very low inference latency that FINN enables, while allowing us to scale beyond a single FPGA."	Gerrit Pape and Bjarne Wintermann et al.	Paderborn University	Paper
BPAP: FPGA Design of a RISC-like Processor for Elliptic Curve Cryptography Using Task-Level Parallel Programming in High-Level Synthesis Abstract "Popular technologies such as blockchain and zero-knowledge proof, which have already entered the enterprise space, heavily use cryptography as the core of their protocol stack. One of the most used systems in this regard is Elliptic Curve Cryptography, precisely the point multiplication operation, which provides the security assumption for all applications that use this system. As this operation is computationally intensive, one solution is to offload it to specialized accelerators to provide better throughput and increased efficiency. In this paper, we explore the use of Field Programmable Gate Arrays (FPGAs) and the High-Level Synthesis framework of AMD Vitis in designing an elliptic curve point arithmetic unit (point adder) for the secp256k1 curve. We show how task-level parallel programming and data streaming are used in designing a RISC processor-like architecture to provide pipeline parallelism and increase the throughput of the point adder unit. We also show how to efficiently use the proposed processor architecture by designing a point multiplication scheduler capable of scheduling multiple batches of elliptic curve points to utilize the point adder unit efficiently. Finally, we evaluate our design on an AMD-Xilinx Alveo-family FPGA and show that our point arithmetic processor has better throughput and frequency than related work."	Rares Ifrim et al.	National University of Science and Technology Politehnica Bucharest	Paper GitHub
Clementi: Efficient Load Balancing and Communication Overlap for Multi-FPGA Graph Processing Abstract "Efficient graph processing is critical in various modern applications, such as social network analysis, recommendation systems, and large-scale data mining. Traditional single-FPGA systems struggle to handle the increasing size and complexity of real-world graphs due to limitations in memory and computational resources. Existing multi-FPGA solutions face significant challenges, including high communication overhead caused by irregular data transfer patterns and workload imbalances stemming from skewed graph distributions. These inefficiencies hinder scalability and performance, highlighting a critical research gap. To address these issues, we introduce Clementi, an efficient multi-FPGA graph processing framework that features customized fine-grained pipelines for computation and cross-FPGA communication. Clementi uniquely integrates an accurate performance model for execution time prediction, enabling a novel scheduling method that balances workload distribution and minimizes communication overhead by overlapping communication and computation stages. Experimental results demonstrate that Clementi achieves speedups of up to 8.75x compared to state-of-the-art multi-FPGA designs, indicating significant improvements in processing efficiency as the number of FPGAs increases. This near-linear scalability underscores the framework's potential to enhance graph processing capabilities in practical applications."	Feng Yu et al.	NUS	Paper GitHub
Configurable DSP-Based CAM Architecture for Data-Intensive Applications on FPGAs Abstract "Content-addressable memory (CAM) is a type of fast memory unique in its ability to perform parallel searches of stored data based on content rather than specific memory addresses. They have been used in many domains, such as networking, databases, and graph processing. Field-programmable gate arrays (FPGAs) are an attractive platform for implementing CAMs because of their low latency, reconfigurability, and energy-efficient nature. However, such implementations also face significant challenges, including high resource utilization, limited scalability, and suboptimal performance due to the extensive use of look-up tables (LUTs) and block RAMs (BRAMs). These issues stem from the inherent limitations of FPGA architectures when handling the parallel operations required by CAMs, often leading to inefficient designs that cannot meet the demands of high-speed, data-intensive applications. To address these challenges, we propose a novel configurable CAM architecture that leverages the digital signal processing (DSP) blocks available in modern FPGAs as the core resource. By utilizing DSP blocks' data storage and logic capabilities, our approach enables configurable CAM architecture with efficient multi-query support while significantly reducing search and update latency for data-intensive applications. The DSP-based CAM architecture offers enhanced scalability, higher operating frequency, and improved performance compared to traditional LUT and BRAM-based designs. In addition, we demonstrate the effectiveness of our proposed CAM architecture with a triangle counting application on real graphs. This innovative use of DSP blocks also opens up new possibilities for high-performance, data-intensive applications on FPGAs."	Yao Chen et al.	NUS	Paper GitHub
Coyote v2: Raising the Level of Abstraction for Data Center FPGAs Abstract "In the trend towards hardware specialization, FPGAs play a dual role as accelerators for offloading, e.g., network virtualization, and as a vehicle for prototyping and exploring hardware designs. While FPGAs offer versatility and performance, integrating them in larger systems remains challenging. Thus, recent efforts have focused on raising the level of abstraction through better interfaces and high-level programming languages. Yet, there is still quite some room for improvement. In this paper, we present Coyote v2, an open source FPGA shell built with a novel, three-layer hierarchical design supporting dynamic partial reconfiguration of services and user logic, with a unified logic interface, and high-level software abstractions such as support for multithreading and multitenancy. Experimental results indicate Coyote v2 reduces synthesis times between 15% and 20% and run-time reconfiguration times by an order of magnitude, when compared to existing systems. We also demonstrate the advantages of Coyote v2 by deploying several realistic applications, including HyperLogLog cardinality estimation, AES encryption, and neural network inference. Finally, Coyote v2 places a great deal of emphasis on integration with real systems through reusable and reconfigurable services, including a fully RoCE v2-compliant networking stack, a shared virtual memory model with the host, and a DMA engine between FPGAs and GPUs. We demonstrate these features by, e.g., seamlessly deploying an FPGA-accelerated neural network from Python."	Benjamin Ramhorst and Maximilian J. Heer et al.	ETH Zurich	Paper
Coyote v2: Towards Open-Source, Reusable Infrastructure and Abstractions for FPGAs Abstract "We identify a number of requirements for future FPGA shells and describe Coyote v2, our first step towards a unified FPGA platform for cloud and datacenter acceleration, providing support similar to a conventional OS on a CPU"	Benjamin Ramhorst and Maximilian J. Heer et al.	ETH Zurich	Paper
Cuper: Customized Dataflow and Perceptual Decoding for Sparse Matrix-Vector Multiplication on HBM-Equipped FPGAs Abstract "Sparse matrix-vector multiplication (SpMV) is pivotal in many scientific computing and engineering applications. Considering the memory-intensive nature and irregular data access patterns inherent in SpMV, its acceleration is typically bounded by the limited bandwidth. Multiple memory channels of the emerging high bandwidth memory (HBM) provide exceptional bandwidth, offering a great opportunity to boost the performance of SpMV. However, ensuring high bandwidth utilization with low memory access conflicts is still non-trivial. In this paper, we present Cuper, a high-performance SpMV accelerator on HBM-equipped FPGAs. Through customizing the dataflow to be HBM-compatible with the proposed sparse storage format, the bandwidth utilization can be sufficiently enhanced. Furthermore, a two-step reordering algorithm and perceptual decoder-centric hardware architecture are designed to greatly mitigate read-after-write (RAW) conflicts, enhance the vector reusability and on-chip memory utilization. The evaluation of 12 large matrices shows that Cuper's geomean throughput outperforms the four latest SpMV accelerators HiSparse, GraphLily, Sextans, and Serpens, by 3.28x, 1.99x, 1.75x, and 1.44x, respectively. Furthermore, the geomean bandwidth efficiency shows 3.28x, 2.20x, 2.82x, and 1.31x improvements, while the geomean energy efficiency has 3.59x, 2.08x, 2.21x, and 1.44x optimizations, respectively. Cuper also demonstrates 2.51x throughput and 7.97x energy efficiency of improvement over the K80 GPU on 2,757 SuiteSparse matrices."	Enxin Yi et al.	China University of Petroleum-Beijing	Paper GitHub
Efficient and Distributed Computation of Electron Repulsion Integrals on AMD AI Engines Abstract "Computing electron repulsion integrals (ERIs) is the major computational bottleneck of many quantum mechanical simulation methods, requiring trillions of ERI evaluations per time step. While the computation of independent ERIs is embarrassingly parallel, the efficient computation of individual ERIs on modern processor cores is difficult due to both an insufficient cache size for intermediates of the computation and irregular memory access patterns that are difficult to vectorize. In this paper, we present how our implementation on the AI Engine (AIE) architecture addresses both of these problems. First, we have defined a flexible graph structure, which we call an ERI-Engine, that can be implemented for all 231 canonical ERI quartets from {ss\|ss} to {hh\|hh} by distributing the computation over 2-14 AIEs. Second, for the larger quartets, we have devised a novel vectorization scheme that leverages the advanced floating-point unit of the AIEs, while also supporting vectorization of independent ERIs for the smaller quartets. Finally, ERI-Engines are horizontally and vertically stackable to fill the entire AIE array, and in particular, the vertically stacked ERI-Engines form a column that uses one or more time-shared channels to stream the results out of the AIE array, almost completely hiding the computational phases of individual ERI-Engines. In terms of absolute performance, we are competitive with recent high-performance implementations of ERI algorithms on FPGAs (SERI) and GPUs (LibintX), as well as well-established highly optimized CPU libraries (Libint, Libcint), while being the unequivocal leader in terms of energy efficiency."	Johannes Menzel and Christian Plessl et al.	Paderborn University	Paper
Evaluating the Strong Scaling Potential of AI Engines for Molecular Dynamics Simulations Abstract "In molecular dynamics (MD) applications, there is a constant need to simulate ever longer timescales for a fixed number of particles. Here, novel spatial architectures such as the AI Engines (AIEs), with their low-latency interconnect between computational units, are promising candidates for breaking this strong scaling barrier. In this work, we have prototyped an MD simulation on the AIEs using cell lists that subdivide and map the simulation volume to AIEs for local computation. Between time steps, particles must be exchanged globally, for which we leveraged the on-chip switch network. We initially implemented this global communication using a ring topology, but a thorough evaluation revealed that this hindered strong scaling. Therefore, we are currently in the process of changing the global communication to a packet-switching and routing scheme to allow more direct particle exchange between AIEs. Here, we report first results and challenges posed by the architecture for an efficient implementation."	Mika Bröker et al.	Paderborn University	Paper
Exploring Large Language Models for Hierarchical Hardware Circuit and Testbench Generation Abstract "Designing and verifying hardware circuits using a Hardware Description Language (HDL) is an essential but time-consuming part of hardware design. Generating the desired correct circuit and testbench code usually requires a significant engineering effort. Recently, Large Language Models (LLMs) have claimed to have strong code generation capabilities to reduce such engineering costs. Existing work has provided quantitative evaluations using LLMs for single-module, simple circuit generation. However, it is still unclear whether modern LLMs are useful in production workflows, e.g., generating correct hierarchical circuits with testbenches. And if they are capable, what are the best prompt engineering practices for hardware design? We evaluate LLMs for HDL generation by exploring a 3-dimensional design space: commercial and open-source language models, single-module and hierarchical circuits, and prompting methods with varying complexity. We propose a 3-step design space exploration methodology to answer the two aforementioned questions. First, we explore the best prompt engineering practices across generating simple, middle, and hard single-module circuits with testbenches on CodeLLama-34B. We also define two fine-grained checklists to evaluate the circuit and testbench quality from a user's perspective. Second, we benchmark 11 LLMs with prompt adaptation on 4 single-module circuits that CodeLLama-34B has trouble with to further find models that may be useful in a production workflow. Third, we apply the learned prompt practices on four top-level models to generate simple 2 to 4-module and more complex multi-module hierarchical circuits and testbenches. As a result, we find that some of the latest LLMs can generate correct simple hierarchical circuits and testbenches with given proper prompts, but still struggle with complex hierarchical circuits. We further provide useful guidelines from an end-user's perspective on leveraging LLMs for hardware design."	Samuel Gomes Lopes et al.	ETH Zurich	Paper
Fast Graph Vector Search via Hardware Acceleration and Delayed-Synchronization Traversal Abstract "Vector search systems are indispensable in large language model (LLM) serving, search engines, and recommender systems, where minimizing online search latency is essential. Among various algorithms, graph-based vector search (GVS) is particularly popular due to its high search performance and quality. However, reducing GVS latency by intra-query parallelization remains challenging due to limitations imposed by both existing hardware architectures (CPUs and GPUs) and the inherent difficulty of parallelizing graph traversals. To efficiently serve low-latency GVS, we co-design hardware and algorithm by proposing Falcon and Delayed-Synchronization Traversal (DST). Falcon is a hardware GVS accelerator that implements efficient GVS operators, pipelines these operators, and reduces memory accesses by tracking search states with an on-chip Bloom filter. DST is an efficient graph traversal algorithm that simultaneously improves search performance and quality by relaxing traversal orders to maximize accelerator utilization. Evaluation across various graphs and datasets shows that Falcon, prototyped on FPGAs, together with DST, achieves up to 4.3X and 19.5X lower latency and up to 8.0X and 26.9X improvements in energy efficiency over CPU- and GPU-based GVS systems."	Wenqi Jiang et al.	ETH Zurich	Paper
FINN-HPC: Closing the Gap for Energy-Efficient Neural Network Inference on FPGAs in HPC Abstract "In recent years, neural network (NN) training and inference have become one of the most impactful topics in society, research and industry. With the rise in popularity of NNs and the increase in their capabilities and size, new challenges surface, such as a high energy consumption. This paper presents a comprehensive evaluation of current NN inference soft- and hardware configurations within High-Performance Computing (HPC) environments, with a focus on both performance metrics and energy consumption. NN quantization and the use of accelerators such as FPGAs can improve throughput and energy efficiency. This paper focuses on FINN, an efficient NN inference framework for FPGAs, highlighting its current lack of support for HPC systems. We provide an in-depth analysis of FINN in order to implement extensions and a new user-space driver for FINN to optimize the end-to-end execution for the usage in the HPC environment. We thoroughly evaluate the performance and energy efficiency gains using the newly implemented optimizations for a NN requiring low latency and high throughput, as well as for a transformer model. The results are compared against existing NN accelerators for HPC. Our results show that with our newly developed FINN HPC driver, FPGAs have a real use case depending on input sizes and network. For the simpler model, we show that for smaller input sizes, FPGAs with our driver can provide up to 7.81× higher throughput and 88% lower energy consumption over the GPU reference, which however still performs better for larger input sizes. Additionally, we evaluate the newly implemented FINN HPC driver on the significantly more complex transformer based model. We show that for smaller input sizes, our optimizations results in an up 6.51× higher throughput, while at the same time reducing energy consumption by 87% and latency by 85%."	Linus Jungemann et al.	Paderborn University	Paper
HBMex: An Attachment for Nonbursting Accelerators to Enhance HBM Performance Abstract "Modern FPGAs integrate High Bandwidth Memory (HBM), offering up to 12× the DDR bandwidth distributed across multiple memory interfaces. To utilize the most of HBM's theoretical bandwidth, accelerators typically issue long bursts and exploit data locality. However, some applications like sparse matrix-vector multiplication (SpMV) and graph analytics often exhibit irregular, nonbursting memory access patterns, which hinder performance. Additionally, the HBM interconnect, essential for accessing multiple interfaces, may stall requests under certain conditions. This work introduces HBMex, a novel module designed to enhance HBM throughput for accelerators with irregular access patterns. Positioned between the accelerator and the HBM, HBMex improves parallelism by distributing memory requests across interfaces and mitigates stalls caused by the interconnect. We evaluate HBMex using memory access microbenchmarks and an SpMV accelerator, demonstrating throughput gains of up to 37% across real-world workloads compared to vendor-provided solutions. HBMex is distributed as a highly-configurable and open-source RTL generator."	Canberk Sönmez et al.	EPFL	Paper
Iceberg: Enhancing HLS Modeling with Synthetic Data Abstract "Deep learning-based prediction models for High-Level Synthesis (HLS) of hardware designs often struggle to generalize. In this paper, we study how to close the generalizability gap of these models through pretraining on synthetic data and introduce Iceberg, a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels. Iceberg improves the geometric mean modeling accuracy by 86.4% when adapt to six real-world applications with few-shot examples and achieves a 2.47× and a 1.12× better offline DSE performance when adapting to two different test datasets."	Zijian Ding et al.	UCLA	Paper GitHub
InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs Abstract "The rise of deep neural networks (DNNs) has driven an increased demand for computing power and memory. Modern DNNs exhibit high data volume variation (HDV) across tasks, which poses challenges for FPGA acceleration: conventional accelerators rely on fixed execution patterns (dataflow or sequential) that can lead to pipeline stalls or necessitate frequent off-chip memory accesses. To address these challenges, we introduce the Inter-Task Auto-Reconfigurable Accelerator (InTAR), a novel accelerator design methodology for HDV applications on FPGAs. InTAR combines the high computational efficiency of sequential execution with the reduced off-chip memory overhead of dataflow execution. It switches execution patterns automatically with a static schedule determined before circuit design based on resource constraints and problem sizes. Unlike previous reconfigurable accelerators, InTAR encodes reconfiguration schedules during circuit design, allowing model-specific optimizations that allocate only the necessary logic and interconnects. Thus, InTAR achieves a high clock frequency with fewer resources and low reconfiguration time. Furthermore, InTAR supports high-level tools such as HLS for fast design generation. We implement a set of multi-task HDV DNN kernels using InTAR. Compared with dataflow and sequential accelerators, InTAR exhibits 1.8× and 7.1× speedups correspondingly. Moreover, we extend InTAR to GPT-2 medium as a more complex example, which is 3.65 ~39.14× faster and a 1.72 ~ 10.44× more DSP efficient than SoTA accelerators (Allo and DFX) on FPGAs. Additionally, this design demonstrates 1.66 ~ 7.17× better power efficiency than GPUs."	Zifan He et al.	UCLA	Paper GitHub
Leda: Leveraging Tiling Dataflow to Accelerate SpMM on HBM-Equipped FPGAs for GNNs Abstract "Graph neural networks (GNNs) play a pivotal role in extracting insightful representations from graph-structured data, driving advancements across diverse domains. Central to GNNs is the sparse matrix-dense matrix multiplication (SpMM) kernel. However, challenges arise in accelerating SpMM due to the high sparsity and randomly distributed non-zeros in graph matrices. Recently, the high concurrency capability of high bandwidth memory (HBM) has provided a new opportunity for SpMM acceleration. Nonetheless, accelerating SpMM on HBM FPGAs is still non-trivial due to load imbalance and the random memory access patterns. In this paper, we present Leda, a high-performance SpMM accelerator on HBM-equipped FPGAs for GNNs. Leda utilizes a tiling sparse format to balance the workload between processing engines (PEs) and enhance data locality. The improved outer-product dataflow scheduling strategy mitigates the random memory access bottlenecks. Furthermore, we introduce a minimum similarity reordering algorithm to significantly optimize read-after-write (RAW) dependencies, thereby improving compute occupancy and throughput. Finally, we propose a highly parallel hardware architecture design based on customized dataflow and explore the reusability of input matrix. The evaluation demonstrates that Leda's geomean throughput and energy efficiency surpass Sextans and SDMA, the state-of-the-art SpMM accelerators on HBM FPGAs, and K80 GPU by 1.27x and 1.36x, 1.85x and 2.00x, 1.95x and 5.23x, respectively."	Enxin Yi et al.	China University of Petroleum-Beijing	Paper GitHub
LLM Acceleration on FPGAs: A Comparative Study of Layer and Spatial Accelerators Abstract "Large Language Models (LLMs) have emerged as the leading technique in Natural Language Processing due to their remarkable capabilities. These models are exceptionally complex, often utilising billions of parameters and consuming tens of gigabytes of memory unless optimised. Typically, LLMs are trained and accelerated using general-purpose Graphics Processing Units (GPUs), with a single inference potentially occupying an entire GPU. Consequently, cloud services may require hundreds of GPUs to deliver high-quality AI assistant services. This demand for extensive hardware acceleration allows exploring alternative architectures that offer improved execution time and energy efficiency.This study evaluates Field-Programmable Gate Arrays (FP-GAs), renowned for their flexibility and efficiency, to accelerate LLMs. We specifically compare the resource consumption and latency of two distinct architectures: layer and spatial accelerators. Analysing the Llama 2-7B model, we identified potential optimisation opportunities within its composition and operational graphs. Our most successful implementations demonstrate a performance improvement ranging from 1.37× to 10.98× over two AMD EPYC processors with 64 cores each. Moreover, our results indicate that, with further refinement, FPGAs have the potential to surpass GPU performance for LLM inference, showcasing their feasibility as a viable alternative for this demanding application."	Luis D. Prieto-Sibaja et al.	Costa Rica Institute of Technology	Paper
Machine Learning-based Deep Packet Inspection at Line Rate for RDMA on FPGAs Abstract "FPGAs are becoming increasingly important in the cloud and data centers, especially as network-attached accelerators or reconfigurable Network Interface Cards (NICs). In the cloud, Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCEv2) has emerged as the de facto standard protocol for data transport due to its low latency and high throughput. However, RDMA has several access control weaknesses limiting its applicability in the cloud. In this paper, we explore using machine learning-based deep packet inspection (DPI) as an enhancement to an open-source FPGA RDMA stack. The ultra low-latency ML model is integrated on the RDMA datapath and allows for detection of specific content in RDMA payloads (e.g., executables) at a line rate of 100Gbps while using less than 1% of the available resources. Compared with existing work, our solution operates on the full message payload, at the transport level, and on a complete RDMA stack without sacrificing compatibility with RoCEv2 and its native performance characteristics, proving its potential as an end-to-end solution."	Maximilian Heer, Benjamin Ramhorst et al.	ETH Zurich	Paper
Neural Network Inference in High-Performance Computing: Closing the Gap for FINN based Reconfigurable Accelerators Abstract "In recent years, Neural Networks (NNs) have become one of the most prevailing topics in computers science, both in research and in industry. NNs are used for data analysis, natural language processing, autonomous driving and more. As such, NNs also see more application and use in High-Performance Computing (HPC). At the same time, energy efficiency has become an increasingly critical topic. NNs use large amounts of energy for operation, which in return results in large amounts of CO2 emissions. This work presents a comprehensive evaluation of current NN inference soft- and hardware configurations within High-Performance Computing (HPC) environments, with a focus on both performance metrics and energy consumption. NN quantization and accelerators such as FPGAs allow for an increased inference efficiency, both in terms of throughput and energy. Therefore, this work focuses on FINN, an efficient NN inference framework for FPGAs, highlighting its current lack of support for HPC systems. We provide an in-depth analysis of FINN in order to implement extensions to optimize the end-to-end execution for the usage in the HPC environment. We thoroughly evaluate the performance and energy efficiency gains using newly implemented optimizations and compare it against existing NN accelerators for HPC. With our extensions of FINN, we were able to achieve a 1847× higher throughput, while also decreasing the latency on average by 0.9978× and EDP by 0.9979× on an Alveo U55C FPGA. Data flow based NN inference accelerators on an FPGA should be used if the performance and energy footprint of the inference process is crucial, and the batch sizes are small to medium. For extremely large batch sizes and a very limited time for network-to-accelerator (less than a few days), using GPUs is still the way to go. Our results show that with the newly developed driver, we outperform a high-end Nvidia A100 GPU by up to 7.81x in throughput, while having a 0.87x lower latency and 0.88x lower energy delay product."	Linus Jungemann et al.	Paderborn University	Paper
OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design Abstract "OpenRTLSet1 introduces the largest fully open-source dataset for hardware design, offering over 127,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (98k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code sample, enabling fine-tuning of various language model families (e.g., Qwen and Granite) for Verilog code generation. Our dataset explores multiple options, including Verilator-generated C++ files as additional context during labeling, quantization techniques (INT4 vs. BF16), and performance differences across model sizes (7B-32B parameters). OpenRTLSet demonstrates that open-source approaches can achieve superior performance in hardware design tasks, establishing a new foundation for accessible research and commercial use in this domain."	Jinghua Wang et al.	UIUC	Paper
Parallel Accurate Minifloat MACCs for Neural Network Inference on Versal FPGAs Abstract "Machine learning (ML) is ubiquitous in contemporary applications. Its need for efficient acceleration has driven vast research efforts into the quantization of neural networks with low-precision numerical formats. Models quantized with minifloat formats of eight or fewer bits have proven capable of outperforming models quantized into same-size integers. However, unlike integers, minifloats require accurate accumulation to prevent the introduction of rounding errors. We explore the design space of parallel accurate minifloat multiply-accumulators (MACCs) targeting the AMD VersalTM FPGA fabric. We experiment with three variations of the multiply-and-shift and adder tree components of a minifloat MACC. For comparison, we apply similar alterations to a parallel integer MACC. Our results show that custom compressor trees with external sign-inversion gates reduce the mean area of the minifloat MACCs by 17.7% and increase their clock frequency by 16.2%. In comparison, custom compressor trees with absorbed partial product generation gates reduce the mean area of integer MACCs by 28.1% and increase their clock frequency by 3.60%. Comparing the best-performing designs, we observe that minifloat MACCs consume 20% to 180% more resources than integer ones with same-size operands without accounting for a conversion back into a floating-point format, and 60% to 300% more resources when including it. Our data enable engineers to make informed decisions in their designs of deeply integrated embedded ML solutions when trading off training and fine-tuning effort versus resource cost."	Hans Jakob Damsgaard et al.	Tampere University	Paper
Performance-Portable implementation of the shallow water equations on CPUs, GPUs, and FPGAs Abstract "The shallow water equations describe fluid flows where the horizontal length scales are much larger than the vertical scales. Applications for the shallow water equations can be found in the modelling of tides, estuaries, tsunamis, floods, or atmospheric flows. We present an implementation of the shallow water equations for coastal ocean domains, running on CPUs, GPUs, and FPGAs by utilizing SYCL as a platform-agnostic programming model."	Markus Büttner et al.	Paderborn University	Paper
QUEKUF: An FPGA Union Find Decoder for Quantum Error Correction on the Toric Code Abstract "Quantum computing represents an exciting computing paradigm that promises to solve problems untractable for a classical computer. The main limiting factor for quantum devices is the noise impacting qubits, which hinders the superpolynomial speedup promise. Thus, although Quantum Error Correction (QEC) mechanisms are paramount, QEC demands high speed and low latency to scale quantum computations to real-life-sized problems. Within this context, hardware accelerators, such as Field Programmable Gate Arrays (FPGAs), represent a valuable approach to fulfilling QEC requirements. Nevertheless, the literature falls short in proposing solutions targeting the toric code, a type of quantum Low-Density Parity Check code capable of encoding two logical qubits, thus requiring fewer physical qubits. This manuscript presents QUEKUF, an FPGA-based QEC dataflow architecture dealing with the toric code. QUEKUF disposes of parallel processing units to spatially parallelize QEC, which a centralized controller orchestrates for data movement and operation decisions. We also provide a latency-oriented resource optimization model to identify the best theoretical configuration of QUEKUF that minimizes latency and optimizes resource requirements based upon high-level quantum parameters. Experimental results show that QUEKUF attains up to 7.3x speedup and 81.51x improvement in energy efficiency over a C++ implementation with error-free syndromes while keeping high accuracy."	Federico Valentino et al.	Politecnico di Milano	Paper GitHub
Reconfigurable Stream Network Architecture Abstract "As AI systems grow increasingly specialized and complex, managing hardware heterogeneity becomes a pressing challenge. How can we efficiently coordinate and synchronize heterogeneous hardware resources to achieve high utilization? How can we minimize the friction of transitioning between diverse computation phases, reducing costly stalls from initialization, pipeline setup, or drain? Our insight is that a network abstraction at the ISA level naturally unifies heterogeneous resource orchestration and phase transitions. This paper presents a Reconfigurable Stream Network Architecture (RSN), a novel ISA abstraction designed for the DNN domain. RSN models the datapath as a circuit-switched network with stateful functional units as nodes and data streaming on the edges. Programming a computation corresponds to triggering a path. Software is explicitly exposed to the compute and communication latency of each functional unit, enabling precise control over data movement for optimizations such as compute-communication overlap and layer fusion. As nodes in a network naturally differ, the RSN abstraction can efficiently virtualize heterogeneous hardware resources by separating control from the data plane, enabling low instruction-level intervention. We build a proof-of-concept design RSN-XNN on VCK190, a heterogeneous platform with FPGA fabric and AI engines. Compared to the SOTA solution on this platform, it reduces latency by 6.1x and improves throughput by 2.4x–3.2x. Compared to the T4 GPU with the same FP32 performance, it matches latency with only 18% of the memory bandwidth. Compared to the A100 GPU at the same 7nm process node, it achieves 2.1x higher energy efficiency in FP32."	Chengyue Wang et al.	UCLA	Paper GitHub
RoCE BALBOA: Towards FPGA-enhanced RDMA Abstract "We explore the use of FPGA-based SmartNICs in the context of capability-enhanced RDMA, by combining a self-developed, open-source and fully RoCE v2-compatible networking stack with exemplaric hardware modules for AESencryption, snappy-compression, and machine learning-based Deep Packet Inspection (DPI) for access control that utilize stream computing to achieve 100G line rate speed without any performance overhead on the host CPU (Figure 1). Furthermore, our design allows to add any other user-defined application directly on the RDMA data streams, leveraging the reconfigurable fabric for offloading network processing from the host CPU to the FPGA-NIC at full line rate."	Maximilian J. Heer et al.	ETH Zurich	Paper
RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs Abstract "Data-intensive applications in data centers, especially machine learning (ML), have made the network a bottleneck, which in turn has motivated the development of more efficient network protocols and infrastructure. For instance, remote direct memory access (RDMA) has become the standard protocol for data transport in the cloud as it minimizes data copies and reduces CPU-utilization via host-bypassing. Similarly, an increasing amount of network functions and infrastructure have moved to accelerators, SmartNICs, and in-network computing to bypass the CPU. In this paper we explore the implementation and deployment of RoCE BALBOA, an open-source, RoCE v2-compatible, scalable up to hundred queue-pairs, and 100G-capable RDMA-stack that can be used as the basis for building accelerators and smartNICs. RoCE BALBOA is customizable, opening up a design space and offering a degree of adaptability not available in commercial products. We have deployed BALBOA in a cluster using FPGAs and show that it has latency and performance characteristics comparable to commercial NICs. We demonstrate its potential by exploring two classes of use cases. One involves enhancements to the protocol for infrastructure purposes (encryption, deep packet inspection using ML). The other showcases the ability to perform line-rate compute offloads with deep pipelines by implementing commercial data preprocessing pipelines for recommender systems that process the data as it arrives from the network before transferring it directly to the GPU. These examples demonstrate how BALBOA enables the exploration and development of SmartNICs and accelerators operating on network data streams."	Maximilian J. Heer et al.	ETH Zurich	Paper
Rock the QASBA: Quantum Error Correction Acceleration via the Sparse Blossom Algorithm on FPGAs Abstract "Quantum computing is a new paradigm of computation that exploits principles from quantum mechanics to achieve an exponential speedup compared to classical logic. However, noise strongly limits current quantum hardware, reducing achievable performance and limiting the scaling of the applications. For this reason, current noisy intermediate-scale quantum devices requireQuantum Error Correction (QEC) mechanisms to identify errors occurring in the computation and correct them in real time. Nevertheless, the high computational complexity of QEC algorithms is incompatible with the tight time constraints of quantum devices. Thus, hardware acceleration is paramount to achieving real-time QEC. This work presents QASBA, an FPGA-based hardware accelerator for the Sparse Blossom Algorithm (SBA), a state-of-the-art decoding algorithm. After profiling the state-of-the-art software counterpart, we developed a design methodology for hardware development based on the SBA. We also devised an automation process to help users without expertise in hardware design in deploying architectures based on QASBA. We implement QASBA on different FPGA architectures and experimentally evaluate resource usage, execution time, and energy efficiency of our solution. Our solution attains up to 25.05× speedup and 304.16× improvement in energy efficiency compared to the software baseline."	Marco Venere et al.	Politecnico di Milano	Paper
SAT-Accel: A Modern SAT Solver on a FPGA Abstract "Boolean satisfiability (SAT) solving is the first known NP-complete problem and is widely used in many application domains. Over the years, there have been so many consistent improvements in this area such that larger instances can be solved relatively quickly. Although these improvements have found their way onto CPU implementations, there has been limited progress adopting this on hardware accelerators mainly because it is difficult to implement the dynamic data structures needed to support a modern SAT solving algorithm. In this work, we present SAT-Accel, an algorithm-hardware co-design solver that applies many of the core improvements found in modern SAT solvers. SAT-Accel uses a novel memory management system and representation that supports the dynamic data structures required by a modern SAT solving algorithm. Our design can achieve on average a 17.9x speedup against MiniSat, the previous state of the art CPU solver, and a 2.8x speedup against Kissat, the current state of the art CPU solver. Compared to the current state-of-the-art stand-alone hardware accelerator, SAT-Hard, SAT-Accel achieves on average 800.0x speedup"	Michael Lo et al.	UCLA	Paper
Self-Validated Learning for Particle Separation: A Correctness-Based Self-Training Framework Without Human Labels Abstract "Non-destructive 3D imaging of large multi-particulate samples is essential for quantifying particle-level properties, such as size, shape, and spatial distribution, across applications in mining, materials science, and geology. However, accurate instance segmentation of particles in tomographic data remains challenging due to high morphological variability and frequent particle contact, which limit the effectiveness of classical methods like watershed algorithms. While supervised deep learning approaches offer improved performance, they rely on extensive annotated datasets that are labor-intensive, error-prone, and difficult to scale. In this work, we propose self-validated learning, a novel self-training framework for particle instance segmentation that eliminates the need for manual annotations. Our method leverages implicit boundary detection and iteratively refines the training set by identifying particles that can be consistently matched across reshuffled scans of the same sample. This self-validation mechanism mitigates the impact of noisy pseudo-labels, enabling robust learning from unlabeled data. After just three iterations, our approach accurately segments over 97% of the total particle volume and identifies more than 54,000 individual particles in tomographic scans of quartz fragments. Importantly, the framework also enables fully autonomous model evaluation without the need for ground truth annotations, as confirmed through comparisons with state-of-the-art instance segmentation techniques. "	Philipp D. Lösel et al.	Australian National University	Paper
Soaring with TRILLI: An HW/SW Heterogeneous Accelerator for Multi-Modal Image Registration Abstract "3D rigid image registration is a pivotal procedure in computer vision that aligns a floating volume with a reference one to correct positional and rotational distortions. It serves either as a stand-alone process or as a pre-processing step for non-rigid registration, where the rigid part dominates the computational cost. Various hardware accelerators have been proposed to optimize its compute-intensive components: geometric transformation with interpolation and similarity metric computation. However, existing solutions fail to address both components effectively, as GPUs excel at image transformation, while FPGAs in similarity metric computation. To close this gap, we propose TRILLI, a novel Versal-based accelerator for image transformation and interpolation. TRILLI optimally maps each computational step on the proper heterogeneous hardware component. TRILLI achieves speedup of 5.32x against the top performing GPU-based solution, and an energy efficiency improvement of 36.75x against the most efficient one. Moreover, we integrate it with an FPGA-based similarity metric from literature to complete a rigid image registration step (i.e., transformation, interpolation, and similarity metric) attaining a speedup of 18.6x against the top performing GPU-based solution, while being 36.11x more efficient than the most energy efficient one."	Giuseppe Sorrentino et al.	Politecnico di Milano	Paper
Stream-HLS: Towards Automatic Dataflow Acceleration Abstract "High-level synthesis (HLS) has enabled the rapid development of custom hardware circuits for many software applications. However, developing high-performance hardware circuits using HLS is still a non-trivial task requiring expertise in hardware design. Further, the hardware design space, especially for multi-kernel applications, grows exponentially. Therefore, several HLS automation and abstraction frameworks have been proposed recently, but many issues remain unresolved. These issues include: 1) relying mainly on hardware directives (pragmas) to apply hardware optimizations without exploring loop scheduling opportunities. 2) targeting single-kernel applications only. 3) lacking automatic and/or global design space exploration. 4) missing critical hardware optimizations, such as graph-level pipelining for multi-kernel applications. To address these challenges, we propose a novel methodology and framework on top of the popular multi-level intermediate representation (MLIR) infrastructure called Stream-HLS. Our framework takes a C/C++ or PyTorch software code and automatically generates an optimized dataflow architecture along with host code for field-programmable gate arrays (FPGAs). To achieve this, we developed an accurate analytical performance model for global scheduling and optimization of dataflow architectures. Stream-HLS is evaluated using various standard HLS benchmarks and real-world benchmarks from transformer models, convolution neural networks, and multilayer perceptrons. Stream-HLS designs outperform the designs of prior state-of-the-art automation frameworks and manually-optimized designs of abstraction frameworks by up to 79.43× and 10.62× geometric means respectively"	Suhail Basalama et al.	UCLA	Paper GitHub
StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs Abstract "Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve efficiency, existing approaches struggle with inter-kernel correlations, external memory access management, and buffer optimization. In this work, we propose StreamTensor, a compiler framework that automatically constructs and optimizes stream-based dataflow accelerators. StreamTensor introduces a novel iterative tensor type system to explicitly encode stream layouts, enabling seamless kernel fusion, buffer allocation, and memory optimization. By systematically exploring three hierarchical design spaces, including tensor tiling, kernel fusion, and resource allocation, StreamTensor balances computational intensity, memory efficiency, and data streaming to maximize performance. Based on FPGA evaluations on Large Language Models (LLM), StreamTensor achieves up to 0.76x and 0.64x lower latency compared to the state-of-the-art FPGA LLM accelerators and GPUs, and up to 1.99x higher energy efficiency compared to GPUs, making it a promising approach for scalable dataflow-based deep learning acceleration."	Hanchen Ye et al.	UIUC	Paper
SwiftSpatial: Spatial Joins on Modern Hardware Abstract "Spatial joins are among the most time-consuming spatial queries, remaining costly even in parallel and distributed systems. In this paper, we explore hardware acceleration for spatial joins by proposing SwiftSpatial, an FPGA-based accelerator that can be deployed in data centers and at the edge. SwiftSpatial contains multiple high-performance join units with innovative hybrid parallelism, several efficient memory management units, and an extensible on-chip join scheduler that supports the popular R-tree synchronous traversal and partition-based spatial-merge (PBSM) algorithms. Benchmarked against various CPU and GPU-based spatial data processing systems, SwiftSpatial demonstrates a latency reduction of up to 41.03x relative to the best-performing baseline, while requiring 6.16x less power. The performance and energy efficiency of SwiftSpatial demonstrate its potential to be used in a variety of configurations (e.g., as an accelerator, near storage, in-network) as well as on different devices (e.g., data centers where FPGAs are widely available or mobile devices, which also contain FPGAs for specialized processing)."	Wenqi Jiang et al.	ETH Zurich	Paper
Towards a Methodology to Leverage Alveo Versal System Usability And Parallelization Abstract "Versal system arises to deliver an energy-efficient on-field reconfigurable fabric with the parallel floating-point computations of AI Engines (AIE). Despite the success of various implementations, the steep learning curve and the lack of a well-defined programming model make them inaccessible to non-experts. Therefore, we present a methodology and a parallelization strategy automatized through an automation framework. Applying the proposed methodologies to the case of similarity metrics computations, we attain over 13x speedup with respect to a single AI Engine-based solution, leading to 6x speedup and 8x energy efficiency improvement against software solutions."	Federico Mansutti et al.	Politecnico di Milano	Paper GitHub
VOTED: Versal optimization toolkit for education and heterogeneous systems development Abstract "Despite classic educational approaches proving their effectiveness for classic hardware acceleration, they suffer novel heterogeneous systems such as Versal, requiring a deeper system-level awareness. Versal devices merge FPGA on-field programmability with the performance of hardened VLIW processors, namely AI Engine, at the cost of facing novel challenges when integrating these two different layers. Given the system complexity of such devices and the low-level knowledge required to leverage them, Versal potentialities have yet to be fully exploited. Therefore, we present VOTED, a Versal Optimization Toolkit for Education and Heterogeneous Systems Development, that guides users of any expertise in discovering, using, and optimizing Versal-based applications. VOTED proved to be effective, leading five different groups of students, at their first experience with heterogeneous system design, at devising quite complex Versal-based applications capable of reaching the final stages of a design competition and even winning it with four months of work"	Giuseppe Sorrentino et al.	Politecnico di Milano	Paper GitHub

2024

Title & Abstract	Author(s)	Institution	Link
Acceleration of Graph Neural Networks with Heterogenous Accelerators Architecture Abstract "Graph Neural Networks (GNNs) have been used to solve complex problems of drug discovery, social media analysis, etc. Meanwhile, GPUs are becoming dominating accelerators to improve deep neural network performance. However, due to the characteristics of graph data, it is challenging to accelerate GNN-type workloads with GPUs alone. GraphSAGE is one representative GNN workload that uses sampling to improve GNN learning efficiency. Profiling the GraphSAGE using PyG library reveals that the sampling stage on the CPU is the bottleneck. Hence, we propose a heterogeneous system architecture solution with the sampling algorithm accelerated on customizable accelerators (FPGA), and feed sampled data into GPU training through a PCIe Peer-to-Peer (P2P) communication flow. With FPGA acceleration, for the sampling stage alone, we achieve a speed-up of 2.38× to 8.55× compared with sampling on CPU. For end-to-end latency, compared with the traditional flow, we achieve a speed-up of 1.24× to 1.99 ×"	Kaiwen Cao et al.	UIUC	Paper
ACCL+: an FPGA-Based Collective Engine for Distributed Applications Abstract "FPGAs are increasingly prevalent in cloud deployments, serving as SmartNICs or network-attached accelerators. To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source, FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the entire design. We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and its competitive performance for CPU applications. We showcase ACCL+'s dual role with two use cases: as a collective offload engine to distribute CPU-based vector-matrix multiplication, and as a component in designing fully FPGA-based distributed deep-learning recommendation inference"	Zhenhao He et al.	ETH Zurich	Paper
An HTTP Server for FPGAs Abstract "The computer architecture landscape is being reshaped by the new opportunities, challenges, and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this article, we present Strega, an open source1 light-weight Hypertext Transfer Protocol (HTTP) server that enables crucial functionality such as FPGA-accelerated functions being called through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single Strega node sustains a throughput of 1.7 M HTTP requests per second with an end-to-end latency as low as 16, μs, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications"	Fabio Maschi et al.	ETH Zurich	Paper
A survey of graph convolutional networks (GCNs) in FPGA-based accelerators Abstract "This survey overviews recent Graph Convolutional Networks (GCN) advancements, highlighting their growing significance across various tasks and applications. It underscores the need for efficient hardware architectures to support the widespread adoption and development of GCNs, particularly focusing on platforms like FPGAs known for their performance and energy efficiency. This survey also outlines the challenges in deploying GCNs on hardware accelerators and discusses recent efforts to enhance efficiency. It encompasses a detailed review of the mathematical background of GCNs behind inference and training, a comprehensive review of recent works and architectures, and a discussion on performance considerations and future directions."	Marco Procaccini et al.	University of Siena	Paper
BitBlender: Scalable Bloom Filter Acceleration on FPGAs with Dynamic Scheduling Abstract "The Bloom filter is one of the most widely used data structures in big data analytics to efficiently filter out vast amounts of noisy data. Unfortunately, prior Bloom filter designs only focus on single-input-stream acceleration, and can no longer match the increasing data rates offered by modern networks. To support large Bloom filters with low false-positive rate and high throughput, we present BitBlender, a configurable and scalable multi-input-stream Bloom filter acceleration framework in HLS. To effectively share one large bit-vector on chip among all streams, we design and implement the novel arbiter and unshuffle modules to dynamically schedule conflicting accesses to execute sequentially and non-conflicting accesses to execute in parallel. To support different user configurations of the Bloom filter, we also develop an automation flow, together with an accurate performance estimator, to automatically generate the best BitBlender design. Experimental results show that, on the AMD/Xilinx Alveo U280 FPGA, BitBlender achieves a throughput up to 2,194 MQueries/s (i.e., 8.8 GB/s) for a 96Mb bit-vector with 0.01% false-positive rate. It achieves up to 10.4x speedup over a 24-thread CPU implementation and up to 4.9x speedup over a naively-duplicated multi-stream FPGA design"	Kenneth Liu et al.	Simon Fraser University	Paper GitHub
DETECTive: Machine Learning-driven Automatic Test Pattern Prediction for Faults in Digital Circuits Abstract "Due to the continuous technology scaling and the ever-increasing complexity and size of the hardware designs, manufacturing defects have become a key obstacle in meeting end-user demand. Despite decades of research, traditional test-generation techniques often struggle to scale to massive and complex designs. Such scalability issues stem from the numerous backtracking the traditional test generation techniques perform before converging to a test pattern. In this work, we present DETECTive that leverages deep learning on graphs to learn fault characteristics and predict test pattern(s) to expose faults without requiring backtracking. DETECTive is trained on small circuits, and its learned knowledge is transferable to predict test patterns for circuits that contain up to 29 × more gates than the training circuits. Since DETECTive avoids backtracking completely, it can predict test patterns up to 15 × faster than academic tools and up to 2 × faster than commercial tools. DETECTive achieves up to 100% pattern accuracy on synthetic designs and up to 95% test pattern accuracy on realistic designs. To our knowledge, DETECTive is the first to leverage deep learning to predict test patterns for digital hardware designs that can complement the traditional test generation techniques for faster design closure."	Vincenzo Petrolo et al.	UIUC	Paper
Developing a BLAS library for the AMD AI Engine Abstract "Spatial (dataflow) computer architectures can mitigate the control and performance overhead of classical von Neumann architectures such as traditional CPUs. Driven by the popularity of Machine Learning (ML) workloads, spatial devices are being marketed as ML inference accelerators. Despite providing a rich software ecosystem for ML practitioners, their adoption in other scientific domains is hindered by the steep learning curve and lack of reusable software, which makes them inaccessible to non-experts. We present our ongoing project AIEBLAS, an open-source, expandable implementation of Basic Linear Algebra Routines (BLAS) for the AMD AI Engine. Numerical routines are designed to be easily reusable, customized, and composed in dataflow programs, leveraging the characteristics of the targeted device without requiring the user to deeply understand the underlying hardware and programming model"	Tristan Laans et al.	Vrije Universiteit Amsterdam	Paper
Federated Transformer: Multi-Party Vertical Federated Learning on Practical Fuzzily Linked Data Abstract "Federated Learning (FL) is an evolving paradigm that enables multiple parties to collaboratively train models without sharing raw data. Among its variants, Vertical Federated Learning (VFL) is particularly relevant in real-world, cross-organizational collaborations, where distinct features of a shared instance group are contributed by different parties. In these scenarios, parties are often linked using fuzzy identifiers, leading to a common practice termed as multi-party fuzzy VFL. Existing models generally address either multi-party VFL or fuzzy VFL between two parties. Extending these models to practical multi-party fuzzy VFL typically results in significant performance degradation and increased costs for maintaining privacy. To overcome these limitations, we introduce the Federated Transformer (FeT), a novel framework that supports multi-party VFL with fuzzy identifiers. FeT innovatively encodes these identifiers into data representations and employs a transformer architecture distributed across different parties, incorporating three new techniques to enhance performance. Furthermore, we have developed a multi-party privacy framework for VFL that integrates differential privacy with secure multi-party computation, effectively protecting local representations while minimizing associated utility costs. Our experiments demonstrate that the FeT surpasses the baseline models by up to 46% in terms of accuracy when scaled to 50 parties. Additionally, in two-party fuzzy VFL settings, FeT also shows improved performance and privacy over cutting-edge VFL modelstings"	Zhaomin Wu et al.	NUS	Paper GitHub
FlexiMem: Modular and Reconfigurable Virtual Memory Abstract "Shared Virtual Memory (SVM) is a mechanism that allows host-side applications and FPGA accelerators to access the same virtual address space. It enables accelerating algorithms with unpredictable memory access patterns by making transparent pointer sharing possible. Even for applications with predictable memory access patterns, SVM helps by eliminating manual data movement. FlexiMem is a customizable SVM system that uses FPGA-addressable memory resources to store virtual memory pages, rather than directly accessing the host memory, to achieve high throughput and low latency. On the FPGA side, FlexiMem features a highly-flexible interconnect that, for the first time, allows for configuring address and payload data paths independently for SVM systems. It supports multiple master and slave devices sharing the same address space, such as accelerators issuing many memory requests in parallel to multiple memory banks. With our interconnect design, we provide numerous specialization dimensions to optimize the SVM system for a given workload. For example, irregular applications with short accesses to memory require an address translation for each data transfer. Such applications can take advantage of a highly parallelized address data path and an increased number of TLBs working in parallel. In contrast, regular and bursting applications transfer a small number of address packets per many data packets, and they can tolerate a lightweight address data path with a single translation unit. FlexiMem also provides address translation units with reconfigurable capacities and page sizes to be tailored to the needs of the application. On the host side, FlexiMem leverages the Linux userfaultfd API and vendor-provided IPs and drivers for automatic data movement initiated by software. Blocks of memory allocated by the FlexiMem API can be passed freely to the FPGA or other host-side libraries, without requiring any kind of explicit data movement. We evaluate several experimental setups with various FlexiMem configurations to showcase the effect of customizability on performanc"	Canberk Sönmez et al.	EPFL	Paper
FSSD: FPGA-Based Emulator for SSDs Abstract "Solid State Drives (SSDs) have become increasingly popular due to their superior access latency and bandwidth compared to Hard Disk Drives (HDDs). However, to fully understand the impact of SSD design and microarchitecture on end-to-end application performance, researchers need to move beyond treating SSDs as black-box components. Unfortunately, purchasing multiple SSDs for research is expensive and ineffective since the underlying microarchitecture is still unknown to the system designer. While simulators have become the most popular method for studying SSDs, existing software-based simulators lack real data transfers and cannot simulate the latency from NVMe and PCIe interfaces. Additionally, simulating the entire SSD using software codes is time-consuming and limits the number of experiments that can be run in a reasonable amount of time. To address these issues, we present FSSD, an FPGA-based emulation system that models the latency and access patterns of an actual NVMe SSD. FSSD takes advantage of the flexibility of an FPGA, enabling users to customize SSD microarchitecture features and explore the design space for data-intensive applications. FSSD can be interacting with real operating systems, instead of relying on Virtual Machines like most other software simulators do. Evaluations show that FSSD provides over 1000x speedup compared to software-based simulation using the SimpleSSD simulator. The ability to customize SSD parameters and emulate NAND latency with high precision makes FSSD a valuable platform for SSD research and development. FSSD is also open-sourced to benefit the research community."	Luyang Yu et al.	UIUC	Paper GitHub
HAL: Hardware-assisted Load Balancing for Energy-efficient SNIC-Host Cooperative Computing Abstract "A typical SmartNIC (SNIC) integrates a processor comprising Arm CPU and accelerators with a conventional NIC. The processor is designed to energy-efficiently execute network functions frequently used by datacenter applications. With such a processor, the SNIC has promised to notably improve the system-wide energy efficiency of datacenter servers. Nevertheless, the latest trend of integrating accelerators into server CPUs for these functions sparks a question on the SNIC processor’s superiority over a host processor (i.e., server CPU with accelerators) in system-wide energy efficiency, especially under given tail latency constraints. Answering this question, we first take an Intel Xeon processor, integrated with various accelerators (e.g., QuickAssist Technology), as a host processor, and then compare it to an NVIDIA BlueField-2 SNIC processor. This uncovers that (1) the host accelerator, coupled with a more powerful memory subsystem, can outperform the SNIC accelerator, and (2) the SNIC processor can improve system-wide energy efficiency only at low packet rates for most functions under tail latency constraints. To provide high system-wide energy efficiency without compromising tail latency at any packet rates, we propose HAL, consisting of a hardware-based load balancer and an intelligent load balancing policy implemented inside the SNIC. When HAL determines that the SNIC processor cannot efficiently process a given function beyond a specific packet rate, it limits the rate of packets to the SNIC processor and lets the host processor handle the excess. We implement a HAL-enabled SNIC with a commodity FPGA and a BlueField-2 SNIC, plug it into a commodity server, and run 10 popular network functions. Our evaluation shows that HAL can improve the system-wide energy efficiency and throughput of the server running these functions by 31% and 10%, respectively, without notably increasing the tail latency"	Jinghan Huang et al.	UIUC, Duke University AND IBM	Paper
HashGrid: : An optimized architecture for accelerating graph computing on FPGAs Abstract "Large-scale graph processing poses challenges due to its size and irregular memory access patterns, causing performance degradation in common architectures, such as CPUs and GPUs. Recent research includes accelerating graph processing using Field Programmable Gate Arrays (FPGAs). FPGAs can provide very efficient acceleration thanks to reconfigurable on-chip resources. Although limited, these resources offer a larger design space than CPUs and GPUs. We propose an approach in which data are preprocessed in small chunks with an optimized graph partitioning technique for execution on FPGA accelerators. The chunks, located on the host, are streamed directly into a customized memory layer implemented in the FPGA, which is tightly coupled with the processing elements responsible for the graph algorithm execution. This improves application memory access latency, which is crucial in large-sale graph computing performance. This work presents a hardware design that, combined with graph partitioning, enables us to achieve high-performance and potentially scalable handling of large graphs (i.e., graphs with millions of vertices and billions of edges in current scenarios) while using popular graph algorithms. The proposed framework accelerates performance 56 times compared with CPU (multicore with 16 logical cores in our reference experiments), 2.5 times and 4 times faster compared to state-of-the-art FPGA and GPU solutions (FPGA has 15 compute units, and GPU reference has 128 streaming-multiprocessors in our experiments), respectively, when using the PageRank algorithm. For the Single-Source-Shortest-Past (SSSP) algorithm, we achieve speedups of up to 65x, 26x, and 18x compared to CPU, GPU, and FPGA works, respectively. Lastly, in the context of the Weakly Connected Component (WCC) algorithm, our framework achieves a speedup of up to 403 times compared to the CPU, 7.4x against the GPU, and it is faster than the FPGA alternatives up to 10.3x."	Amin Sahebi et al.	University of Siena	Paper
HIDA: A Hierarchical Dataflow Compiler for High-Level Synthesis Abstract "Dataflow architectures are growing in popularity due to their potential to mitigate the challenges posed by the memory wall inherent to the Von Neumann architecture. At the same time, high-level synthesis (HLS) has demonstrated its efficacy as a design methodology for generating efficient dataflow architectures within a short development cycle. However, existing HLS tools rely on developers to explore the vast dataflow design space, ultimately leading to suboptimal designs. This phenomenon is especially concerning as the size of the HLS design grows. To tackle these challenges, we introduce HIDA1, a new scalable and hierarchical HLS framework that can systematically convert an algorithmic description into a dataflow implementation on hardware. We first propose a collection of efficient and versatile dataflow representations for modeling the hierarchical dataflow structure. Capitalizing on these representations, we develop an automated optimizer that decomposes the dataflow optimization problem into multiple levels based on the inherent dataflow hierarchy. Using FPGAs as an evaluation platform, working with a set of neural networks modeled in PyTorch, HIDA achieves up to 8.54× higher throughput compared to the state-of-the-art (SOTA) HLS optimization tool. Furthermore, despite being fully automated and able to handle various applications, HIDA achieves 1.29× higher throughput over the SOTA RTL-based neural network accelerators on an FPGA"	Hanchen Ye et al.	UIUC	Paper
HiHiSpMV: Sparse Matrix Vector Multiplication with Hierarchical Row Reductions on FPGAs with High Bandwidth Memory Abstract "The multiplication of a sparse matrix with a dense vector is a vital operation in linear algebra, with applications in numerous contexts. After earlier research on FPGA acceleration of this operation had shown the potential to achieve a high bandwidth efficiency, this workload has received renewed attention with the introduction of high bandwidth memory on FPGA platforms. However, previous designs fell short of scaling to the full bandwidth potential of current FPGA platforms with high bandwidth memory. In this work, we present HiHiSpMV with a novel design approach around hierarchical accumulation, which allows us to overcome several limitations of the related work. Our design is completely implemented with high-level synthesis and compiled with Vitis, and instantiates 16 independent compute units, each processing up to 16 matrix elements per clock cycle. The reduction of these elements is performed without any latency or resource overhead and the subsequent accumulation uses the well-established shift-register design pattern. Due to the independent nature of compute units, our design can connect to all of the 32 high bandwidth memory pseudo-channels in 512-bit interface mode on an Alveo U280 FPGA board. In our tests, we reach up to 86% of the theoretical available bandwidth of this FPGA platform, enabling a computational throughput of up to 98 GFLOPS. This is about 1.5x faster than the peak throughput of the best related work in that regard, Serpens. On average, the throughput of HiHiSpMV is even 2.7x higher than Serpens"	Abdul Rehman Tareen et al.	Paderborn University	Paper
HLPerf: Demystifying the Performance of HLS-based Graph Neural Networks with Dataflow Architectures Abstract "The development of FPGA-based applications using HLS is fraught with performance pitfalls and large design space exploration times. These issues are exacerbated when the application is complicated and its performance is dependent on the input data set, as is often the case with graph neural network approaches to machine learning. Here, we introduce HLPerf, an open-source, simulation-based performance evaluation framework for dataflow architectures that both supports early exploration of the design space and shortens the performance evaluation cycle. We apply the methodology to GNNHLS, an HLS-based graph neural network benchmark containing 6 commonly used graph neural network models and 4 datasets with distinct topologies and scales. The results show that HLPerf achieves over 10,000x average simulation acceleration relative to RTL simulation and over 400x acceleration relative to state-of-the-art cycle-accurate tools at the cost of 7% mean error rate relative to actual FPGA implementation performance. This acceleration positions HLPerf as a viable component in the design cycle"	Chenfeng Zhao et al.	Washington University in St. Louis	Paper
Integrating Multi-FPGA Acceleration to OpenMP Distributed Computing Abstract "Designing high-performance scientific applications has become a time-consuming and complex task that requires developers to master multiple frameworks and toolchains. Although re-configurability and energy efficiency make FPGA a powerful accelerator, efficiently integrating multiple FPGAs into a distributed cluster is a complex and cumbersome task. Such complexity grows considerably when applications require partitioning execution among CPUs, GPUs, and FPGAs. This paper introduces FPGA offloading support to OpenMP cluster (OMPC), an OpenMP-only framework capable of transparently offloading computation across nodes in a cluster, which reduces developer effort and time to solution. In addition, OMPC enables true heterogeneity by allowing the programmer to assign program kernels to the most appropriate architecture (CPUs, GPUs, or FPGA), depending on their workload characteristics. This is achieved by adding only a few lines of standard OpenMP code to the application. The resulting framework was applied to the heterogeneous acceleration of an image recoloring application. Experimental results demonstrate speed-ups gains using different acceleration arrangements with CPU, GPU and FPGA. Measurements using Halstead metrics show that the proposed framework is faster to program. Furthermore, the solution enables transparently offloading OMPC communication tasks to multiple FPGAs, which results in speed-ups of up to 1.41x over the default communication mechanism (Message Passing Interface - MPI) on Task Bench, a synthetic benchmark for task parallelism"	Pedro Henrique Rosso et al.	UNICAMP	Paper
Invited: New Solutions on LLM Acceleration, Optimization, and Application Abstract "Large Language Models (LLMs) have revolutionized a wide range of applications with their strong human-like understanding and creativity. Due to the continuously growing model size and complexity, LLM training and deployment have shown significant challenges, which often results in extremely high computational and storage costs and energy consumption. In this paper, we discuss the recent advancements and research directions on (1) LLM algorithm-level acceleration, (2) LLM-hardware co-design for improved system efficiency, (3) LLM-to-accelerator compilation for customized LLM accelerators, and (4) LLM-aided design for HLS (High-Level Synthesis) functional verification. For each aspect, we present the background study, our proposed solutions, and future directions"	Yingbing Huang et al.	UIUC	Paper
LevelST: Stream-based Accelerator for Sparse Triangular Solver Abstract "Over the past decade, much progress has been made to advance the acceleration of sparse linear operators such as SpMM and SpMV on FPGAs. Nevertheless, few works have attempted to address sparse triangular solver (SpTRSV) acceleration, and the performance boost is limited. SpTRSV is an elementary linear operator for many numerical methods, such as the least-square method. These methods, among others, are widely used in various areas, such as physical simulation and signal processing. Therefore, accelerating SpTRSV is crucial. However, many challenges impede accelerating SpTRSV, including (1) resolving dependencies between elements during forward or backward substitutions, (2) random access and unbalanced workloads across memory channels due to sparsity, (3) latency incurred by off-chip memory access for large matrices or vectors, and (4) data reuse for an unpredictable data sharing pattern. To address these issues, we have designed LevelST, the first FPGA accelerator leveraging high bandwidth memory (HBM) for solving sparse triangular systems. LevelST features (1) algorithm-hardware co-design of stream-based dependency resolution with reduced off-chip data movement, (2) resource sharing that improves resource utilization to scale up the architecture, (3) index modulo scheduling to balance workload, and (4) selective data prefetching from off-chip memory. LevelST is prototyped on an AMD Xilinx U280 HBM FPGA and evaluated with 16 sparse triangular matrices. Compared with the NVIDIA V100 and RTX 3060 GPUs over the cuSPARSE library, LevelST achieves a 2.65x speedup and 9.82x higher energy efficiency than the best of the V100 GPU and RTX 3060 GPU"	Zifan He et al.	UCLA	Paper GitHub
Noctua 2 Supercomputer Abstract "Noctua 2 is a supercomputer operated at the Paderborn Center for Parallel Computing (PC2) at Paderborn University in Germany. Noctua 2 was inaugurated in 2022 and is an Atos BullSequana XH2000 system. It consists mainly of three node types: 1) CPU Compute nodes with AMD EPYC processors in different main memory configurations, 2) GPU nodes with NVIDIA A100 GPUs, and 3) FPGA nodes with Xilinx Alveo U280 and Intel Stratix 10 FPGA cards. While CPUs and GPUs are known off-the-shelf components in HPC systems, the operation of a large number of FPGA cards from different vendors and a dedicated FPGA-to-FPGA network are unique characteristics of Noctua 2. This paper describes in detail the overall setup of Noctua 2 and gives insights into the operation of the cluster from a hardware, software and facility perspective"	Carsten Bauer et al.	Paderborn University	Paper
MorphStream: Scalable Processing of Transactions over Streams Abstract "In the realm of transactional stream processing (TSP), the challenge lies in providing a unified execution model that seamlessly integrates transactional and stream-oriented capabilities. Existing TSP engines (TSPEs) largely employ non-adaptive scheduling techniques, leaving multicore parallelism underutilized due to intricate workload dependencies. We demonstrate MorphStream, a state-of-the-art TSPE built for unprecedented scalability on multicores. MorphStream distinguishes itself by employing an adaptive scheduling algorithm, explicitly designed to unlock the full potential of multicore architectures even under complex workload conditions. This enables MorphStream to make optimal trade-offs in performance metrics under varying workload characteristics. To enhance user engagement, the demonstration will showcase MorphStream's graphical user interface, specifically engineered to simplify the implementation and deployment of complex streaming applications while providing detailed and comprehensive performance monitoring and analytics for the job execution runtime."	Siqi Xiang et al.	Singapore University of Technology and Design	Paper
Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL Abstract "Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, firstly, by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. Finally, we use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks as well as the configurability of framework and network stack are crucial to achieve the best application performance with low latency communication"	Marius Meyer et al.	Paderborn University	Paper
P4-based In-Network Telemetry for FPGAs in the Open Cloud Testbed and FABRIC Abstract "In recent years, Field Programmable Gate Arrays (FPGAs) have gained prominence in cloud computing data centers, driven by their capacity to offload compute-intensive tasks and contribute to the ongoing trend of data center disaggregation, as well as their ability to be directly connected to the network. While FPGAs offer numerous advantages, they also pose challenges in terms of configuration, programmability, and monitoring, particularly in the absence of an operating system with essential features like the TCP/IP networking stack. This paper introduces an In-band Network Telemetry (INT) approach based on the P4 language for FPGA data plane programming. The goal is to facilitate monitoring and network performance analysis by providing one-way packet delay information. The approach is demonstrated in the Open Cloud Testbed (OCT) and FABRIC testbeds, both offering open access to the research community with greater FPGA availability than commercial clouds. The workflow enables researchers to create custom P4 programs and bitstreams for installation on FPGAs. The paper presents a multi-step approach allowing experimentation within the New England Research Cloud (NERC), testing in OCT, and final deployment in FABRIC, well-suited for one-way delay measurements due to synchronized clocks via GPS time signals. Contributions include the provision of a P4 workflow for FPGAs in a research cloud, a novel FPGA clock-based INT approach, and a comprehensive evaluation through simulation and experiments in the Open Cloud and FABRIC testbeds"	Sandeep Bal et al.	University of Massachusetts Amherst and Northeastern University	Paper
Realizing Network-Attached FPGAs in the Cloud Abstract "The Open Cloud Testbed (OCT) has the goal to provide bare-metal resources to the systems research community. In contrast to other infrastructures that offer such resources, the field-programmable gate arrays (FPGAs) in OCT expose their network interface directly to the experimenter, unlocking new possibilities for researchers focusing on cloud application acceleration and data center disaggregation. OCT provides two models for programming network-attached FPGAs. The first grants users direct access to the network interface. The features of this approach are demonstrated with examples that make use of User Datagram Protocol (UDP) and Transmission Control Protocol stacks. An example involving UDP is highlighted that facilitates a distributed machine learning (ML) application. The second model allows FPGA network interface programming via Programming Protocol-Independent Packet Processors, thus providing the FPGA with smart network interface card capabilities. In addition, this article outlines future research directions into network-attached FPGAs, including the security of such systems"	Miriam Leeser et al.	Northeastern University	Paper
SERI: High-Throughput Streaming Acceleration of Electron Repulsion Integral Computation in Quantum Chemistry using HBM-based FPGAs Abstract "The computation of electron repulsion integrals (ERIs) is a key component for quantum chemical methods. The intensive computation and bandwidth demand for ERI evaluation presents a significant challenge for quantum-mechanics-based atomistic simulations with hybrid density functional theory: due to the tens of trillions of ERI computations in each time step, practical applications are usually limited to thousands of atoms. In this work, we propose SERI, a high-throughput streaming accelerator for ERI computation on HBM-based FPGAs. In contrast to prior buffer-based designs, SERI proposes a novel streaming architecture to address the on-chip buffer limitation and the floorplanning challenge, and leverages the high-bandwidth memory to overcome the bandwidth bottleneck in prior designs. Moreover, to meet the varying computation, bandwidth, and floorplanning requirements between the 55 canonical quartet classes in ERI calculation, we design an automation tool, together with an accurate performance model, to automatically customize the architecture and floorplanning strategy for each canonical quartet class to maximize their throughput. Our performance evaluation on the AMD Alveo U280 FPGA board shows that, SERI achieves an average speedup of 9.80x over the previous best-performing FPGA design, a 3.21x speedup over a 64-core AMD EPYC 7713 CPU, and a 15.64x speedup over an Nvidia A40 GPU. It reaches a peak throughput of 23.8 GERIS (109 ERIs per second) on one Alveo U280 FPGA"	Philip Stachura et al.	Simon Fraser University	Paper GitHub
Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs Abstract "Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats (FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits and their relative comparison in terms of accuracy-hardware cost with integers remains unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integer based quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers"	Shivam Aggarwal et al.	NUS	Paper
SSDe: FPGA-Based SSD Express Emulation Framework Abstract "Solid State Drives (SSDs) are crucial for modern high-performance computing (HPC) and cloud services, providing superior speed and reliability. Two prominent approaches for mirroring SSD behavior are SSD emulation and simulation. Using specialized hardware, emulation outperforms simulation in speed and efficiency, especially in Hardware-in-the-Loop (HIL) testing and design space exploration (DSE) for cloud service systems. FPGA-based SSD emulation further enhances the accurate emulation of smartSSDs, devices consisting of SSD and FPGA components. Despite their merits, current SSD emulators encounter difficulties meeting these applications' demands. Many are limited by their underlying hardware platforms, impeding their capacity to emulate the full array of SSD behaviors. Some emulators are primarily devised for testing and validating new SSD designs rather than emulating existing, commercially-available SSDs. To address this, we introduce SSDe, an FPGA-based SSD Express emulator that accurately emulates the Non-Volatile Memory Express (NVMe) SSDs. SSDe, a next-generation SSD emulator built on top of FSSD [1] (a previous FPGA-based SSD emulator we developed), brings in additional capabilities such as energy modeling, garbage collection, and runtime reconfiguration. These features enhance its efficiency in managing diverse emulation tasks, making SSDe a more flexible and efficient emulator than both FSSD and traditional software simulators. SSDe achieves an average error of 14.1% in bandwidth and 18% in energy modeling, while its runtime parameter reconfiguration for DSE study takes less than 0.1 seconds (2.36 ×10^5 times faster than FSSD). SSDe stands as a new and robust SSD emulator, offering new opportunities for optimizing HPC and cloud infrastructures."	Yizhen Lu et al.	UIUC	Paper
SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration Abstract "With the increase in the computation intensity of the chip, the mismatch between computation layer shapes and the available computation resource significantly limits the utilization of the chip. Driven by this observation, prior works discuss spatial accelerators or dataflow architecture to maximize the throughput. However, using spatial accelerators could potentially increase the execution latency. In this work, we first systematically investigate two execution models: (1) sequentially (temporally) launch one monolithic accelerator, and (2) spatially launch multiple accelerators. From the observations, we find that there is a latency throughput tradeoff between these two execution models, and combining these two strategies together can give us a more efficient latency throughput Pareto front. To achieve this, we propose spatial sequential architecture (SSR) and SSR design automation framework to explore both strategies together when deploying deep learning inference. We use the 7nm AMD Versal ACAP VCK190 board to implement SSR accelerators for four end-to-end transformer-based deep learning models. SSR achieves average throughput gains of 2.53x, 35.71x, and 14.20x under different batch sizes compared to the 8nm Nvidia GPU A10G, 16nm AMD FPGAs ZCU102, and U250. The average energy efficiency gains are 8.51x, 6.75x, and 21.22x, respectively. Compared with the sequential-only solution and spatial-only solution on VCK190, our spatial-sequential-hybrid solutions achieve higher throughput under the same latency requirement and lower latency under the same throughput requirement. We also use SSR analytical models to demonstrate how to use SSR to optimize solutions on other computing platforms, e.g., 14nm Intel Stratix 10 NX"	Jinming Zhuang et al.	University of Pittsburgh, University of Maryland, University of Notre Dame	Paper GitHub
SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs Abstract "Efficiently supporting long context length is crucial for Transformer models. The quadratic complexity of the self-attention computation plagues traditional Transformers. Sliding window-based static sparse attention mitigates the problem by limiting the attention scope of the input tokens, reducing the theoretical complexity from quadratic to linear. Although the sparsity induced by window attention is highly structured, it does not align perfectly with the microarchitecture of the conventional accelerators, leading to suboptimal implementation. In response, we propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input. The proposed microarchitecture is based on a design that maximizes data reuse by using a combination of row-wise dataflow, kernel fusion optimization, and an input-stationary design considering the distributed memory and computation resources of FPGA. Consequently, it achieves up to 22x and 5.7x improvement in latency and energy efficiency compared to the baseline FPGA-based accelerator and 15x energy efficiency compared to GPU-based solution"	Zhenyu Bai et al.	NUS	Paper
S2TAR: Shared Secure Trusted Accelerators with Reconfiguration for Machine Learning in the Cloud Abstract "The demand for hardware accelerators such as Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs) is rapidly increasing due to growing Machine Learning (ML) workloads. As with any shared computing resources, there is a growing need to dynamically adjust and scale accelerator services while ensuring data privacy and confidentiality, especially in cloud environments. We propose a secure and reconfigurable TPU design with confidential computing support, achieved through a Trusted Execution Environment (TEE) framework tailored for reconfigurable TPU in a multi-tenant cloud. Our contributions include a novel TPU design based on switchbox-enabled systolic arrays to support rapid dynamic partitioning. We evaluate our TPU design with TEEs in shared environments, achieving up to 42.1 % higher performance for realistic ML inference workloads. Our remote attestation protocol extends to sub-device partitions, providing trustworthiness on a fine-grained level and decouples host and accelerator TEEs into separate attestation reports without degrading security guarantees. Our work presents a new TEE framework for secure and reconfigurable ML accelerators in a multi-tenant cloud environment."	Wei Ren et al.	UIUC	Paper
TAPA-CS: Enabling Scalable Accelerator Design on Distributed HBM-FPGAs Abstract "Despite the increasing adoption of FPGAs in compute clouds, there remains a significant gap in programming tools and abstractions which can leverage network-connected, cloud-scale, multi-die FPGAs to generate accelerators with high frequency and throughput. We propose TAPA-CS, a taskparallel dataflow programming framework which automatically partitions and compiles a large design across a cluster of FPGAs while achieving high frequency and throughput. TAPA-CS has three main contributions. First, it is an open-source framework which allows users to leverage virtually "unlimited" accelerator fabric, high-bandwidth memory (HBM), and on-chip memory. Second, given as input a large design, TAPA-CS automatically partitions the design to map to multiple FPGAs, while ensuring congestion control, resource balancing, and overlapping of communication and computation. Third, TAPA-CS couples coarse-grained floorplanning with interconnect pipelining at the inter- and intraFPGA levels to ensure high frequency. FPGAs in our multiFPGA testbed communicate through a high-speed 100Gbps Ethernet infrastructure. We have evaluated the performance of TAPA-CS on designs, including systolic-array based CNNs, graph processing workloads such as page rank, stencil applications, and KNN. On average, the 2-, 3-, and 4-FPGA designs are 2.1×, 3.2×, and 4.4× faster than the single FPGA baselines generated through Vitis HLS. TAPA-CS also achieves a frequency improvement between 11%-116% compared with Vitis HLS"	Neha Prakriya et al.	UCLA	Paper
VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks Abstract "Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field"	Zhaomin Wu et al.	NUS	Paper

2023

Title & Abstract	Author(s)	Institution	Link
A Finite-Difference Time-Domain (FDTD) solver with linearly scalable performance in an FPGA Abstract "This paper presents an FPGA cluster-based Finite-Difference Time-Domain (FDTD) accelerator that offers a linear speedup with the number of FPGAs participating in computation within the cluster. FDTD is a numeric method for simulating electromagnetic wave propagation and interactions with diverse materials and structures. Recent advancements in machine learning-based design and optimization techniques for photonic integrated circuits and microwave circuits, known as inverse design, have demonstrated remarkable success. Inverse design necessitates numerous FDTD simulations, and the high-performance FDTD accelerator enables rapid design automation, which is crucial for accelerating innovation. Our proposed accelerator comprises deeply pipelined FDTD cell update kernels that can traverse multiple FPGAs via high-speed optical links, effectively utilizing available resources across all FPGAs in a cluster. The architecture includes a head node and a flexible number of cascaded server nodes, together with custom cross-FPGA data routing kernels integrated into the "Open Cloud Testbed" (OCT) FPGA infrastructure to facilitate seamless data transfer. The proposed accelerator is developed on an existing platform, OCT FPGA. Our experiments reveal that, for a 4096x4096 2.5D FDTD simulation, each server node (Xilinx Alveo U280) can achieve 86.4 Giga-cells updates per second (GCUPS), and the head node can achieve 38.4 GCUPS. The overall speed with 4 server nodes is 38.4 + 4x86.4 = 384 GCUPS"	Zhenyu Xu et al.	University of Rhode Island	Paper
A Framework to Enable Runtime Programmable P4-enabled FPGAs in the Open Cloud Testbed Abstract "This paper presents a framework for cloud users who wish to specify their experiments in the P4 language and map them to FPGAs in the Open Cloud Testbed (OCT). OCT consists of P4-enabled FPGA nodes that are directly connected to the network via 100 gigabit Ethernet connections, and which support runtime reconfiguration. Cloud users can quickly prototype and deploy their P4 applications through our framework, which provides the necessary infrastructure including a network interface shell for the P4 logic. We have provided several examples using this framework that demonstrate designs running at the 100 GbE line rate with the support of runtime reconfiguration for P4 functions. By combining an existing network interface shell and P4 toolchain on FPGAs, we offer a framework that enables users to rapidly execute their P4 experiments in real time on FPGAs"	Zhaoyang Han et al.	Northeastern University	Paper GitHub
Accelerating Garbled Circuits in the Open Cloud Testbed with Multiple Network-Attached FPGAs Abstract "Field Programmable Gate Arrays are increasingly used in cloud computing to increase the run time performance of applications. For complex applications or applications that operate over large amounts of data, users may want to use more than one FPGA. The challenge is how to map and parallelize applications to a multi-FPGA cloud computing platform such that the problem is partitioned evenly over the FPGAs, memory resources are used effectively, communication is minimized, and speedup is maximized. In this research, we build a framework to map Garbled Circuit applications, an implementation of Secure Function Evaluation, to the Open Cloud Testbed, which has FPGA cards attached to computing nodes. The FPGAs are directly connected to 100 GbE switches and can communicate directly through the network; we use the Xilinx UDP stack for this. Preprocessing generates efficient memory allocation and partitioning maps and schedules executions to different FPGAs to minimize communication and maximize processing overlap. This framework achieves close to perfect speedup on a two-FPGA setup compared to a one-FPGA implementation, and can handle large examples that cannot fit on a single FPGA"	Kai Huang et al.	Google and Northeastern University	Paper
ACTS: A Near-Memory FPGA Graph Processing Framework Abstract "Over the past decade, much progress has been made to advance the acceleration of sparse linear operators such as SpMM and SpMV on FPGAs. Nevertheless, few works have attempted to address sparse triangular solver (SpTRSV) acceleration, and the performance boost is limited. SpTRSV is an elementary linear operator for many numerical methods, such as the least-square method. These methods, among others, are widely used in various areas, such as physical simulation and signal processing. Therefore, accelerating SpTRSV is crucial. However, many challenges impede accelerating SpTRSV, including (1) resolving dependencies between elements during forward or backward substitutions, (2) random access and unbalanced workloads across memory channels due to sparsity, (3) latency incurred by off-chip memory access for large matrices or vectors, and (4) data reuse for an unpredictable data sharing pattern. To address these issues, we have designed LevelST, the first FPGA accelerator leveraging high bandwidth memory (HBM) for solving sparse triangular systems. LevelST features (1) algorithm-hardware co-design of stream-based dependency resolution with reduced off-chip data movement, (2) resource sharing that improves resource utilization to scale up the architecture, (3) index modulo scheduling to balance workload, and (4) selective data prefetching from off-chip memory. LevelST is prototyped on an AMD Xilinx U280 HBM FPGA and evaluated with 16 sparse triangular matrices. Compared with the NVIDIA V100 and RTX 3060 GPUs over the cuSPARSE library, LevelST achieves a 2.65x speedup and 9.82x higher energy efficiency than the best of the V100 GPU and RTX 3060 GPU"	Wole Jaiyeoba et al.	UCLA	Paper
AMNES: Accelerating the computation of data correlation using FPGAs Abstract "A widely used approach to characterize input data in both databases and ML is computing the correlation between attributes. The operation is supported by all major database engines and ML platforms. However, it is an expensive operation as the number of attributes involved grows. To address the issue, in this paper we introduce AMNES, a stream analytics system offloading the correlation operator into an FPGA-based network interface card. AMNES processes data at network line rate and the design can be used in combination with smart storage or SmartNICs to implement near data or in-network data processing. AMNES design goes beyond matrix multiplication and offers a customized solution for correlation computation bypassing the CPU. Our experiments show that AMNES can sustain streams arriving at 100 Gbps over an RDMA network, while requiring only ten milliseconds to compute the correlation coefficients among 64 streams, an order of magnitude better than competing CPU or GPU designs"	Monica Chiosa et al.	ETH Zurich and AMD	Paper GitHub
AutoScaleDSE: A Scalable Design Space Exploration Engine for High-Level Synthesis Abstract "High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from the behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformation, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS based on the multi-level compiler infrastructure (MLIR). Finally, we identify problems that limit the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by a maximum of 59×. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12× for the benchmarks in the MachSuite and Rodinia set"	Hyegang Jun et al.	UIUC	Paper
Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver Abstract "The continued growth in the processing power of FPGAs coupled with high bandwidth memories (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient linear solver (CG). FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth and maintain double (FP64) precision accuracy. To tackle the three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed precision scheme to save bandwidth yet still achieve effective double precision quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reusing between on-chip modules to reduce unnecessary off-chip accesses and enable modules working in parallel for FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that compared to the Xilinx HPC product, the XcgSolver, Callipepla achieves a speedup of 3.94x, 3.36x higher throughput, and 2.94x better energy efficiency. Compared to an NVIDIA A100 GPU which has 4x the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34x higher energy efficiency"	Linghao Song et al.	UCLA and Ansys	Paper GitHub
CHARM: Composing Heterogeneous AcceleRators for Matrix Multiply on Versal ACAP Architecture Abstract "Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch of massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently towards different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate the system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, MLP, respectively, which obtain 5.40x, 32.51x, 1.00x and 1.00x throughput gains compared to one monolithic accelerator"	Jinming Zhuang et al.	University of Pittsburgh, UCLA, UIUC, AMD	Paper GitHub
Co-design Hardware and Algorithm for Vector Search Abstract "Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands for vector search systems surge, accelerated hardware offers a promising solution in the post-Moore's Law era. We introduce FANNS, an end-to-end and scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, FANNS automatically co-designs hardware and algorithm, subsequently generating the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. FANNS attains up to 23.0× and 37.2× speedup compared to FPGA and CPU baselines, respectively, and demonstrates superior scalability to GPUs, achieving 5.5× and 7.6× speedup in median and 95th percentile (P95) latency within an eight-accelerator configuration. The remarkable performance of FANNS lays a robust groundwork for future FPGA integration in data centers and AI supercomputers"	Wenqi Jiang et al.	ETH Zurich	Paper GitHub
Democratizing Domain-Specific Computing Abstract "Creating a programming environment and compilation flow that empowers programmers to create their own DSAs efficiently and affordably on FPGAs"	Yuze Chi et al.	UCLA	Paper
Distributed large-scale graph processing on FPGAs Abstract "This work proposes an FPGA processing engine that overlaps, hides and customises all data transfers so that the FPGA accelerator is fully utilised. This engine is integrated into a framework for using FPGA clusters and is able to use an offline partitioning method to facilitate the distribution of large-scale graphs. The proposed framework uses Hadoop at a higher level to map a graph to the underlying hardware platform. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host's file system and distribute to a lower layer of computation made of FPGAs. We show how graph partitioning combined with an FPGA architecture will lead to high performance, even when the graph has Millions of vertices and Billions of edges. In the case of the PageRank algorithm, widely used for ranking the importance of nodes in a graph, compared to state-of-the-art CPU and GPU solutions, our implementation is the fastest, achieving a speedup of 13 compared to 8 and 3 respectively. Moreover, in the case of the large-scale graphs, the GPU solution fails due to memory limitations while the CPU solution achieves a speedup of 12 compared to the 26x achieved by our FPGA solution. Other state-of-the-art FPGA solutions are 28 times slower than our proposed solution. When the size of a graph limits the performance of a single FPGA device, our performance model shows that using multi-FPGAs in a distributed system can further improve the performance by about 12x. This highlights our implementation efficiency for large datasets not fitting in the on-chip memory of a hardware device"	Amin Sahebi et al.	University of Siena, University of Florence, Imperial College London	Paper GitHub
Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication Abstract "Modern HPC faces new challenges with the slowing of Moore's Law and the end of Dennard Scaling. Traditional computing architectures can no longer be expected to drive today's HPC loads, as shown by the adoption of heterogeneous system design leveraging accelerators such as GPUs and TPUs. Recently, FPGAs have become viable candidates as HPC accelerators. These devices can accelerate workloads by replicating implemented compute units to enable task parallelism, overlapping computation between and within kernels to enable pipeline parallelism, and increasing data locality by sending data directly between compute units. While many solutions for inter-FPGA communication have been presented, these proposed designs generally rely on inter-FPGA networks, unique system setups, and/or the consumption of soft logic resources on the chip. In this paper, we propose an FPGA-aware MPI runtime that avoids such shortcomings. Our MPI implementation does not use any special system setup other than plugging FPGA accelerators into PCIe slots. All communication is orchestrated by the host, utilizing the PCIe interconnect and inter-host network to implement message passing. We propose advanced designs that address data movement challenges and reduce the need for explicit data movement between the device and host (staging) in FPGA applications. We achieve up to 50% reduction in latency for point-to-point transfers compared to application-level staging"	Nicholas Contini et al.	The Ohio State University	Paper
Exploring the Versal AI Engines for Accelerating Stencil-based Atmospheric Advection Simulation Abstract "AMD Xilinx's new Versal Adaptive Compute Acceleration Platform (ACAP) is an FPGA architecture combining reconfigurable fabric with other on-chip hardened compute resources. AI engines are one of these and, by operating in a highly vectorized manner, they provide significant raw compute that is potentially beneficial for a range of workloads including HPC simulation. However, this technology is still early-on, and as yet unproven for accelerating HPC codes, with a lack of benchmarking and best practice. This paper presents an experience report, exploring porting of the Piacsek and Williams (PW) advection scheme onto the Versal ACAP, using the chip's AI engines to accelerate the compute. A stencil-based algorithm, advection is commonplace in atmospheric modelling, including several Met Office codes who initially developed this scheme. Using this algorithm as a vehicle, we explore optimal approaches for structuring AI engine compute kernels and how best to interface the AI engines with programmable logic. Evaluating performance using a VCK5000 against non-AI engine FPGA configurations on the VCK5000 and Alveo U280, as well as a 24-core Xeon Platinum Cascade Lake CPU and Nvidia V100 GPU, we found that whilst the number of channels between the fabric and AI engines are a limitation, by leveraging the ACAP we can double performance compared to an Alveo U280"	Nick Brown et al.	The University of Edinburgh	Paper
Fortran High-Level Synthesis: Reducing the barriers to accelerating HPC codes on FPGAs Abstract "In recent years the use of FPGAs to accelerate scientific applications has grown, with numerous applications demonstrating the benefit of FPGAs for high performance workloads. However, whilst High Level Synthesis (HLS) has significantly lowered the barrier to entry in programming FPGAs by enabling programmers to use C++, a major challenge is that most often these codes are not originally written in C++. Instead, Fortran is the lingua franca of scientific computing and-so it requires a complex and time consuming initial step to convert into C++ even before considering the FPGA. In this paper we describe work enabling Fortran for AMD Xilinx FPGAs by connecting the LLVM Flang front end to AMD Xilinx's LLVM back end. This enables programmers to use Fortran as a first-class language for programming FPGAs, and as we demonstrate enjoy all the tuning and optimisation opportunities that HLS C++ provides. Furthermore, we demonstrate that certain language features of Fortran make it especially beneficial for programming FPGAs compared to C++. The result of this work is a lowering of the barrier to entry in using FPGAs for scientific computing, enabling programmers to leverage their existing codebase and language of choice on the FPGA directly"	Gabriel Rodriguez-Canal et al.	The University of Edinburgh	Paper GitHub
FSSD: FPGA-Based Emulator for SSDs Abstract "Solid State Drives (SSDs) have become increasingly popular due to their superior access latency and bandwidth compared to Hard Disk Drives (HDDs). However, to fully understand the impact of SSD design and microarchitecture on end-to-end application performance, researchers need to move beyond treating SSDs as black-box components. Unfortunately, purchasing multiple SSDs for research is expensive and ineffective since the underlying microarchitecture is still unknown to the system designer. While simulators have become the most popular method for studying SSDs, existing software-based simulators lack real data transfers and cannot simulate the latency from NVMe and PCIe interfaces. Additionally, simulating the entire SSD using software codes is time-consuming and limits the number of experiments that can be run in a reasonable amount of time. To address these issues, we present FSSD, an FPGA-based emulation system that models the latency and access patterns of an actual NVMe SSD. FSSD takes advantage of the flexibility of an FPGA, enabling users to customize SSD microarchitecture features and explore the design space for data-intensive applications. FSSD can be interacting with real operating systems, instead of relying on Virtual Machines like most other software simulators do. Evaluations show that FSSD provides over 1000x speedup compared to software-based simulation using the SimpleSSD simulator. The ability to customize SSD parameters and emulate NAND latency with high precision makes FSSD a valuable platform for SSD research and development. FSSD is also open-sourced to benefit the research community"	Luyang Yu et al.	UIUC	Paper
GNNHLS: Evaluating Graph Neural Network Inference via High-Level Synthesis Abstract "With the ever-growing popularity of Graph Neural Networks (GNNs), efficient GNN inference is gaining tremendous attention. Field-Programming Gate Arrays (FPGAs) are a promising execution platform due to their fine-grained parallelism, low-power consumption, reconfigurability, and concurrent execution. Even better, High-Level Synthesis (HLS) tools bridge the gap between the non-trivial FPGA development efforts and rapid emergence of new GNN models. In this paper, we propose GNNHLS, an open-source framework to comprehensively evaluate GNN inference acceleration on FPGAs via HLS, containing a software stack for data generation and baseline deployment, and FPGA implementations of 6 well-tuned GNN HLS kernels. We evaluate GNNHLS on 4 graph datasets with distinct topologies and scales. The results show that GNNHLS achieves up to 50.8x speedup and 423x energy reduction relative to the CPU baselines. Compared with the GPU baselines, GNNHLS achieves up to 5.16x speedup and 74.5x energy reduction"	Chenfeng Zhao et al.	Washington University in St. Louis	Paper
High Performance, Low Power Matrix Multiply Design on ACAP: from Architecture, Design Challenges and DSE Perspectives Abstract "As the increasing complexity of Neural Network(NN) models leads to high demands for computation, AMD introduces a heterogeneous programmable system-on-chip (SoC), i.e., Versal ACAP architectures featured with programmable logic(PL), CPUs, and dedicated AI engines (AIE) ASICs which has a theoretical throughput up to 6.4 TFLOPs for FP32, 25.6 TOPs for INT16 and 102.4 TOPs for INT8. However, the higher level of complexity makes it non-trivial to achieve the theoretical performance even for well-studied applications like matrix-matrix multiply. In this paper, we provide AutoMM, an automatic white-box framework that can systematically generate the design for MM accelerators on Versal which achieves 3.7 TFLOPs, 7.5 TOPs, and 28.2 TOPs for FP32, INT16, and INT8 data type respectively. Our designs are tested on board and achieve gains of 7.20x (FP32), 3.26x (INT16), 6.23x (INT8) energy efficiency than AMD U250, 2.32x (FP32) than Nvidia Jetson TX2, 1.06x (FP32), 1.70x (INT8) than Nvidia A100"	Jinming Zhuang et al.	University of Pittsburgh	Paper
LightRW: FPGA Accelerated Graph Dynamic Random Walks Abstract "Graph dynamic random walks (GDRWs) have recently emerged as a powerful paradigm for graph analytics and learning applications, including graph embedding and graph neural networks. Despite the fact that many existing studies optimize the performance of GDRWs on multi-core CPUs, massive random memory accesses and costly synchronizations cause severe resource under utilization, and the processing of GDRWs is usually the key performance bottleneck in many graph applications. This paper studies an alternative architecture, FPGA, to address these issues in GDRWs, as FPGA has the ability of hardware customization so that we are able to explore fine-grained pipeline execution and specialized memory access optimizations. Specifically, we propose LightRW, a novel FPGA-based accelerator for GDRWs. LightRW embraces a series of optimizations to enable fine-grained pipeline execution on the chip and to exploit the massive parallelism of FPGA while significantly reducing memory accesses. As current commonly used sampling methods in GDRWs do not efficiently support fine-grained pipeline execution, we develop a parallelized reservoir sampling method to sample multiple vertices per cycle for efficient pipeline execution. To address the random memory access issues, we propose a degree-aware configurable caching method that buffers hot vertices on-chip to alleviate random memory accesses and a dynamic burst access engine that efficiently retrieves neighbors. Experimental results show that our optimization techniques are able to improve the performance of GDRWs on FPGA significantly. Moreover, LightRW delivers up to 9.55x and 9.10x speedup over the state-of-the-art CPU-based MetaPath and Node2vec random walks, respectively"	Hongshi Tan et al.	NUS	Paper GitHub
Machine Learning Across Network-Connected FPGAs Abstract "FPGAs often cannot implement machine learning inference with high accuracy models due to significant storage and computing requirements. The corresponding hardware accelerators of such models are large designs which cannot be deployed on a single platform. In this research, we implement ResNet-50 with 4 bit precision for weights and 5 bit precision for activations, which has a good trade-off between precision and accuracy. We train ResNet-50 using the quantization-aware training library Brevitas and build a hardware accelerator with the FINN framework from AMD. We map the result to three FPGAs that communicate directly with one another over the network via the User Datagram Protocol (UDP). The multi-FPGA implementation is compared to a single FPGA ResNet-50 design with lower precision of 1 bit weights and 2 bit activations. While the latter can fit on a single FPGA, the former pays for higher accuracy with a three times increase in the required number of BRAM tiles and can only be deployed on multiple FPGAs. We show the difference in accuracy, resource utilization, and throughput for the designs deployed on AMD Alveo U280 data center accelerator cards available in the Open Cloud Testbed images/s"	Dana Diaconu et al.	Northeastern University	Paper
MESA: Microarchitecture Extensions for Spatial Architecture Generation Abstract "Modern heterogeneous CPUs incorporate hardware accelerators to enable domain-specialized execution and achieve improved efficiency. A well-known class among them, spatial accelerators, are designed with reconfigurability to accelerate a wide range of compute-heavy and data-parallel applications. Unlike CPU cores, however, they tend to require specialized compilers and software stacks, libraries, or languages to operate and cannot be utilized with ease by all applications. As a result, the accelerator's large pool of compute and memory resources sit wastefully idle when it is not explicitly programmed. Our goal is to dismantle this CPU-accelerator barrier by monitoring CPU threads for acceleration opportunities during execution and, if viable, dynamically reconfigure the accelerator to allow transparent offloading. We develop MESA (Microarchitecture Extensions for Spatial Architecture Generation), a hardware block on the CPU that translates machine code to build an accelerator configuration specialized for the running program. While such a dynamic translation/reconfiguration approach is challenging, it has a key advantage over ahead-of-time compilers: access to runtime information, revealing not only dynamic dependencies but also performance characteristics. MESA maintains a real-time performance model of the program mapped on the accelerator in the form of a spatial dataflow graph with nodes weighted by operation latency and edges weighted by data transfer latency. Features of this dataflow graph are continuously updated with runtime information captured by performance counters, allowing a feedback loop of optimization, reconfiguration, and acceleration. This performance model allows MESA to identify the accelerator's critical paths and pinpoint its bottlenecks, upon which we implement in hardware a data-driven instruction mapping algorithm that locally minimizes latency. Backed by a synthesized RTL implementation, we evaluate the feasibility of our microarchitectural solution with different accelerator configurations. Across the Rodinia benchmarks, results demonstrate an average 1.3× speedup in performance and 1.8× gain in energy efficiency against a multicore CPU baseline"	Dong Kai Wang et al.	UIUC	Paper GitHub
MorphStream: Adaptive Scheduling for Scalable Transactional Stream Processing on Multicores Abstract "Transactional stream processing engines (TSPEs) differ significantly in their designs, but all rely on non- adaptive scheduling strategies for processing concurrent state transactions. Subsequently, none exploit multicore parallelism to its full potential due to complex workload dependencies. This paper introduces MorphStream, which adopts a novel approach by decomposing scheduling strategies into three dimensions and then strives to make the right decision along each dimension, based on analyzing the decision trade-offs under varying workload characteristics. Compared to the state-of-the-art, MorphStream achieves up to 3.4 times higher throughput and 69.1% lower processing latency for handling real-world use cases with complex and dynamically changing workload dependencies"	Yancan Mao et al.	ETH Zurich	Paper GitHub
Multi-FPGA Designs and Scaling of HPC Challenge Benchmarks via MPI and Circuit-Switched Inter-FPGA Networks Abstract "Extension of the HPCC benchmark suite for FPGAs with multi-FPGA benchmarks and support of inter-FPGA communication"	Marius Meyer et al.	Paderborn University	Paper GitHub
Nimblock: Scheduling for Fine-grained FPGA Sharing through Virtualization Abstract "As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate finegrained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time- and space-multiplex the virtualized FPGA by introducing Nimblock. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores preemption as a scheduling parameter to dynamically change resource allocations, and automatically allocates resources to enable suitable parallelism for an application without additional user input. In our exploration, we evaluate five scheduling algorithms: a baseline, three existing algorithms, and our novel Nimblock algorithm. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA and evaluating on a set of real-world benchmarks. In our results, we achieve up to 5.7× lower average response times when compared to a no-sharing and no-virtualization scheduling algorithm and up to 2.1× average response time improvement over competitive scheduling algorithms that support sharing within our virtualization environment. We additionally demonstrate up to 49% fewer deadline violations and up to 2.6× lower tail response times when compared to other high-performance algorithms"	Meghna Mandava et al.	UIUC	Paper
Optimistic Data Parallelism for FPGA-Accelerated Sketching Abstract "Sketches are a popular approximation technique for large datasets and high-velocity data streams. While custom FPGA-based hardware has shown admirable throughput at sketching, the state-of-the-art exploits data parallelism by fully replicating resources and constructing independent summaries for every parallel input value. We consider this approach pessimistic, as it guarantees constant processing rates by provisioning resources for the worst case. We propose a novel optimistic sketching architecture for FPGAs that partitions a single sketch into multiple independent banks shared among all input values, thus significantly reducing resource consumption. However, skewed input data distributions can result in conflicting accesses to banks and impair the processing rate. To mitigate the effect of skew, we add mergers that exploit temporal locality by combining recent updates.Our evaluation shows that an optimistic architecture is feasible and reduces the utilization of critical FPGA resources proportionally to the number of parallel input values. We further show that FPGA accelerators provide up to 2.6𝑥 higher throughput than a recent CPU and GPU, while larger sketch sizes enabled by optimistic architectures improve accuracy by up to an order of magnitude in a realistic sketching application"	Martin Kiefer et al.	TU Berlin, DFKI, ITU Copenhagen	Paper GitHub
PASNet: Polynomial Architecture Search Framework for Two-party Computation-based Secure Neural Network Deployment Abstract "Two-party computation (2PC) is promising to enable privacy-preserving deep learning (DL). However, the 2PCbased privacy-preserving DL implementation comes with high comparison protocol overhead from the non-linear operators. This work presents PASNet, a novel systematic framework that enables low latency, high energy efficiency & accuracy, and security-guaranteed 2PC-DL by integrating the hardware latency of the cryptographic building block into the neural architecture search loss function. We develop a cryptographic hardware scheduler and the corresponding performance model for Field Programmable Gate Arrays (FPGA) as a case study. The experimental results demonstrate that our light-weighted model PASNet-A and heavily-weighted model PASNet-B achieve 63 ms and 228 ms latency on private inference on ImageNet, which are 147 and 40 times faster than the SOTA CryptGPU system, and achieve 70.54% & 78.79% accuracy and more than 1000 times higher energy efficiency"	Hongwu Peng et al.	University of Connecticut	Paper GitHub
Serverless FPGA: Work-In-Progress Abstract "In this short paper we investigate the combination of two emerging technologies: the tight provisioning requirements of Serverless computing and the acceleration potential of FPGAs. Serverless platforms suffer from container overheads, notably cold start latency, while having to adapt to Function-as-a-Service (FaaS) workloads. By exploring re-configurability of FPGAs and their acceleration power, we propose an innovative light-weight Serverless platform for FPGA-based FaaS applications which aims to reduce these overheads. In this study, we explore the feasibility of the idea by implementing key elements of such platform onto the FPGA. Our initial results show potential for acceleration in all aspects of function invocation"	Fabio Maschi et al.	ETH Zurich	Paper
SSDe: FPGA-Based SSD Express Emulation Framework Abstract "Solid State Drives (SSDs) are crucial for modern high-performance computing (HPC) and cloud services, providing superior speed and reliability. Two prominent approaches for mirroring SSD behavior are SSD emulation and simulation. Using specialized hardware, emulation outperforms simulation in speed and efficiency, especially in Hardware-in-the-Loop (HIL) testing and design space exploration (DSE) for cloud service systems. FPGA-based SSD emulation further enhances the accurate emulation of smartSSDs, devices consisting of SSD and FPGA components. Despite their merits, current SSD emulators encounter difficulties meeting these applications' demands. Many are limited by their underlying hardware platforms, impeding their capacity to emulate the full array of SSD behaviors. Some emulators are primarily devised for testing and validating new SSD designs rather than emulating existing, commercially-available SSDs. To address this, we introduce SSDe, an FPGA-based SSD Express emulator that accurately emulates the Non-Volatile Memory Express (NVMe) SSDs. SSDe, a next-generation SSD emulator built on top of FSSD [1] (a previous FPGA-based SSD emulator we developed), brings in additional capabilities such as energy modeling, garbage collection, and runtime reconfiguration. These features enhance its efficiency in managing diverse emulation tasks, making SSDe a more flexible and efficient emulator than both FSSD and traditional software simulators. SSDe achieves an average error of 14.1% in bandwidth and 18% in energy modeling, while its runtime parameter reconfiguration for DSE study takes less than 0.1 seconds (2.36 ×105 times faster than FSSD). SSDe stands as a new and robust SSD emulator, offering new opportunities for optimizing HPC and cloud infrastructures"	Yizhen Lu et al.	UIUC	Paper
The Difficult Balance Between Modern Hardware and Conventional CPUs Abstract "Research has demonstrated the potential of accelerators in a wide range of use cases. However, there is a growing imbalance between modern hardware and the CPUs that submit the workload. Recent studies of GPUs on real systems have shown that many servers are often needed per accelerator to generate a high enough load so the computing power is leveraged. This fact is often ignored in research, although it often determines the actual feasibility and overall efficiency of a deployment. In this paper, we conduct a detailed study of the possible configurations and overall cost efficiency of deploying an FPGA-based accelerator on a commercial search engine. First, we show that there are many possible configurations balancing the upstream system and the way the accelerator is configured. Of these configurations, not all of them are suitable in practice, even if they provide some of the highest throughput. Second, we analyse the cost of a deployment capable of sustaining the required workload of the commercial search engine. We examine deployments both on-premises and in the cloud with and without FPGAs and with different board models. The results show that, while FPGAs have the potential to significantly improve overall performance, the performance imbalance between their host CPUs and the FPGAs can make the deployments economically unattractive. These findings are intended to inform the development and deployment of accelerators by showing what is needed on the CPU side to make them effective and also to provide important insights into their end-to-end integration within existing systems"	Fabio Maschi et al.	ETH Zurich	Paper

2022

Title & Abstract	Author(s)	Institution	Link
Accelerating SSSP for Power-Law Graphs Abstract The single-source shortest path (SSSP) problem is one of the most important and well-studied graph problems widely used in many application domains, such as road navigation, neural image reconstruction, and social network analysis. Although we have known various SSSP algorithms for decades, implementing one for large scale power-law graphs efficiently is still highly challenging today, because ① a work-efficient SSSP algorithm requires priority-order traversal of graph data, ② the priority queue needs to be scalable both in throughput and capacity, and ③ priority-order traversal requires extensive random memory accesses on graph data. In this paper, we present SPLAG to accelerate SSSP for powerlaw graphs on FPGAs. SPLAG uses a coarse-grained priority queue (CGPQ) to enable high-throughput priority-order graph traversal with a large frontier. To mitigate the high-volume random accesses, SPLAG employs a customized vertex cache (CVC) to reduce off-chip memory access and improve the throughput to read and update vertex data. Experimental results on various synthetic and real world datasets show up to a 4.9× speedup over state-of-the-art SSSP accelerators, a 2.6× speedup over 32-thread CPU running at 4.4 GHz, and a 0.9× speedup over an A100 GPU that has 4.1× power budget and 3.4× HBM bandwidth. Such a high performance would place SPLAG in the 14th position of the Graph 500 benchmark for data intensive applications (the highest using a single FPGA) with only a 45 W power budget. SPLAG is written in high-level synthesis C++ and is fully parameterized, which means it can be easily ported to various different FPGAs with different configurations.	Yuze Chi et al.	UCLA	Paper
AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators Abstract Adopting FPGA as an accelerator in datacenters is becoming mainstream for customized computing, but the fact that FPGAs are hard to program creates a steep learning curve for software programmers. Even with the help of high-level synthesis (HLS), accelerator designers still have to manually perform code reconstruction and cumbersome parameter tuning to achieve optimal performance. While many learning models have been leveraged by existing work to automate the design of efficient accelerators, the unpredictability of modern HLS tools becomes a major obstacle for them to maintain high accuracy. To address this problem, we propose an automated DSE framework—AutoDSE—that leverages a bottleneck-guided coordinate optimizer to systematically find a better design point. AutoDSE detects the bottleneck of the design in each step and focuses on high-impact parameters to overcome it. The experimental results show that AutoDSE is able to identify the design point that achieves, on the geometric mean, 19.9× speedup over one CPU core for MachSuite and Rodinia benchmarks. Compared to the manually optimized HLS vision kernels in Xilinx Vitis libraries, AutoDSE can reduce their optimization pragmas by 26.38× while achieving similar performance. With less than one optimization pragma per design on average, we are making progress towards democratizing customizable computing by enabling software programmers to design efficient FPGA accelerators.	Atefeh Sohrabizadeh et al.	UCLA	Paper
Automated Accelerator Optimization Aided by Graph Neural Networks Abstract Using High-Level Synthesis (HLS), the hardware designers must describe only a high-level behavioral flow of the design. However, it still can take weeks to develop a high-performance architecture mainly because there are many design choices at a higher level to explore. Besides, it takes several minutes to hours to evaluate the design with the HLS tool. To solve this problem, we model the HLS tool with a graph neural network that is trained to be used for a wide range of applications. The experimental results demonstrate that our model can estimate the quality of design in milliseconds with high accuracy, resulting in up to 79X speedup (with an average of 48X) for optimizing the design compared to the previous state-of-the-art work relying on the HLS tool.	Atefeh Sohrabizadeh et al.	UCLA	Paper
Enzian: An Open, General, CPU/FPGA Platform for Systems Software Research Abstract Hybrid computing platforms, comprising CPU cores and FPGA logic, are increasingly used for accelerating data-intensive workloads in cloud deployments, and are a growing topic of interest in systems research. However, from a research perspective, existing hardware platforms are limited: they are often optimized for concrete, narrow use-cases and, therefore lack the flexibility needed to explore other applications and configurations. We show that a research group can design and build a more general, open, and affordable hardware platform for hybrid systems research. The platform, Enzian, is capable of duplicating the functionality of existing CPU/FPGA systems with comparable performance but in an open, flexible system. It couples a large FPGA with a server-class CPU in an asymmetric cache-coherent NUMA system. Enzian also enables research not possible with existing hybrid platforms, through explicit access to coherence messages, extensive thermal and power instrumentation, and an open, programmable baseboard management processor. Enzian is already being used in multiple projects, is open source (both hardware and software), and available for remote use. We present the design principles of Enzian, the challenges in building it, and evaluate it with a range of existing research use-cases alongside other, more specialized platforms, as well as demonstrating research not possible on existing platforms.	David Cock et al.	ETH Zurich	Paper
Farview: Disaggregated Memory with Operator Off-loading for Database Engines Abstract Cloud deployments disaggregate storage from compute, providing more flexibility to both the storage and compute layers. In this paper, we explore disaggregation by taking it one step further and applying it to memory (DRAM). Disaggregated memory uses network attached DRAM as a way to decouple memory from CPU. In the context of databases, such a design offers significant advantages in terms of making a larger memory capacity available as a central pool to a collection of smaller processing nodes. To explore these possibilities, we have implemented Farview, a disaggregated memory solution for databases, operating as a remote buffer cache with operator offloading capabilities. Farview is implemented as an FPGA-based smart NIC making DRAM available as a disaggregated, network attached memory pool capable of performing data processing at line rate over data streams to/from disaggregated memory. Farview supports query offloading using operators such as selection, projection, aggregation, regular expression matching and encryption. In this paper we focus on analytical queries and demonstrate the viability of the idea through an extensive experimental evaluation of Farview under different workloads. Farview is competitive with a local buffer cache solution for all the workloads and outperforms it in a number of cases, proving that a smart disaggregated memory can be a viable alternative for databases deployed in cloud environments.	Dario Korolija et al.	ETH Zurich	Paper
FlexCNN: An End-to-End Framework for Composing CNN Accelerators on FPGA Abstract With reduced data reuse and parallelism, recent convolutional neural networks (CNNs) create new challenges for FPGA acceleration. Systolic arrays (SAs) are efficient, scalable architectures for convolutional layers, but without proper optimizations, their efficiency drops dramatically for reasons: (1) the different dimensions within same-type layers, (2) the different convolution layers especially transposed and dilated convolutions, and (3) CNN’s complex dataflow graph. Furthermore, significant overheads arise when integrating FPGAs into machine learning frameworks. Therefore, we present a flexible, composable architecture called FlexCNN, which delivers high computation efficiency by employing dynamic tiling, layer fusion, and data layout optimizations. Additionally, we implement a novel versatile SA to process normal, transposed, and dilated convolutions efficiently. FlexCNN also uses a fully pipelined software-hardware integration that alleviates the software overheads. Moreover, with an automated compilation flow, FlexCNN takes a CNN in the ONNX representation, performs a design space exploration, and generates an FPGA accelerator. The framework is tested using three complex CNNs: OpenPose, U-Net, and E-Net. The architecture optimizations achieve 2.3× performance improvement. Compared to a standard SA, the versatile SA achieves close-to-ideal speedups, with up to 5.98× and 13.42× for transposed and dilated convolutions, with a 6% average area overhead. The pipelined integration leads to a 5× speedup for OpenPose.	Suhail Basalam et al.	UCLA	Paper
FPGA Acceleration of Pre-Alignment Filters for Short Read Mapping With HLS Abstract Pre-alignment filters are useful for reducing the computational requirements of genomic sequence mappers. Most of them are based on estimating or computing the edit distance between sequences and their candidate locations in a reference genome using a subset of the dynamic programming table used to compute Levenshtein distance. Some of their FPGA implementations of use classic HDL toolchains, thus limiting their portability. Currently, most FPGA accelerators offered by heterogeneous cloud providers support C/C++ HLS. This work implements and optimizes several state-of-the-art pre-alignment filters using C/C++ based-HLS to expand their portability to a wide range of systems supporting the OpenCL runtime. A complete analysis of the performance and accuracy is performed. The maximum throughput obtained by an exact filter is 95.1 MPairs/s including memory transfers using 100 bp sequences, which is the highest ever reported for a comparable system and more than two times faster than previous HDL-based results. The best energy efficiency obtained from the accelerator (not considering host CPU) is 2.1 MPairs/J, more than one order of magnitude higher than other accelerator-based comparable approaches from the state of the art.	David Castells-Rufas et al.	Universitat Autònoma de Barcelona	Paper
FPGA Acceleration of Probabilistic Sentential Decision Diagrams with High-Level Synthesis Abstract Probabilistic Sentential Decision Diagrams (PSDDs) provide efficient methods for modeling and reasoning with probability distributions in the presence of massive logical constraints. PSDDs can also be synthesized from graphical models such as Bayesian networks (BNs) therefore offering a new set of tools for performing inference on these models (in time linear in the PSDD size). Despite these favorable characteristics of PSDDs, we have found multiple challenges in PSDD’s FPGA acceleration. Problems include limited parallelism, data dependency, and small pipeline iterations. In this article, we propose several optimization techniques to solve these issues with novel pipeline scheduling and parallelization schemes. We designed the PSDD kernel with a high-level synthesis (HLS) tool for ease of implementation and verified it on the Xilinx Alveo U250 board. Experimental results show that our methods improve the baseline FPGA HLS implementation performance by 2,200X and the multicore CPU implementation by 20X. The proposed design also outperforms state-of-the-art BN and Sum Product Network (SPN) accelerators that store the graph information in memory.	Young-kyu Cho et al.	UCLA	Paper
FPGA HLS Today: Successes, Challenges, and Opportunities Abstract The year 2011 marked an important transition for FPGA high-level synthesis (HLS), as it went from prototyping to deployment. A decade later, in this article, we assess the progress of the deployment of HLS technology and highlight the successes in several application domains, including deep learning, video transcoding, graph processing, and genome sequencing. We also discuss the challenges faced by today’s HLS technology and the opportunities for further research and development, especially in the areas of achieving high clock frequency, coping with complex pragmas and system integration, legacy code transformation, building on open source HLS infrastructures, supporting domain-specific languages, and standardization. It is our hope that this article will inspire more research on FPGA HLS and bring it to a new height.	Jason Cong et al.	UCLA	Paper
FPGA Implementation of N-BEATS for Time Series Forecasting using Block Minifloat Arithmetic Abstract The block minifloat (BM) number format uses an 8-bit floating point format with additional shared exponent bias to enable low-precision representation with large dynamic range. While it has been shown that the BM format can support low-precision training of convolutional neural networks such as ResNet on ImageNet at precisions down to 6 bits, its applicability to inference-only applications has not been studied. We present a BM implementation of N-BEATS, a deep neural architecture for univariate time series forecasting. N-BEATS utilises residual and fully connected (FC) blocks to achieve high accuracy. It was found that 8-bit BM had similar area and speed as 8-bit integer arithmetic with NBEATS accuracy similar to 16-bit floating point.	Wenjie Zho et al.	UCLA	Paper
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption Abstract Fully Homomorphic Encryption is a technique that allows computation on encrypted data. It has the potential to drastically change privacy considerations in the cloud, but high computational and memory overheads are preventing its broad adoption. TFHE is a promising Torus-based FHE scheme that heavily relies on bootstrapping, the noise-removal tool that must be invoked after every encrypted gate computation. We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is the first hardware accelerator to heavily exploit the inherent noise present in FHE calculations. Instead of double or single-precision floating-point arithmetic, it implements TFHE bootstrapping entirely with approximate fixed-point arithmetic. Using an in-depth analysis of noise propagation in bootstrapping FFT computations, FPT is able to use noise-trimmed fixed-point representations that are up to 50% smaller than prior implementations using floating-point or integer FFTs. FPT's microarchitecture is built as a streaming processor inspired by traditional streaming DSPs: it instantiates high-throughput computational stages that are directly cascaded, with simplified control logic and routing networks. We explore different throughput-balanced compositions of streaming kernels with a user-configurable streaming width in order to construct a full bootstrapping pipeline. FPT's streaming approach allows 100% utilization of arithmetic units and requires only small bootstrapping key caches, enabling an entirely compute-bound bootstrapping throughput of 1 BS / 35 us. This is in stark contrast to the established classical CPU approach to FHE bootstrapping acceleration, which tends to be heavily memory and bandwidth-constrained. FPT is fully implemented and evaluated as a bootstrapping FPGA kernel for an Alveo U280 datacenter accelerator card. FPT achieves almost three orders of magnitude higher bootstrapping throughput than existing CPU-based implementations, and 2.5x higher throughput compared to recent ASIC emulation experiments.	Van Beirendonck et al.	COSIC KU LEUVEN	Paper
In-depth FPGA accelerator performance evaluation with single node benchmarks from the HPC challenge benchmark suite for Intel and Xilinx FPGAs using OpenCL Abstract In-depth evaluation of the HPCC benchmark suite for FPGAs. We look into the power consumption and efficiency of the benchmarks. Also, we evaluate the impact of different floating-point precisions on the performance and resource utilization and give an example how the benchmarks can be used to evaluate the behavior of the underlying runtime environments.	Marius Meyer et al.	Paderborn University	Paper GitHub
Network Attached FPGAs in the Open Cloud Testbed (OCT) Abstract The Open Cloud Testbed (OCT) provides nodes with Field Programmable Gate Arrays (FPGAs) that are under the complete control of the user and are directly attached to a network switch via two 100Gbps connections. We provide TCP and UDP stacks on the FPGAs. In addition, users have the ability to experiment with their own protocol. We present several experiments which make use of this capability including TCP throughput measurements, an encryption/decryption example, and machine learning inference split across two FPGAs where the images are input on one node and the labelled output available on a second node. The testbed is available for researchers to perform their own experiments, and includes a development platform that allows users to create FPGA applications. Network measurement results show we achieve close to peak bandwidth by tuning appropriate parameters.	Suranga Handagala et al.	Northeastern University	Paper
OverGen: Improving FPGA Usability through Domain-specific Overlay Generation Abstract FPGAs have been proven to be powerful computational accelerators across many types of workloads. The mainstream programming approach is high level synthesis (HLS), which maps high-level languages (e.g. C+ #pragmas) to hardware. Unfortunately, HLS leaves a significant programmability gap in terms of reconfigurability, customization and versatility: Although HLS compilation is fast, the downstream physical design takes hours to days; FPGA reconfiguration time limits the time-multiplexing ability of hardware, and tools do not reason about cross-workload flexibility. Overlay architectures mitigate the above by mapping a programmable design (e.g. CPU, GPU, etc.) on top of FPGAs. However, the abstraction gap between overlay and FPGA leads to low efficiency/utilization. Our essential idea is to develop a hardware generation framework targeting a highly-customizable overlay, so that the abstraction gap can be lowered by tuning the design instance to applications of interest. We leverage and extend prior work on customizable spatial architectures, SoC generation, accelerator compilers, and design space explorers to create an end-to-end FPGA acceleration system. Our novel techniques address inefficient networks between on-chip memories and processing elements, as well as improving DSE by reducing the amount of recompilation required. Our framework, OverGen, is highly competitive with fixed-function HLS-based designs, even though the generated designs are programmable with fast reconfiguration. We compared to a state-of-the-art DSE-based HLS framework, AutoDSE. Without kernel-tuning for AutoDSE, OverGen gets 1.2× geomean performance, and even with manual kernel-tuning for the baseline, OverGen still gets 0.55× geomean performance--all while providing runtime flexibility across workloads.	Sihao Li et al.	UCLA	Paper
Pyxis: An Open-Source Performance Dataset of Sparse Accelerators Abstract Customized accelerators provide gains of performance and efficiency in specific domains of applications. Sparse data structures and/or representations exist in a wide range of applications. However, it is challenging to design accelerators for sparse applications because no architecture or performance-level analytic models are able to fully capture the spectrum of the sparse data. Accelerator researchers rely on real execution to get precise feedback for their designs. In this work, we present PYXIS, a performance dataset for customized accelerators on sparse data. PYXIS collects accelerator designs and real execution performance statistics. Currently, there are 73.8 K instances in PYXIS. PYXIS is open-source, and we are constantly growing PYXIS with new accelerator designs and performance statistics. PYXIS can be a benefit to researchers in the fields of accelerator, architecture, performance, algorithm and many related topics.	Linghao Song et al.	UCLA	Paper
RapidStream: Parallel Physical Implementation of FPGA HLS Designs Abstract FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing then stitch the separate partitions together. We outline a number of technical challenges and address them by breaking the conventional boundaries between different stages of the traditional FPGA tool flow and reorganizing them to achieve a fast end-to-end compilation. Our research produces RapidStream, a parallelized and physicalintegrated compilation framework that takes in an HLS dataflow program in C/C++ and generates a fully placed and routed implementation. When tested on the Xilinx U250 FPGA with a set of realistic HLS designs, RapidStream achieves a 5-7× reduction in compile time and up to 1.3× increase in frequency when compared to a commercial-off-the-shelf toolchain. In addition, we provide preliminary results using a customized open-source router to reduce the compile time up to an order of magnitude in the cases with lower performance requirements.	Licheng Guo et al.	UCLA	Paper
ReGraph: Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines Abstract Proposes a resource-efficient heterogeneous pipeline architecture. This heterogeneous architecture comprises of two types of pipelines: Little pipelines to process dense partitions with good locality and Big pipelines to process sparse partitions with the extremely poor locality. Unlike traditional monolithic pipeline designs, the heterogeneous pipelines are tailored for more specific memory access patterns, and hence are more lightweight, allowing the architecture to scale up to more effectively with limited resources. In addition, an automatic method generates the most efficient pipeline combination and balances workloads. Furthermore, ReGraph is an automated open-source framework. ReGraph outperforms state-of-the-art FPGA accelerators by up to 5.9 times in terms of performance and 12 times in terms of resource efficiency.	Xinyu Che et al.	NUS	Paper
Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication Abstract Sparse matrix-vector multiplication (SpMV) multiplies a sparse matrix with a dense vector. SpMV plays a crucial role in many applications, from graph analytics to deep learning. The random memory accesses of the sparse matrix make accelerator design challenging. However, high bandwidth memory (HBM) based FPGAs are a good fit for designing accelerators for SpMV. In this paper, we present Serpens, an HBM based accelerator for general-purpose SpMV, which features memory-centric processing engines and index coalescing to support the efficient processing of arbitrary SpMVs. From the evaluation of twelve large-size matrices, Serpens is 1.91x and 1.76x better in terms of geomean throughput than the latest accelerators GraphLiLy and Sextans, respectively. We also evaluate 2,519 SuiteSparse matrices, and Serpens achieves 2.10x higher throughput than a K80 GPU. For the energy/bandwidth efficiency, Serpens is 1.71x/1.99x, 1.90x/2.69x, and 6.25x/4.06x better compared with GraphLily, Sextans, and K80, respectively. After scaling up to 24 HBM channels, Serpens achieves up to 60.55 GFLOP/s (30,204 MTEPS) and up to 3.79x over GraphLily.	Linghao Song et al.	UCLA	Paper GitHub
Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication Abstract Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM is faced with three challenges – (1) the random memory accessing and unbalanced load in processing because of random distribution of elements in sparse matrices, (2) inefficient data handling of the large matrices which can not be fit on-chip, and (3) a non-general-purpose accelerator design where one accelerator can only process a fixed-size problem. In this paper, we present Sextans, an accelerator for general purpose SpMM processing. Sextans accelerator features (1) fast random access using on-chip memory, (2) streaming access to offchip large matrices, (3) PE-aware non-zero scheduling for balanced workload with an II=1 pipeline, and (4) hardware flexibility to enable prototyping the hardware once to support SpMMs of different size as a general-purpose accelerator. We leverage high bandwidth memory (HBM) for the efficient accessing of both sparse and dense matrices. In the evaluation, we present an FPGA prototype Sextans which is executable on a Xilinx U280 HBM FPGA board and a projected prototype Sextans-P with higher bandwidth competitive to V100 and more frequency optimization. We conduct a comprehensive evaluation on 1,400 SpMMs on a wide range of sparse matrices including 50 matrices from SNAP and 150 from SuiteSparse. We compare Sextans with NVIDIA K80 and V100 GPUs. Sextans achieves a 2.50x geomean speedup over K80 GPU and Sextans-P achieves a 1.14x geomean speedup over V100 GPU (4.94x over K80).	Linghao Song et al.	UCLA	Paper
Shuhai: A Tool for Benchmarking High Bandwidth Memory on FPGAs Abstract FPGAs are starting to be enhanced with High Bandwidth Memory (HBM) as a way to reduce the memory bandwidth bottleneck encountered in some applications and to give the FPGA more capacity to deal with application state. However, the performance characteristics of HBM are still not well specified, especially in the context of FPGAs. In this paper, we bridge the gap between nominal specifications and actual performance by benchmarking HBM on a state-of-the-art FPGA, i.e., a Xilinx Alveo U280 featuring a two-stack HBM subsystem. To this end, we propose Shuhai, a benchmarking tool that allows us to demystify all the underlying details of HBM on an FPGA. FPGA-based benchmarking should also provide a more accurate picture of HBM than doing so on CPUs/GPUs, since CPUs/GPUs are noisier systems due to their complex control logic and cache hierarchy. Since the memory itself is complex, leveraging custom hardware logic to benchmark inside an FPGA provides more details as well as accurate and deterministic measurements. We observe that 1) HBM is able to provide up to 425 GB/s memory bandwidth, and 2) how HBM is used has a significant impact on performance, which in turn demonstrates the importance of unveiling the performance characteristics of HBM so as to select the best approach. As a yardstick, we also apply Shuhai to DDR4 to show the differences between HBM and DDR4. Shuhai can be easily generalized to other FPGA boards or other generations of memory, e.g., HBM3, and DDR3. We will make Shuhai open-source, benefiting the community.	Zeke Wang et al.	Zhejiang University	Paper
StreamGCN: Accelerating Graph Convolutional Networks with Streaming Processing Abstract While there have been many studies on hardware acceleration for deep learning on images, there has been a rather limited focus on accelerating deep learning applications involving graphs. The unique characteristics of graphs, such as the irregular memory access and dynamic parallelism, impose several challenges when the algorithm is mapped to a CPU or GPU. To address these challenges while exploiting all the available sparsity, we propose a flexible architecture called StreamGCN for accelerating Graph Convolutional Networks (GCN), the core computation unit in deep learning algorithms on graphs. The architecture is specialized for streaming processing of many small graphs for graph search and similarity computation. The experimental results demonstrate that StreamGCN can deliver a high speedup compared to a multi-core CPU and a GPU implementation, showing the efficiency of our design.	Atefeh Sohrabizade et al.	UCLA	Paper
ThunderGP: Resource-Efficient Graph ProcessingFramework on FPGAs with HLS Abstract ThunderGP, an HLS-based graph processing framework on FPGAs, with which developers could enjoy FPGA-accelerated graph processing with no prior knowledge of hardware design. ThunderGP adopts the gather-apply-scatter (GAS) model as the abstraction of various graph algorithms and realizes the model by a build-in highly parallel and memory-efficient accelerator template. ThunderGP on DRAM-based hardware platforms provides 1.9 × ∼ 5.2 × improvement on bandwidth efficiency over the state-of-the-art, while ThunderGP on HBM-based hardware platforms delivers up to 5.2 × speedup over the state-of-the-art RTL-based approach.	Xinyu Che et al.	NUS	Paper GitHub
TopSort: A High-Performance Two-Phase Sorting Accelerator Optimized on HBM-Based FPGAs Abstract The emergence of high-bandwidth memory (HBM) brings new opportunities to boost the performance of sorting acceleration on FPGAs, which was conventionally bounded by the available off-chip memory bandwidth. However, it is nontrivial for designers to fully utilize this immense bandwidth. First, the existing sorter designs cannot be directly scaled at the increasing rate of available off-chip bandwidth, as the required on-chip resource usage grows at a much faster rate and would bound the sorting performance in turn. Second, designers need an in-depth understanding of HBM's characteristics to effectively utilize the HBM bandwidth. To tackle these challenges, we present TopSort, a novel two-phase sorting solution optimized for HBM-based FPGAs. In the first phase, 16 merge trees work in parallel to fully utilize 32 HBM channels’ bandwidth. In the second phase, TopSort reuses the logic from phase one to form a wider merge tree to merge the partially sorted results from phase one. TopSort also adopts HBM-specific optimizations to reduce resource overhead and improve bandwidth utilization. TopSort can sort up to 4 GB data using all 32 HBM channels, with an overall sorting performance of 15.6 GB/s. TopSort is 6.7× and 2.7× faster than state-of-the-art CPU and FPGA sorters.	Weikang Qia et al.	UCLA	Paper

2021

Title & Abstract	Author(s)	Institution	Link
ACCL: FPGA-Accelerated Collectives over 100 Gbps TCP-IP Abstract ACCL is a Vitis kernel and associated Pynq and XRT drivers which together provide MPI-like collectives for AMD FPGAs. ACCL is designed to enable compute kernels resident in FPGA fabric to communicate directly under host supervision but without requiring data movement between the FPGA and host. Instead, ACCL uses Vitis-compatible TCP and UDP stacks to connect FPGAs directly over Ethernet at up to 100 Gbps on Alveo cards.	Zhenhao He et al.	AMD Research Labs	Paper GitHub
AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs Abstract Despite an increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in the achievable frequency between an HLS design and a handcrafted RTL one. A key factor that limits the timing quality of the HLS outputs is the difficulty in accurately estimating the interconnect delay at the HLS level. This problem becomes even worse when large HLS designs are implemented on the latest multi-die FPGAs. To tackle this challenge, we propose AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation. First, our approach provides HLS with a view on the global physical layout of the design, allowing HLS to more easily identify and pipeline the long wires, especially those crossing the die boundaries. Second, by exploiting the flexibility of HLS pipelining, the floorplanner is able to distribute the design logic across multiple dies on the FPGA device without degrading clock frequency. This prevents the placer from aggressively packing the logic on a single die which often results in local routing congestion that eventually degrades timing. Since pipelining may introduce additional latency, we further present analysis and algorithms to ensure the added latency will not compromise the overall throughput. AutoBridge can be integrated into the existing CAD toolflow for AMD FPGAs. In our experiments with a total of 43 design configurations, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average.	Licheng Guo et al.	UCLA	Paper GitHub
AutoSA: A Polyhedral Compiler for High-Performance Systolic Arrays on FPGA Abstract While systolic array architectures have the potential to deliver tremendous performance, it is notoriously challenging to customize an efficient systolic array processor for a target application. Designing systolic arrays requires knowledge for both high-level characteristics of the application and low-level hardware details, thus making it a demanding and inefficient process. To relieve users from the manual iterative trial-and-error process, we present AutoSA, an end-to-end compilation framework for generating systolic arrays on FPGA. AutoSA is based on the polyhedral framework, and further incorporates a set of optimizations on different dimensions to boost performance. An efficient and comprehensive design space exploration is performed to search for high-performance designs. We have demonstrated AutoSA on a wide range of applications, on which AutoSA achieves high performance within a short amount of time. As an example, for matrix multiplication, AutoSA achieves 934 GFLOPs, 3.41 TOPs, and 6.95 TOPs in floating point, 16-bit and 8-bit integer data types on Alveo Alveo U250.	Jie Wang et al.	UCLA	Paper
Distributed Recommendation Inference on FPGA Clusters Abstract Implementation of an efficient distributed recommendation inference on an FPGA cluster that optimizes both the memory-bound embedding layer and the computation-bound fully-connected layers. The system achieves a maximum speed up of 28.95x, while guaranteeing very low latency.	Yu Zhu et al.	ETH Zurich	Paper
EasyNet: 100 Gbps Network for HLS Abstract Integration of an open-source 100 Gbps TCP/IP stack into Vitis without degrading its performance. A set of MPI-like communication primitives are provided to abstract away low level details of the networking stack.	Zhenhao He et al.	ETH Zurich	Paper GitHub
Elastic-DF: Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning Abstract Elastic-DF allocates FPGA resources to DNN layers and layers to individual FPGA dies to maximize the total performance of the multi-FPGA system. In the resulting Elastic-DF mapping, the accelerator may be instantiated multiple times, and each instance may be segmented across multiple FPGAs transparently, whereby the segments communicate peer-to-peer through 100 Gbps Ethernet FPGA infrastructure, without host involvement.	Tobias Alonso et al.	AMD Research Labs	Paper
Extending High-Level Synthesis for Task-Parallel Programs Abstract C/C++/OpenCL-based high-level synthesis (HLS) becomes more and more popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register transfer level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other	Yuze Chi et al.	UCLA	Paper

2020

Title & Abstract	Author(s)	Institution	Link
Do OS abstractions make sense on FPGAs? Abstract To what extent do traditional OS abstractions make sense in the context of an FPGA as part of a hybrid system? This paper introduces Coyote which supports secure spatial and temporal multiplexing of the FPGA between tenants, virtual memory, communication, and memory management inside a uniform execution environment.	Dario Korolija et al.	ETH Zurich	Paper
EMOGI: efficient memory-access for out-of-memory graph-traversal in GPUs Abstract Sparse-matrix computation	Seung Won Min et al.	UIUC	Paper
Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite Abstract OpenCL-based open-source implementation of the HPCC benchmark suite for FPGAs.	Marius Meyer et al.	Paderborn University	Paper
FReaC Cache: Folded-logic Reconfigurable Computing in the Last Level Cache Abstract Energy efficient computation	Ashutosh Dhar et al.	UIUC	Paper
Making Search Engines Faster by Lowering the Cost of Querying Business Rules Through FPGAs Abstract Explore how to use hardware acceleration to (i) improve the performance of the MCT module (lower latency, higher throughput); and (ii) reduce the amount of computing resources needed	Fabio Maschi et al.	ETH Zurich	Paper
Portable Linear Algebra on FPGA using Data-Centric Parallel Programming Abstract 2020 XOHW Winner PhD	Manuel Burger et al.	ETH Zurich	Paper
Specializing the network for scatter-gather workloads Abstract Explore hardware-offload of the scatter-gather primitive. This approach not only virtually eliminates CPU usage, but with suitable scheduling of responses, it also speeds up scatter by allowing parallel queries	Catalina Alvarez et al.	ETH Zurich	Paper
Weighing up the new kid on the block: Impressions of using Vitis for HPC software development Abstract Vitis case study using Himeno benchmark as a vehicle for exploring the Vitis platform for building, executing and optimizing HPC codes	Nick Brown et al.	The University of Edinburgh	Paper

2019

Title & Abstract	Author(s)	Institution	Link
AcMC²: Accelerating Markov Chain Monte Carlo Algorithms for Probabilistic Models Abstract Compiler development transforming probabilistic models into optimized hardware accelerators	Subho S. Banerjee et al.	UIUC	Paper
Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs Abstract Open-source automated tool chain called Cloud-DNN. Our tool chain takes trained CNN models specified in Caffe as input, performs a set of transformations, and maps the model to a cloud-based FPGA. Cloud-DNN can significantly improve the overall design productivity of CNNs on FPGAs while satisfying the emergent computational requirements.	Yao Chen et al.	NUS	Paper
Flexible Communication Avoiding Matrix Multiplication on FPGA with HLS Abstract A flexible, fully HLS-based, high-performance matrix multiplication accelerator, capable of efficiently utilizing all available resources on the target device, including for multi-SLR FPGAs.	Johannes de Fine Licht et al.	ETH Zurich	Paper
High-Performance Distributed Memory Programming on Reconfigurable Hardware Abstract SMI is an API that unifies the flexibility and single-program, multiple-data approach of MPI with the streaming programming model of spatial architectures.	Tiziano De Matteis et al.	ETH Zurich	Paper
hlslib: Software Engineering for Hardware Design Abstract A collection of extensions for Vitis to improve developer quality of life, including CMake integration, better vectorization support, support for simulating dataflow kernels with feedback dependencies.	Johannes de Fine Licht et al.	ETH Zurich	Paper
Inductive-bias-driven Reinforcement Learning for Efficient Schedules in Heterogeneous Clusters Abstract System schedulers	Subho S. Banerjee et al.	UIUC	Paper
Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures Abstract Enables high-level programming of FPGAs from Python using the dataflow-based SDFG representation, allowing productive optimization of programs via provided graph transformations without modifying the input program, and code generating highly efficient FPGA kernels.	Tal Ben-Nun et al.	ETH Zurich	Paper

2018

Title & Abstract	Author(s)	Institution	Link
FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks Abstract Framework for Quantized Neural Networks on reconfigurable hardware	Michaela Blott et al.	Xilinx Inc.	Paper
Transformations of High-Level Synthesis Codes for High-Performance Computing Abstract A survey of important source-to-source optimization techniques for high-throughput HLS codes to target pipelining, parallelism, and memory bandwidth utilization.	Johannes de Fine Licht et al.	ETH Zurich	Paper

2017

Title & Abstract	Author(s)	Institution	Link
Architectural optimizations for high performance and energy efficient Smith-Waterman implementation on FPGAs using OpenCL Abstract Smith-Waterman: A key bio-informatics algorithm	Lorenzo Di Tucci et al.	Xilinx Inc. and Politecnico di Milano	Paper

2016

Title & Abstract	Author(s)	Institution	Link
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Abstract Framework for Binarized Neural networks on reconfigurable hardware	Yaman Umuroglu et al.	Xilinx Inc.	Paper