Hardware/software co-design for scientific computing and convolutional neural networks on FPGA-based embedded architectures

FPGAs are increasingly used in embedded heterogeneous architectures because they combine high computational power, low power consumption, flexibility, and adaptability through reconfiguration. The rapid evolution of fields such as Machine Learning, Statistical Computing, and Biomedical Computing, together with the end of the Moore's law era, is shifting the attention of industry and academia towards less traditional computer architectures that can satisfy the ever-increasing need for high computational power and low power consumption. FPGA-based Heterogeneous System Architectures (HSAs) are a promising solution. However, because of the steep learning curve involved in implementing FPGA-based accelerators, few applications currently benefit from them. Although the technology of these devices continues to evolve, the difficulty of using them still prevents their widespread adoption. Integrating FPGA-based hardware accelerators into software applications remains a frustrating task. The available design flow demands specialized expertise and mastery of low-level techniques that are out of reach for most software developers. High Level Synthesis (HLS) tools reduce some of this complexity by supporting High Level Languages (HLLs), but the overall integration process remains complicated and largely manual. To address the programmability and usability challenges of FPGA-based HSAs, this thesis focuses on the formalization and implementation of techniques and tools to develop and exploit accelerators that take advantage of reconfigurable embedded architectures, and to integrate them within more complex heterogeneous infrastructures.

The goal is twofold. First, the thesis aims to provide hardware developers with new workflows and innovative techniques to optimize domain-specific functions and to distribute their designs in a form ready to be integrated by end users. In particular, we focus on accelerating Scientific Computing applications by taking advantage of the programmable logic of the FPGA. We propose FPGA-based hardware/software designs for Audio Signal Alignment, Template Matching, and Biomedical applications that outperform their software versions on embedded devices. Moreover, for applications based on Convolutional Neural Networks, which are generally difficult to implement on embedded systems because of the limited hardware resources of such systems, we propose methodologies for mapping them onto a distributed embedded heterogeneous system, i.e., an infrastructure in which multiple nodes are embedded devices integrating both a CPU and an FPGA. Second, in parallel, the thesis aims to provide software developers with new methodologies to use FPGA-based hardware accelerators transparently and directly from HLLs. The contributions range from the efficient management of the on-chip memory of FPGAs, for which we propose a C++ HLS library that transparently exploits parallel and polymorphic memory accesses, to the automatic runtime management of the heterogeneous system directly from Python, for which we propose an algorithm that predicts the best hardware device on which to execute each function based on the current context of the HSA.

FPGAs are regarded as a valuable computing resource in embedded applications thanks to their performance, their energy efficiency, and their ability to cope with system faults and to adapt. The end of Moore's law and the growth of fields such as Artificial Intelligence and Computational Biology are shifting the interest of industry and academia towards less conventional computing architectures that can meet the ever-growing demand for performance and energy efficiency, and FPGAs are a prime example. However, the number of available applications is limited by the steep learning curve required to build FPGA-based accelerators. However much the technology has matured, difficulty of use remains a barrier that prevents the adoption of these devices in many scenarios. Integrating FPGA-based hardware accelerators into applications is still a laborious and difficult experience. Implementing a design on an FPGA requires specific skills and knowledge of low-level tools that are beyond the reach of most software developers, and although high-level synthesis eases some difficulties, at least by allowing the use of high-level languages, developers must still go through a long and arduous development process. To address the programmability and usability challenges of these FPGA-based heterogeneous systems, this thesis focuses on the formalization and implementation of techniques and tools to develop and exploit accelerators that take advantage of reconfigurable embedded architectures and integrate them into more complex heterogeneous infrastructures.

The goal is twofold. First, the thesis aims to provide hardware developers with new workflows and innovative techniques to optimize domain-specific functions and to distribute their implementations in a form ready to be integrated by end users. In particular, we focus on accelerating scientific computing applications, proposing efficient FPGA-based hardware/software designs for audio signal alignment, template matching, and biomedical applications. Moreover, for applications based on Convolutional Neural Networks, which are generally difficult to implement on embedded systems because of their high hardware resource requirements, we propose methodologies for mapping them onto a distributed heterogeneous embedded infrastructure, i.e., one in which the nodes of the system are embedded devices integrating both a CPU and an FPGA. Second, in parallel, the thesis aims to provide software developers with new methodologies to use FPGA-based hardware accelerators transparently and directly from high-level programming languages. The contributions range from the efficient management of the on-chip memory of FPGAs, for which we propose a C++ library that transparently exploits parallel and polymorphic memory accesses, to the automatic runtime management of the heterogeneous system directly from Python, for which we propose an algorithm that predicts the best hardware device on which to execute each function based on the current context of the heterogeneous architecture.