August 2016: I am looking for motivated PhD students with backgrounds in digital design/FPGAs to join my research group. Please contact me if interested.
Energy efficiency has become the limiting factor of current and future computing performance, affecting computing systems of all kinds, from mobile devices to datacenters. Meanwhile, modern applications continue to grow more complex and computationally expensive, while relying on larger amounts of data. This presents a considerable challenge: how can we continue to improve our computational capabilities in spite of these limitations?
A key technique to improve energy efficiency and reach high performance is hardware specialization. Recently, there has been much interest in using field-programmable gate arrays (FPGAs) as accelerators in general-purpose computing environments. Their fine-grained parallel structures allow them to exploit the benefits of hardware-level customization while remaining reprogrammable.
However, the biggest obstacle limiting the growth of FPGAs is the difficulty of implementing algorithms in hardware and integrating this hardware into real-world computer systems. My research aims to address these difficulties by combining digital hardware design with compilers, tools, and domain-specific languages. More specifically, my work explores how we can use computer-based tools to make digital hardware more efficient, how we can reduce the effort needed to design, optimize, and verify digital systems, and how these technologies can be exploited to address key challenges in modern computing.
Below you will find high-level descriptions of my current research work and information for a few selected papers. For a full list of papers please see my Publications page.
Deep learning and convolutional neural networks (CNNs) have revolutionized machine learning, leading to recent advances in several areas such as natural language processing and computer vision, and to widespread interest from industry and academia. However, these advances come at a steep computational cost. The goal of this project is to enable implementation of large-scale deep learning applications on a scalable parallel “cloud” of FPGAs by automating the translation from straightforward algorithmic specifications of deep learning problems into optimized hardware, parallelized across many interconnected FPGAs.
This work is funded by the National Science Foundation's Exploiting Parallelism and Scalability (XPS) program.
Fused Layer CNN Accelerators [PDF]
Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder
To appear at the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) 2016.
Overcoming Resource Underutilization in Spatial CNN Accelerators [PDF]
Yongming Shen, Michael Ferdman, and Peter Milder
To appear at the International Conference on Field-Programmable Logic and Applications (FPL) 2016.
To reduce the difficulty of implementing FPGA and ASIC accelerators, researchers have proposed a number of different types of automated systems. Some of these take the form of parameterized IP (intellectual property) cores, which are implementations of a given problem created by an expert, with a small amount of flexibility provided through parameters. At the other end of the spectrum are tools such as “high-level synthesis” (HLS) that aim to convert C or C++ code directly into hardware. In practice, typical parameterized IPs are too restrictive, forcing designers into a “one-size-fits-all” approach; meanwhile, HLS is too open-ended: by trying to work well for all problems, it makes good solutions too difficult to produce.
My work aims to address these problems through the use of domain-specific hardware generation tools. These tools target a specific domain of problems (e.g. linear DSP transforms), providing enough flexibility to work well for a variety of different problems in the domain, while being targeted enough that they can produce very good results with little effort from the end user. One example of this is my work on the Spiral hardware generation framework, a domain-specific hardware generation tool for linear signal processing transforms such as the fast Fourier transform. This system uses a mathematical domain-specific language (DSL) to optimize transform algorithm hardware; its results are competitive with (and often more efficient than) hand-designed systems.
My ongoing work aims to build a flexible framework for creating domain-specific hardware generators, to improve their usability, and to use the results to study new application domains.
Marcela Zuluaga, Peter Milder, and Markus Püschel
ACM Transactions on Design Automation of Electronic Systems, Vol. 21, No. 4, Article 55, May 2016.
Peter Milder, Franz Franchetti, James C. Hoe, and Markus Püschel
ACM Transactions on Design Automation of Electronic Systems, Vol. 17, No. 2, Article 15, April 2012.
Winner, 2014 ACM TODAES Best Paper Award.
Michael K. Papamichael, Peter Milder, and James C. Hoe
Proceedings of Design Automation Conference (DAC), 2015.
"Smart" Design Space Sampling to Predict Pareto-Optimal Solutions [info, PDF]
Marcela Zuluaga, Andreas Krause, Peter A. Milder, and Markus Püschel
Proceedings of Languages, Compilers, Tools and Theory for Embedded Systems, 2012.
See also the Spiral DFT/FFT hardware generator, which produces high-quality designs over a very wide tradeoff space, allowing users to choose designs that best match their implementation-specific tradeoff goals, balancing cost (power, energy, area) against performance (throughput, latency). The system is able to produce cores that compare well with existing designs in the literature or in IP libraries, and it enables higher performance/cost design points than otherwise available.
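As a rough illustration of what choosing from such a tradeoff space can look like, the sketch below filters a handful of candidate designs down to the Pareto-optimal ones (those for which no other candidate is at least as cheap and at least as fast) and then picks the fastest design under an area budget. The design names, the `area` and `throughput` fields, and all numbers are hypothetical and chosen only for illustration; this is not the Spiral generator's interface or its output.

```python
from typing import List, NamedTuple, Optional


class DesignPoint(NamedTuple):
    """One candidate hardware design (all values are hypothetical)."""
    name: str
    area: float        # cost: lower is better
    throughput: float  # performance: higher is better


def dominates(q: DesignPoint, p: DesignPoint) -> bool:
    """True if q is no worse than p on both axes and strictly better on one."""
    no_worse = q.area <= p.area and q.throughput >= p.throughput
    strictly_better = q.area < p.area or q.throughput > p.throughput
    return no_worse and strictly_better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep only the designs that no other design dominates, sorted by area."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(front, key=lambda d: d.area)


def best_under_budget(front: List[DesignPoint],
                      max_area: float) -> Optional[DesignPoint]:
    """Pick the highest-throughput design whose area fits the budget."""
    feasible = [d for d in front if d.area <= max_area]
    return max(feasible, key=lambda d: d.throughput) if feasible else None


# Hypothetical candidates, as a generator might produce for one transform.
candidates = [
    DesignPoint("A", area=1.0, throughput=1.0),
    DesignPoint("B", area=2.0, throughput=3.5),
    DesignPoint("C", area=2.5, throughput=3.0),  # dominated by B
    DesignPoint("D", area=4.0, throughput=6.0),
    DesignPoint("E", area=4.0, throughput=5.0),  # dominated by D
]

front = pareto_front(candidates)
choice = best_under_budget(front, max_area=3.0)
print([d.name for d in front])          # ['A', 'B', 'D']
print(choice.name if choice else None)  # B
```

The pairwise comparison is quadratic in the number of candidates, which is fine for a small set like this one; a real design space would likely call for sampling or a smarter search rather than exhaustive evaluation.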
Datacenters (large-scale computing centers composed of large numbers of servers) have become ubiquitous in modern computing, but are severely power constrained. Although typical datacenter applications are not traditional targets for hardware acceleration, their strict power limits have made FPGA acceleration an attractive option. However, these applications can be challenging to accelerate with FPGAs. The goal of this work is to study how FPGAs can improve the efficiency and speed of large-scale datacenters and their applications.
This work is supported by the Semiconductor Research Corporation.