My research interests are in developing energy-efficient systems for portable multimedia devices. Power and battery life have become critical concerns with the ever-increasing use of multimedia applications, such as video playback, on portable devices. I am interested in exploring power reduction techniques at various stages of the design, including algorithms, architectures, and circuits.
Highly parallel processing and ultra-low-voltage operation are key factors in designing systems for low power. With the continued scaling of transistor technology, local variations are becoming very significant. Moreover, these variations significantly affect logic timing and SRAM functionality at low-voltage operation. Conventional corner-based timing analysis requires large design margins, which reduce the area and power efficiency of the system. To address this problem, I developed a Statistical Static Timing Analysis (SSTA) methodology for timing analysis of ICs operating at ultra-low voltages, as part of my master's thesis. This technique accurately predicts statistical circuit performance in the presence of transistor variations at low voltage.
Reconfigurable Processor for Computational Photography
January 2011 - May 2012
Computational photography applications significantly extend and enhance the capabilities of existing cameras. The high computational complexity of such multimedia processing applications necessitates fast hardware implementations to allow real-time processing. This work implements a reconfigurable multi-application processor to enable energy-efficient real-time computational photography on portable multimedia devices.
The reconfigurable hardware implements bilateral filtering, a non-linear filtering technique with a wide range of computational photography applications. It does so using a bilateral grid structure, which represents an image as a 3D data structure and filters it with a 3D Gaussian kernel. The processor implements High Dynamic Range (HDR) imaging, Glare Reduction, and Low-Light Enhancement, which merges flash and non-flash images so that the natural scene ambience is preserved while achieving high detail and low noise. The filtering engine can also be accessed from off-chip and used with other applications.
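The bilateral grid idea can be illustrated with a minimal software sketch. This is not the processor's architecture: the function name, the grid spacings `sigma_s` and `sigma_r`, the small [1, 2, 1]/4 kernel standing in for the 3D Gaussian, and the nearest-cell lookup are all assumptions made to keep the example short.

```python
from collections import defaultdict

def bilateral_grid_filter(image, sigma_s=2, sigma_r=0.25):
    """Edge-aware smoothing of a 2D grayscale image with values in [0, 1].

    Hypothetical parameters: sigma_s is the grid spacing in pixels,
    sigma_r is the grid spacing in intensity.
    """
    def cell(x, y, v):
        # Map a pixel to its (x, y, intensity) grid cell by rounding.
        return (int(x / sigma_s + 0.5), int(y / sigma_s + 0.5), int(v / sigma_r + 0.5))

    # 1. Splat: accumulate (intensity sum, pixel count) in each grid cell.
    grid = defaultdict(lambda: [0.0, 0.0])
    for y, row in enumerate(image):
        for x, v in enumerate(row):
            acc = grid[cell(x, y, v)]
            acc[0] += v
            acc[1] += 1.0

    # 2. Blur: separable [1, 2, 1] / 4 kernel along each of the three axes
    #    (a coarse stand-in for a 3D Gaussian kernel).
    for axis in range(3):
        blurred = defaultdict(lambda: [0.0, 0.0])
        for key, (vsum, wsum) in grid.items():
            for offset, k in ((-1, 0.25), (0, 0.5), (1, 0.25)):
                nkey = list(key)
                nkey[axis] += offset
                acc = blurred[tuple(nkey)]
                acc[0] += k * vsum
                acc[1] += k * wsum
        grid = blurred

    # 3. Slice: read each pixel back from its own cell and normalize.
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, v in enumerate(row):
            vsum, wsum = grid[cell(x, y, v)]
            out_row.append(vsum / wsum)
        out.append(out_row)
    return out
```

Because pixels on opposite sides of a strong edge land in distant cells along the intensity axis, the blur averages neighbors within each side without mixing across the edge. A production implementation would typically interpolate between cells when slicing rather than reading the nearest one.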
The implementation significantly accelerates bilateral filtering and enables various edge-aware image processing applications in real time on HD images. The processor, implemented using 40 nm CMOS technology, is operational from 25 MHz at 0.5 V to 98 MHz at 0.9 V. The test chip achieves 13 megapixel/s throughput while consuming 1.4 mJ/megapixel energy at 0.9 V - a significant energy reduction compared to CPU/GPU implementations.
The processor is combined with an FPGA, which implements a DDR2 memory interface and a USB interface. This allows processor integration with DDR2 memory, a camera, and a host PC, and provides a portable system for live computational photography.
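As a rough consistency check, the throughput and energy-per-megapixel figures quoted for 0.9 V imply an average power of about 18.2 mW while processing. The helper below is a hypothetical illustration of that unit arithmetic, not a measured result.

```python
def average_power_mw(throughput_mpix_per_s, energy_mj_per_mpix):
    """Average power in mW: (megapixel/s) * (mJ/megapixel) = mJ/s = mW."""
    return throughput_mpix_per_s * energy_mj_per_mpix

# Figures quoted above for operation at 0.9 V.
power_mw = average_power_mw(13.0, 1.4)
print(f"approximately {power_mw:.1f} mW")
```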
In the News
Picture Perfect: Quick, efficient chip cleans up common flaws in amateur photographs. (MIT News)
Multi-Standard Low-Power Video Coding
September 2009 - December 2010
This project aims to design a reconfigurable video encoder supporting H.264/AVC High Profile and VC-1 Advanced Profile video coding standards with 4k x 2k resolution at 30 fps, implemented on a single low-power ASIC. The work explores algorithmic, architectural, and circuit-level innovations that can be applied to each of the functional blocks in a multi-standard video encoder to enable low-voltage operation while maintaining performance.
The transform engine is a critical part of a video codec, and increased coding efficiency often comes at the cost of increased complexity in the transform module. In this work we propose a shared, reconfigurable transform engine for the H.264/AVC and VC-1 video coding standards that exploits the structural similarity and symmetry of their transforms. We also propose an approach that eliminates the need for an explicit transpose memory in 2D transforms, and exploit data dependency to reduce power consumption. Ten different versions of the transform engine are implemented in the design, covering variants such as with and without hardware sharing and with and without transpose memory. The design is fabricated in a commercial 45 nm CMOS technology, and all implemented versions are verified. The shared reconfigurable transform engine without transpose memory supports Quad Full-HD (3840x2160) video encoding at 30 fps while operating at 0.52 V, with a measured power of 214 µW.
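The structural similarity between the two standards can be sketched in software for the 4-point case. Both transforms split the input into a symmetric and an antisymmetric part, so one butterfly can serve both with different coefficients. The function names and the coefficient table are illustrative assumptions using the commonly quoted 4-point matrices; scaling, rounding, the larger transform sizes, and the transpose-memory elimination described above are not modeled.

```python
# (c0, c1, c2) coefficients of the 4-point forward transforms.
# Assumed values from the commonly quoted matrices; scaling and rounding omitted.
COEFFS = {"h264": (1, 2, 1), "vc1": (17, 22, 10)}

def transform_1d(x, standard):
    """4-point forward transform using one butterfly shared by both standards."""
    c0, c1, c2 = COEFFS[standard]
    s0, s1 = x[0] + x[3], x[1] + x[2]   # symmetric (even) part
    d0, d1 = x[0] - x[3], x[1] - x[2]   # antisymmetric (odd) part
    return [c0 * (s0 + s1), c1 * d0 + c2 * d1, c0 * (s0 - s1), c2 * d0 - c1 * d1]

def transform_2d(block, standard):
    """Separable 2D transform: a 1D pass over rows, then a 1D pass over columns."""
    rows = [transform_1d(r, standard) for r in block]
    # zip(*rows) is the software analogue of a transpose between the two passes.
    cols = [transform_1d(list(c), standard) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

In this sketch only the three multiplier constants change between standards, while the additions and subtractions are shared. For a flat 4x4 block, all of the energy ends up in the top-left (DC) coefficient under either coefficient set.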
SSTA Design Methodology for Low Voltage Operation
September 2008 - April 2010
In order to achieve ultra-low power (ULP), ICs are being designed for supply voltages less than 0.5 V. At these low voltages, random dopant fluctuations (RDFs) result in a stochastic component of logic delay that can be comparable to the global corner delay. Moreover, the probability density function (PDF) of this stochastic delay can be highly non-Gaussian. In order to predict the statistical impact of RDF-induced local variations on logic timing, it is necessary to incorporate these effects into a timing closure methodology. This work proposes a computationally efficient methodology for stochastic characterization of standard cell libraries at low voltage, where the cell delay is a nonlinear function of the transistor random variables (RVs), and the resulting cell delay has a non-Gaussian PDF. It also presents a computationally efficient methodology for computing any point on the PDF of a timing path (TP) delay, in the case where cell delays are non-Gaussian. The method is called Operating Point Analysis (OPA). This work develops the general OPA theory and applies it to cell library characterization, timing path analysis, and full-chip timing closure. The approach has been implemented using commercial CAD tools and integrated into a commercial IC design flow.
The approach is validated by comparison to Monte Carlo simulation. At 0.5 V, the OPA approach gives timing results within 5% of Monte Carlo analysis, compared to errors on the order of 50% with Gaussian SSTA. The OPA-based timing closure methodology is used in the design of a 28 nm DSP SoC operating down to 0.6 V, in order to more accurately model the effects of local variation and achieve a reliable design with minimal pessimism.
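The motivation for a non-Gaussian treatment can be illustrated with a small Monte Carlo experiment. This sketch is not OPA. It assumes a hypothetical cell model in which delay grows exponentially with a Gaussian threshold-voltage shift, and the function name and all parameter values are made up for illustration. It compares the sampled upper-tail path delay with the value a Gaussian fit (mean plus three standard deviations) would predict.

```python
import math
import random
import statistics

def path_delay_tails(n_cells=10, sigma_vt=0.030, n_vt=0.040, samples=20000, seed=1):
    """Return (Monte Carlo, Gaussian-fit) estimates of the 99.865th-percentile
    delay of a path of n_cells cells, in units of the nominal cell delay.

    Hypothetical model: each cell delay is exp(dVt / n_vt), with the local
    threshold shift dVt drawn from a Gaussian with standard deviation sigma_vt.
    """
    rng = random.Random(seed)
    delays = []
    for _ in range(samples):
        delays.append(sum(math.exp(rng.gauss(0.0, sigma_vt) / n_vt)
                          for _ in range(n_cells)))
    delays.sort()
    mc_tail = delays[int(0.99865 * samples)]  # empirical "3-sigma" point
    gauss_tail = statistics.mean(delays) + 3.0 * statistics.stdev(delays)
    return mc_tail, gauss_tail

mc, gauss = path_delay_tails()
print(f"Monte Carlo tail: {mc:.2f}, Gaussian estimate: {gauss:.2f}")
```

Under this assumed model the delay distribution is right-skewed, so the Gaussian fit underestimates the slow tail. The size of the gap depends entirely on the chosen parameters and should not be read as a reproduction of the error figures reported above.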
- MTL Annual Research Conference Presentation Award - 'Reconfigurable Processor for Computational Photography', 2013
- Ernst Guillemin Award for best S.M. thesis in Electrical Engineering, MIT, 2010
- MTL Annual Research Conference Presentation Award - 'Low-Power Multi-Standard Ultra-HD Video Codec', 2010
- Best B.Tech. Thesis Award, IIT Kharagpur, 2008
- InfoUSA Summer Research Fellowship, University of Southern California (USC), 2007