
D4M: Dynamic Distributed Dimensional Data Model

Dr. Jeremy Kepner

Brief Description

D4M is a breakthrough in computer programming that combines the advantages of five distinct processing technologies (sparse linear algebra, associative arrays, fuzzy algebra, distributed arrays, and triple-store/NoSQL databases such as Hadoop HBase and Apache Accumulo) to provide a database and computation system that addresses the problems associated with Big Data. D4M significantly improves search, retrieval, and analysis for any business or service that relies on accessing and exploiting massive amounts of digital data. Evaluations have shown that D4M simultaneously increases computing performance and decreases the effort required to build applications by as much as 100x. Improved performance translates into faster, more comprehensive services from companies in healthcare, Internet search, network security, and other fields. Less code, and simpler code, reduces development time and cost. Moreover, the D4M layered architecture provides a robust environment that adapts to a variety of databases, data types, and platforms.
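
To give a concrete sense of how these technologies combine, the sketch below uses D4M's documented Assoc constructor in Matlab/Octave to build an associative array from string-keyed triples, query it like a database, and correlate it like a sparse matrix. The row and column keys and the values here are illustrative placeholders, not data from any particular application.

    % Build an associative array from triples; keys are comma-delimited
    % strings, and the trailing comma is the delimiter character.
    row = 'alice,alice,bob,';
    col = 'dept|sales,title|mgr,dept|sales,';
    A   = Assoc(row, col, 1);       % scalar 1 is assigned to every triple

    % Database-style queries return sub-associative arrays:
    Arow = A('alice,', :);          % every column for row 'alice'
    Acol = A(:, 'dept|sales,');     % every row with column 'dept|sales'

    % Linear-algebraic operations compose directly, e.g. correlating rows
    % that share columns via a sparse matrix multiply:
    C = A * A.';
    displayFull(C)                  % render the result as a dense table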

Software Download (includes Patent rights)

[Free Version (GPL)] [U.S. Government Agency] [U.S. Government Contractor] [Commercial End User] [Commercial With Sub-licensing]

Primary Citation (Kepner et al., ICASSP 2012)

Dynamic Distributed Dimensional Data Model (D4M) Database and Computation System, [Paper] [Slides] J. Kepner, W. Arcand, W. Bergeron, N. Bliss, R. Bond, C. Byun, G. Condon, K. Gregson, M. Hubbell, J. Kurz, A. McCabe, P. Michaleas, A. Prout, A. Reuther, A. Rosa & C. Yee, ICASSP (International Conference on Acoustics, Speech, and Signal Processing), Special Session on Signal and Information Processing for "Big Data" (organizers: Bliss & Wolfe), March 25-30, 2012, Kyoto, Japan

Documentation

Please see the eight-lecture course, with many code examples, included with the software download
D4M Class: Signal Processing on Databases (MIT Fall 2012)
D4M Baseball Demo by Dylan Hutchison

Goal

To dramatically reduce the time to develop complex algorithms for analyzing large data sets.

Audience

Data Scientists and Algorithm Developers with a strong background in Linear Algebra.

Currently Supported Environments and Databases

Environments: Matlab and GNU Octave (with the Octave Java package). Databases: triple stores (Apache Accumulo, and potentially HBase) and SQL (via jTDS bindings).
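
As a rough illustration of the database layer, the sketch below binds an associative array to an Accumulo table through D4M's Matlab interface. The host, instance name, credentials, and table name are placeholders, and the exact DBserver argument order may differ between D4M releases, so treat this as an outline rather than a copy-and-paste recipe.

    % Connect to an Accumulo instance (all connection details are placeholders).
    DB = DBserver('zookeeper-host:2181', 'Accumulo', 'instance', 'user', 'password');

    T = DB('MyTable');              % bind to (or create) a table
    put(T, A);                      % insert the associative array A as triples
    B = T('alice,', :);             % queries return Assoc objects, so the same
                                    % algebraic operations apply to database results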