The Human Interface to Large Multimedia Databases
by Ben Howell Davis, Linn Marks, Dave Collins, Robert Mack, Peter Malkin, and Tam Nguyen
from the SPIE (The International Society of Optical Engineering) Conference: High-Speed Networking and Multimedia Computing, 1994.
The emergence of high-speed networking for multimedia will have the effect of turning the computer screen into a window on a very large information space. As this information space increases in size and complexity, providing users with easy and intuitive means of accessing information will become increasingly important. Providing access to large amounts of text has been the focus of work for hundreds of years and has resulted in the evolution of a set of standards, from the Dewey Decimal System for libraries to the recently proposed ANSI standards for representing information on-line: KIF (Knowledge Interchange Format) and CGs (Conceptual Graphs).1 Certain problems remain unsolved by these efforts, though: how to let users know the contents of the information space, so that they know whether or not they want to search it in the first place; how to facilitate browsing; and, more specifically, how to facilitate visual browsing. These issues are particularly important for users in educational contexts and have been the focus of much of our recent work. In this paper we discuss some of the solutions we have prototyped: specifically, visual menus, visual browsers, and visual definitional sequences.
Imagine walking through a crowded stadium or auditorium, and suddenly picking out a friend's familiar face. This ability to almost instantly recognize complex visual patterns is so commonplace that we sometimes forget how extraordinary it is. Scientists attempting to duplicate abilities such as face recognition using computers concluded two decades ago that "..the ultimate question...remains unanswered...It has once again been clearly shown that the human viewer is a fantastically competent information processor."2 This conclusion has been echoed by other scientists more recently. Steven Pinker, summarizing studies of visual cognition in people and machines, wrote in the mid-1980's that "Recognizing and reasoning about the visual environment is something that people do extraordinarily well; it is often said that in these abilities an average three-year old makes the most sophisticated computer vision system look embarrassingly inept."3
Yet computer science research to date, particularly image database research, has focused largely on replacing rather than exploiting the power of the human visual system. In spite of growing recognition that end users want and need visual representations,4 many developers of large image databases focus on feature recognition algorithms that depend on the use of textual queries for searching libraries of images.5 Some commercial image database systems that permit visual searches are beginning to appear,6 but the use of visual processing in database systems is typically limited to simple iconic representation of functions, rather than the much richer visual representation of content.7
In light of the power of the human visual system to recognize and reason about the visual environment, we have focused in our research on how to design visual interfaces for multimedia databases that exploit these uniquely human abilities. In the work reported here, we describe prototypes of three aspects of visual interfaces: visual menus, visual browsers, and visual definitional sequences. By providing a rich representation of the visual content of databases and exploiting the power of the human visual system, these interface designs provide users with easy and intuitive means of searching for and accessing information, and with enough information about the contents of the information space to decide whether or not to search it in the first place. Each of these prototypes was designed and developed in the absence of high-speed networking for multimedia, but in anticipation of the problems and promise of this emerging technology.
Interfaces that require verbal input from the user have limited usefulness as interfaces for visual databases. First, they do not take advantage of the power of the human visual system. Second, verbal descriptions of visual information are often inadequate: even descriptions that are clear and accurate may fail for aesthetic reasons.
It follows that common models of interface design are inadequate for large multimedia databases. The most common designs are menu-driven, command-driven, and/or direct manipulation. In menu-driven designs, users are provided with a set of menus and make selections from them. In command-driven interfaces, users enter commands by typing them in, or, in state-of-the-art interfaces, by writing on a tablet (handwriting recognition) or talking to the computer (speech recognition). In direct manipulation interfaces, users manipulate icons and other interactive elements of the interface. Menu-driven and direct manipulation interfaces were originally designed as user-friendly alternatives to command-driven interfaces, but the more recent versions of command-driven interfaces -- that is, natural interfaces involving natural language, handwriting, and speech input -- are now user-friendly alternatives to menu-driven interfaces. The problem, however, is that none of these designs is user-friendly in the context of multimedia.
Command-driven interfaces are difficult to use in the context of multimedia, for example, because it is difficult to compose a verbal description of a visual object. Typical menu designs are not very user-friendly for a similar reason: most menus consist of verbal labels. Common menu designs include pull-down menus, consisting of a control bar at the top of the screen that displays the main topics, connected to menus that can be "pulled down" from the control bar to display sub-topics; cascading menus, consisting of a list of main topics on one side linked to one or more lists of sub-topics cascading to the other side; and full-screen menus, consisting of a screen that displays a list of topics.
Numerous studies by cognitive psychologists have demonstrated that there are two basic forms of memory, recognition and recall, and that recognition is faster and more accurate than recall.8 It follows that menu-driven interfaces, which rely on recognition, are usually easier to use than command-driven interfaces, which rely on recall. Although it is usually easier to recognize a verbal label than to compose a verbal description, neither one is optimal for visual databases, because verbal labels cannot adequately represent many visual images. In the words of a common aphorism, "a picture is worth a thousand words". Most menu labels, in contrast, consist of fewer than five words. The optimum menu design for a database of visual information is therefore a visual menu that contains a rich, concise visual representation of the contents of the database.
"Man Ray's Paris Portraits" contains a full-screen visual menu consisting of small still versions of the portraits taken by the photographer, Man Ray.9 Each portrait serves as an index to the full portrait and to the collection of visual and verbal information about the person in the portrait. The menu was designed by Davis at Project Athena, MIT.
The database was based, originally, on Man Ray's Paris Portraits: 1921-39. Man Ray was a photographer who "...did not take photographs, but created them. Each portrait was a separate little adventure; the resultant print a work of art...By the end of the mid-1920s, few of the Parisian social and artistic hierarchy had not crossed the threshold of Man Ray's studio...Surely in the annals of achievement in twentieth century portraiture, Man Ray would have no equal."10
The application was designed to be an electronic, hypermedia version of Man Ray's Paris Portraits: 1921-39, and was developed by the Visual Computing Group at MIT's Project Athena in order to gain insights into how to design an electronic, visual history book based on portraiture: that is, a visual history with a human face, or human interface. The developers began with a small visual database of about 1000 slides, based on the portraits in the book, and added other material available at Harvard and MIT: a section of a silent film by Leger, Ballet Mechanique, obtained from the Harvard Film Archives; slides of various artists' work obtained from the Rotch Visual Library at MIT's School of Architecture and Urban Planning; and some of Stravinsky's Firebird.
How would one enter the electronic book, though? Man Ray's images were the answer. Postage-stamp size images of thirty of the portraits were arrayed on a high-resolution screen (1280 x 1024 pixels) in a full-screen visual menu. Man Ray's portraits of Hemingway, Picasso, Braque, Proust, Duchamp, Breton, Stein, Stravinsky, Leger, Miro, Gris and others were thus transformed into a menu for a database of information about themselves.
When the user selects a portrait -- Leger's, for example -- the screen transforms into a visual and textual catalog. A full-motion video window (640 x 480 pixels) appears with on-screen buttons for slow, fast, step, and real-time speeds. The video clip from Ballet Mechanique is displayed along with text. Various words and phrases referring to the film are highlighted in the text and, as readers select them, the images appear in the video window. Other options allow users to review Leger's paintings from the period, as well as text descriptions.
Users may return to the main screen and select other options from the portraits. They may select Picasso and Miro, for instance, split the screen using the X Window manager, and review the contents of the visual files simultaneously in order to compare paintings made during the same years. Text descriptions are also provided. Stravinsky's file contains a text on Firebird and, by selecting highlighted phrases, users can hear sections of the work. The main screen also allows users to explore further information on Man Ray, himself: paintings, photographs, written material, and, eventually, his films.
Clearly, the optimum interface for a multimedia database based on a work such as Man Ray's Paris Portraits: 1921-39, is a visual menu. Such a visual menu provides users with a rich visual introduction, and one that is aesthetically appropriate for the information in the database.
Visual menus of the kind designed for the Man Ray application function as more than menus: they also function as visual browsers, since the size of the database, the size of the monitor, and the screen resolution make it possible to index a large portion of the collection on a single screen. This will not be true for the next generation of digital multimedia databases, however, for they will contain larger amounts of information and will require multi-screen browsers, accessible by scrolling or paging from one screen to another.
Why is it important to provide users with information about the contents of the information space? One reason is to enable them to decide whether or not they want to search through the database and, if so, what they can expect to find. In a library a user can browse through stacks and easily find books; in a video store, a consumer can get an overview of a film by taking a quick look at the cover of a box. What will the equivalent be in the digital library, or digital video library, of the future?
Consider book design, where similar design problems have been worked out over several centuries. Before readers open a book, they see the title, the author's name, and other information on the front or back: an illustration, a photograph, a summary, or reviewers' comments. After they open the book, they see information on the inside of the book jacket, in the table of contents, and in the preface or introduction. They might skim the first few pages to find out more about the content or about the author's point of view. They might flip through the book, looking for photographs, diagrams, or other visual information. Regardless of the particular order in which they look at the book design or the amount of attention they give to any specific aspect of it, they absorb enough information about the book to decide whether or not they want to read it.
In contrast, when users look at a multimedia application, they typically see a menu providing an overview of the principal content of the application on the first or second screen. Few multimedia interfaces provide methods of skimming text, flipping through stills, or scanning large amounts of video to get an overview. Therefore, one or two screens provide almost all of the information on content that users see before they start to use an application. This is sufficient when the contents of a single-screen menu can index a large portion of the database, as in the case of the visual menu for the application based on Man Ray's portraits. But it is not sufficient when the information displayed on a single screen cannot adequately represent the contents of the database. Navigating through a database involves hard work: users have to make decisions about what to see many times, not just once, as in the case of books. When readers decide to read a book, they have no more choices to make: they start at the beginning and read through to the end. Good writers guide them easily from one thought or insight to the next, from one character or setting to the next. In contrast, navigating through a multimedia information space involves making multiple choices about what to read, view, or hear.
In the case of visual and multimedia databases, an interface can facilitate navigation by providing users with visual browsers. Two kinds of "compression" are required to design visual browsers: spatial compression and temporal compression. The visual menu designed for the Man Ray application consists of spatial compression: still images reduced to postage-stamp size and arrayed on a screen. Still images are two-dimensional, so this kind of two-dimensional spatial compression is sufficient. Video consists of three-dimensional visual information, however, not two-dimensional information. Therefore, in the case of video, temporal compression is required as well as spatial compression. From a full-motion (thirty frames per second) film that is two hours in length, for example, a single frame or a small number of frames has to be selected to represent the entire film.
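The two kinds of compression can be made concrete with a brief sketch. The Python fragment below is a modern illustration, not part of any system described in this paper, and all function names are ours: it reduces a clip (a sequence of frames, each a two-dimensional array of pixel values) to a picon by first selecting a single representative frame (temporal compression) and then downsampling it to postage-stamp size (spatial compression).

```python
def spatial_compress(frame, thumb_w, thumb_h):
    """Reduce one frame (a 2D list of pixel values) to postage-stamp size
    by nearest-neighbor sampling -- the spatial compression of a picon."""
    src_h, src_w = len(frame), len(frame[0])
    return [
        [frame[(y * src_h) // thumb_h][(x * src_w) // thumb_w]
         for x in range(thumb_w)]
        for y in range(thumb_h)
    ]

def temporal_compress(clip):
    """Select one representative frame from a linear sequence of frames.
    Taking the middle frame is only a placeholder: choosing the frame
    that best represents a two-hour film is an editorial judgment."""
    return clip[len(clip) // 2]

def make_picon(clip, thumb_w=64, thumb_h=48):
    """Combine both compressions: one small still stands in for a clip."""
    return spatial_compress(temporal_compress(clip), thumb_w, thumb_h)
```

The spatial step is mechanical; the temporal step is where the design problem lives, since a single frame must summarize an entire linear sequence.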
The Knowledge and Collaboration Machine (KCM) contains a visual browser consisting of a set of small stills or "picons" ("picon" = picture icon), where each of the picons represents a video clip in the database. The browser was designed and developed by Marks, Collins, Mack, and Malkin at IBM Research.
The goal of KCM was to create an interactive multimedia database and authoring environment that would enable users to perform typical research and writing tasks.11 Users can do research, for example, by searching a multimedia database. They can retrieve and store subsets of the database to use in composing multimedia documents, such as term papers. The initial focus of the project was to develop this kind of environment for undergraduate engineering students studying a problem from the real world: the expansion of the San Diego International Airport and its impact on the surrounding communities. Later, the target users were changed from undergraduates to professional interface designers, and the content was changed from expansion of the San Diego International Airport to IBM's Common User Access (CUA) Interface Guidelines. It is important to note that despite the change of users and content, the design of the system changed very little, for the utilities created for the first application were appropriate for the second one, as well.
KCM was designed as a distributed system and includes an object-oriented database.12 Text components of multimedia objects, including starting and stopping frame information for video clips, are stored in the database and retrieved on demand. Video is stored on the user's workstation, using optical laserdiscs. Although the architecture is extensible enough to allow storage of digitized video on the database, the available network is not currently fast enough.
In general, the time taken to transmit motion video over a network and play it on a local workstation makes browsing expensive. Reducing the network cost, as well as the human cost, of browsing large multimedia databases was thus one of the objectives of the KCM architecture. A lightweight surrogate object is associated with each multimedia object in the database. Users can browse through surrogate object collections, requesting the real (usually very large) object only if the surrogate object appears interesting. Much of the architecture in KCM is aimed at making the existence of surrogate objects and the network transparent -- the user perceives that objects, once acquired, are local to the workstation.12,13
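The surrogate-object design can be sketched as a lazy proxy. The class below is an illustrative reconstruction in Python, not KCM's actual implementation, and all names are hypothetical: the surrogate carries only cheap metadata, and the expensive transfer of the real object happens on first access and is then cached, which is what makes the network transparent to the user.

```python
class Surrogate:
    """A lightweight stand-in for a large multimedia object.
    Only cheap metadata travels with the surrogate; the full object
    is fetched on first access and cached thereafter."""

    def __init__(self, object_id, title, fetch):
        self.object_id = object_id
        self.title = title      # cheap metadata, always available for browsing
        self._fetch = fetch     # callable that retrieves the real object
        self._real = None       # cached full object, fetched lazily

    @property
    def real_object(self):
        # The expensive retrieval happens only when interest is expressed,
        # and only once: later accesses hit the local cache.
        if self._real is None:
            self._real = self._fetch(self.object_id)
        return self._real
```

Browsing a collection of such surrogates costs only metadata traffic; the "many are called, but few are chosen" pattern of use means most `real_object` fetches never happen at all.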
Picons are one class of surrogate object in KCM: small picture icons (typically the image of a single frame) associated with a video clip. Picons also carry publication information for the clip and a brief verbal description, both in the form of text. The picon browser allows users to view the picons and related text in successive "pages". Collections are usually generated in response to user queries, but pre-selected collections may also appear as folders on the user's "desktop."
Each page of the picon browser is similar to a visual menu, as in the Man Ray application, but picons are self-referential: that is, they serve as a visual index to the specific video clip they represent, not to related multimedia information. Picons are displayed in six rows and columns, 36 to a page. The user can navigate from page to page using buttons on the screen or the PgUp and PgDn keys on the keyboard. Selection within a page involves the mouse or the keyboard arrow keys. The browser also has a scrollable list of titles: selecting a title pages the browser and positions the cursor on the corresponding picon. When the user selects a particular picon, the title of its associated clip is highlighted in the list, and a short verbal description of the contents of the video clip appears in a text window in the lower left-hand corner of the screen. Descriptions consist of two or three sentences of information not evident from the still image: for example, information about what occurs at a later point in the video clip.
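The paging arithmetic behind this layout is straightforward and can be sketched as follows. This Python fragment is our own illustration of the 6 x 6, 36-per-page scheme described above, with hypothetical function names; it shows how selecting a title both pages the browser and positions the cursor.

```python
PICONS_PER_PAGE = 36   # six rows by six columns, as in the KCM browser
COLUMNS = 6

def page_of(index):
    """Which browser page holds the picon at a given collection index."""
    return index // PICONS_PER_PAGE

def grid_position(index):
    """(row, column) of a picon within its six-by-six page."""
    offset = index % PICONS_PER_PAGE
    return offset // COLUMNS, offset % COLUMNS

def goto_title(titles, title):
    """Selecting a title pages the browser and positions the cursor:
    return (page, row, column) for the picon with the given title."""
    index = titles.index(title)
    return (page_of(index),) + grid_position(index)
```

For example, the 41st clip in a collection (index 40) lands on the second page, top row, fifth column.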
A clip is retrieved from the laserdisc (or, in the future, the network) if and when the user expresses interest by clicking on its picon. Typically, a picon uses at least two orders of magnitude less storage than a video clip, and can be retrieved and displayed in tenths of a second, versus tens of seconds. The user can view and compare 36 picons at a time, adding an additional order of magnitude improvement in performance. Our experience is that "many are called, but few are chosen" -- that is, users will inspect, and quickly reject, many objects before finding one they wish to view in detail. Note that even if networks could transmit video in zero time, at zero cost, the human efficiency of this sort of browsing would still hold and would be an important consideration in the design of a visual browser.
Picons are examples of both spatial and temporal compression: spatial compression since they are small representations of a two-dimensional image, and temporal compression since they are single frames selected from a three-dimensional linear sequence of multiple frames. The short verbal description of the contents of the video clip is another kind of compression: a conceptual compression. This kind of design makes the browser compact, yet highly informative. It provides an information-rich environment that can support students in the sometimes tedious, difficult task of searching through a database for specific information and/or video clips.
To perform the same task in the future, users will most likely need even more information than the KCM picon browser provides. Even now it takes more work to navigate through a multimedia application than to navigate through a book, and the multimedia database of the future will be far more complex: more like a library than a book. Navigating through such an information space will be a meaningless, mindless experience unless the user can construct a coherent and meaningful representation of the information: a cognitive macrostructure.14
To illustrate the importance of macrostructures, imagine two learning experiments. In Experiment A, a rat is in a cage. It presses on a key, opens a door, and learns to run through a maze. In Experiment B, a student is using a computer. S/he clicks on a mouse, opens a window, and learns to navigate through a database. Cognitive scientists like to think that the student is doing something very different from the rat. But if the student is moving from one point in an information space to another without constructing a coherent representation of the content -- mindlessly navigating through a maze -- then the differences may be less important than the similarities. After all, rats are competent navigators: they can learn to navigate, just as students can. But learning to navigate through an information space is not the same as learning about the information space: it is certainly not the same as learning enough to construct a conceptual macrostructure of the information space. This ability, like the ability to recognize complex visual patterns, is uniquely human.
The theoretical basis for macrostructures has been worked out by cognitive scientists principally in terms of discourse processing, but the theory applies to comprehension and recall of sequences of events and images, as well as words.15 Empirical research based on the discourse processing model shows that people use macrostructures to help in comprehension and recall of large amounts of information presented in text.16 This research -- what we might refer to as human-book interaction research -- suggests that skill in constructing macrostructures is important in the comprehension and recall of text, and that skill in designing effective representations of macrostructure is important in the design of usable texts. The fact that macrostructures help us understand and reduce large amounts of information to their gist makes macrostructure theory relevant to issues in interface design, in general; the fact that the basic principles apply to the meaning of information and not to its form (verbal vs. visual) makes them relevant to multimedia interface design, in particular.
Constructing macrostructures is difficult, even in the context of linear text. It is likely to be far more difficult in the context of interactive multimedia, which sometimes has the characteristic of being a collection of "factoids."17 In order to ensure that users learn as they "cybersurf", interface designers need to find ways of providing conceptual tools that will facilitate the construction of macrostructures. This will be particularly important in educational or instructional contexts, where students will be expected to learn new material by using databases in unfamiliar domains.
Representations of macrostructure are a form of conceptual compression, and vice versa. The verbal descriptions that accompany picons in the KCM picon browser are, therefore, a simple form of macrostructure: a condensation. Three kinds of compression can be used in interface design, then, to provide users with the maximum amount of visual information in the minimum amount of space: spatial compression (a small still image), temporal compression (a single frame from a film), and conceptual compression (a concise summary of conceptual content).
Definitions are another form of macrostructure, for they provide a summary of the meaning of a concept or category. When definitions apply to visual categories, they can be rendered in visual form. For example, the "visual definition" of "red" can be rendered as the color red. When definitions apply to visual sequences of information, they can be rendered as "visual definitional sequences". For example, definitions for the concept of speed or for the category of time lapse photography can be rendered more easily in visual sequences than in still images or words. Visual definitional sequences are one of several conceptual tools that can facilitate the construction of macrostructures in the context of interactive multimedia.
"Understanding Visualization" begins with a set of visual definitional sequences for a set of categories that relate to the visible attributes of objects.18 The purpose of the categories is to provide users with a way of parsing visual information, so that they can then intelligently search a database of visual information. Since the categories can be used for this purpose only if users understand what they mean, they need to be defined, and since they are categories for dynamic visual information, they are more easily defined in visual sequences than in still images or words. The sequences were designed and developed by Marks, based on a set of categories developed by Davis.
The "Understanding Visualization" prototype was developed as part of an undergraduate curriculum on visualization in the arts and sciences. The prototype was designed to be a front-end for a database of information about the evolution of visual rendering techniques and technologies in the arts and sciences. What are visual rendering techniques and technologies? Consider perspective, which is a visible attribute of objects. Knowledge of perspective is not innate. The principles governing the construction of one-point perspective, for example, were not discovered until the Renaissance, when they were introduced in the art of Brunelleschi, Donatello, Masaccio, and others, and articulated by the Italian mathematician and painter, Alberti.19 Thus, knowledge of perspective and visual rendering techniques related to perspective -- how to paint a three-dimensional object on a two-dimensional canvas so that it looks three-dimensional -- has evolved over time.
Perspective is one of the four conceptual categories of visual rendering techniques and technologies at the core of the "Understanding Visualization" prototype. The other categories are appearance, dimension, and measurement. In order to use these high-level concepts to understand visualization and to intelligently search through a database of visual information, users need to understand not only the meaning of the categories, but their meaning in the context of visual information. The visual definitional sequences were designed to provide this foundation of understanding. To illustrate and "define" perspective, for example, a visual sequence shows simultaneous, multiple perspectives on the same object. The perspective (camera angle) shifts from outside looking in, to inside looking out, and from right side up to upside down.
The high-level categories are divided into several sub-categories, as well: for appearance -- shape, color, contrast, texture, and opacity; for dimension -- time, speed, and process; for measurement -- relative size and scale; and for perspective -- vantage point and perspective. One of the reasons for using visual sequences, rather than still images, is that sequences can illustrate how concepts relate to one another. To illustrate the various aspects of appearance, for example, visual sequences show alternately changing shapes, colors, contrasts, textures, and opacities. One sequence shows the underwater view of a jellyfish, with varying opacities as the relative position of the light source and the camera vary over time. Another shows a sequence of Landsat photographs of Manhattan taken at different distances and illustrates how shape changes with distance. To illustrate dimension, time lapse sequences show how time, speed, and process can co-vary.
Like the visual menu for Man Ray and the visual browser for KCM, these visual definitional sequences were designed to exploit the power of the human visual system to recognize and reason about the visual environment, and to provide the maximum amount of information possible in the minimum amount of space. As the two-dimensional interface opens a window onto an increasingly larger and more complex three-dimensional information space, the need for highly informative, efficient visual interfaces of this kind will no doubt increase. The current state of the art suggests that interface designers do not yet take into account the wide spectrum of users who will need to use the information superhighway if it is, in fact, to become a superhighway.
1. J. Sowa. "Multimedia Dialog Manager." Talk presented at IBM Research T.J. Watson Research Center. 1993.
2. L.D. Harmon. "The Recognition of Faces." In Held, R. (ed.). Image, Object, and Illusion. pp. 100-112. W.H. Freeman, San Francisco, CA. 1974.
3. S. Pinker. Visual Cognition. MIT Press, Cambridge, MA. 1985.
4. E.J. Farrell (ed.). IBM Journal of Research and Development 35, 1/2. Special issue on "Visual Interpretation of Complex Data." 1991.
5. W.I. Grosky and R. Mehrotra (eds.). Computer (IEEE) 22, 12. Special issue on "Image Database Management." 1989.
6. P. Parisi. "The Picture Exchange." Wired 2, 1. p. 32. 1994.
7. N.C. Shu. Visual Programming. Van Nostrand Reinhold, New York, NY. 1988.
8. J.R. Anderson. The Architecture of Cognition. Harvard University Press, Cambridge, MA. 1983.
9. M.E. Hodges and R.M. Sasnett. Multimedia Computing: Case Studies from MIT Project Athena. Addison-Wesley Publishing Company, Reading, MA. 1993.
10. Man Ray's Paris Portraits: 1921-39. Foreword by T. Baum. Middendorf Publications, Washington, D.C. 1989.
11. R. Mack, P. Malkin, D. Collins, M. Utpat, M. Laff, J. Richards, and L. Marks. "Smalltalk prototyping of CUA 1991 workplace and multimedia information management concepts." Research Report, IBM Research Division, T.J. Watson Research Center (in progress).
12. P. Malkin. "The KCM Distributed Database Management System." Research Report, IBM Research Division, T.J. Watson Research Center. 1992.
13. D. Collins. Designing Object-Oriented User Interfaces. Benjamin/Cummings Publishing Company, Redwood City, CA. 1995 (in press).
14. L. Marks. "Semantic macrostructures and the conceptual shape of multimedia." (Dissertation, in progress).
15. T.A. van Dijk. Macrostructures: An Interdisciplinary Study of Global Structures in Discourse, Interaction, and Cognition. Lawrence Erlbaum Associates, Hillsdale, NJ. 1980.
16. T.A. van Dijk and W. Kintsch. Strategies of Discourse Comprehension. Academic Press, NY. 1983.
17. A.C. Kay. "Computers, networks, and education." Scientific American. 138-148. September, 1991.
18. L. Marks and B. Davis. Integrative Multimedia Design. MIT Press, Cambridge, MA. 1995 (in press).
19. L.B. Alberti. On Painting. Yale University Press, New Haven, CT. 1966 [1436].