The Human Interface to Large Multimedia Databases
by Ben Howell Davis,
Linn Marks, Dave Collins, Robert Mack, Peter Malkin, and Tam Nguyen
from the SPIE (The International Society for Optical
Engineering) Conference: High-Speed Networking and Multimedia Computing, 1994.
Copyright 1994.
The emergence of high-speed networking for multimedia
will have the effect of turning the computer screen into a window on a very
large information space. As this information space increases in size and
complexity, providing users with easy and intuitive means of accessing
information will become increasingly important. Providing access to large
amounts of text has been the focus of work for hundreds of years and has
resulted in the evolution of a set of standards, from the Dewey Decimal System
for libraries to the recently proposed ANSI standards for representing
information on-line: KIF (Knowledge Interchange Format) and CGs (Conceptual
Graphs).1 Certain problems remain unsolved by these efforts, though: how to let
users know the contents of the information space, so that they can decide
whether or not to search it in the first place; how to facilitate browsing; and,
more specifically, how to facilitate visual browsing. These issues are
particularly important for users in educational contexts and have been the
focus of much of our recent work. In this paper we discuss some of the
solutions we have prototyped: specifically, visual menus, visual browsers, and
visual definitional sequences.
Imagine walking through a crowded stadium or auditorium, and suddenly
picking out a friend's familiar face. This ability to almost instantly
recognize complex visual patterns is so commonplace that we sometimes forget
how extraordinary it is. Scientists attempting to duplicate abilities such as
face recognition using computers concluded two decades ago that "..the
ultimate question...remains unanswered...It has once again been clearly shown
that the human viewer is a fantastically competent information
processor."2 This conclusion has been echoed by other scientists more
recently. Steven Pinker, summarizing studies of visual cognition in people and
machines, wrote in the mid-1980s that "Recognizing and reasoning about
the visual environment is something that people do extraordinarily well; it is
often said that in these abilities an average three-year-old makes the most
sophisticated computer vision system look embarrassingly inept."3
Yet computer science research to date, particularly image database research,
has focused largely on replacing rather than exploiting the power of the human
visual system. In spite of growing recognition that end users want and need
visual representations,4 many developers of large image databases focus on
feature recognition algorithms that depend on the use of textual queries for
searching libraries of images.5 Some commercial image database systems that
permit visual searches are beginning to appear,6 but the use of visual
processing in database systems is typically limited to simple iconic
representation of functions, rather than the much richer visual representation
of content.7
In light of the power of the human visual system to recognize and reason
about the visual environment, we have focused in our research on how to design
visual interfaces for multimedia databases that exploit these uniquely human
abilities. In the work reported here, we describe prototypes of three aspects
of visual interfaces: visual menus, visual browsers, and visual definitional
sequences. By providing a rich representation of the visual content of
databases and exploiting the power of the human visual system, these interface
designs provide users with easy and intuitive means of searching for and
accessing information, and with enough information about the contents of the
information space to decide whether or not to search it in the first place.
Each of these prototypes was designed and developed in the absence of
high-speed networking for multimedia, but in anticipation of the problems and
promise of this emerging technology.
Interfaces that require verbal input from the user have limited usefulness
as interfaces for visual databases. First, they do not take advantage of the
power of the human visual system. Second, verbal descriptions of visual
information are often inadequate: even clear and accurate descriptions can
fall short for aesthetic reasons.
It follows that common models of interface design are inadequate for large
multimedia databases. The most common designs are menu-driven, command-driven,
and/or direct manipulation. In menu-driven designs, users are provided with a
set of menus and make selections from them. In command-driven interfaces, users
enter commands by typing them in, or, in state-of-the-art interfaces, by
writing on a tablet (handwriting recognition) or talking to the computer
(speech recognition). In direct manipulation interfaces, users manipulate icons
and other interactive elements of the interface. Menu-driven and direct
manipulation interfaces were originally designed as user-friendly alternatives
to command-driven interfaces, but the more recent versions of command-driven
interfaces -- that is, natural interfaces involving natural language,
handwriting, and speech input -- are now user-friendly alternatives to
menu-driven interfaces. The problem, however, is that none of these designs are
user-friendly in the context of multimedia.
Command-driven interfaces are difficult to use in the context of multimedia,
for example, because it is difficult to compose a verbal description of a
visual object. Typical menu designs are not very user-friendly for a similar
reason: most menus consist of verbal labels. Common menu designs include
pull-down menus, consisting of a control bar at the top of the screen that
displays the main topics, connected to menus that can be "pulled
down" from the control bar to display sub-topics; cascading menus,
consisting of a list of main topics on one side linked to one or more lists of
sub-topics cascading to the other side; and full-screen menus, consisting of a
screen that displays a list of topics.
Numerous studies by cognitive psychologists have demonstrated that there are
two basic forms of memory, recognition and recall, and that recognition is
faster and more accurate than recall.8 It follows that menu-driven interfaces,
which rely on recognition, are usually easier to use than command-driven
interfaces, which rely on recall. Although it is usually easier to recognize a
verbal label than to compose a verbal description, however, neither one is
optimal for visual databases, because verbal labels cannot adequately represent
many visual images. In the words of a common aphorism, "a picture is worth
a thousand words". Most menu labels, in contrast, consist of fewer than
five words. The optimum menu design for a database of visual information is
therefore a visual menu that contains a rich, concise visual representation of
the contents of the database.
"Man
Ray's Paris Portraits" contains a full-screen visual menu consisting
of small still versions of the portraits taken by the photographer, Man Ray.9
Each portrait serves as an index to the full portrait and to the collection of
visual and verbal information about the person in the portrait. The menu was
designed by Davis at Project Athena, MIT.
The database was based, originally, on Man Ray's Paris Portraits: 1921-39.
Man Ray was a photographer who "...did not take photographs, but created
them. Each portrait was a separate little adventure; the resultant print a work
of art...By the end of the mid-1920s, few of the Parisian social and artistic
hierarchy had not crossed the threshold of Man Ray's studio...Surely in the
annals of achievement in twentieth century portraiture, Man Ray would have no
equal."10
The application was designed to be an electronic, hypermedia version of Man
Ray's Paris Portraits: 1921-39, and was developed by the Visual Computing Group
at MIT's Project Athena in order to gain insights into how to design an
electronic, visual history book based on portraiture: that is, a visual history
with a human face, or human interface. The developers began with a small visual
database of about 1000 slides, based on the portraits in the book, and added
other material available at Harvard and MIT: a section of a silent film by
Leger, Ballet Mechanique, obtained from the Harvard Film Archives; slides of
various artists' work obtained from the Rotch Visual Library at MIT's School of
Architecture and Urban Planning; and some of Stravinsky's Firebird.
How would one enter the electronic book, though? Man Ray's images were the
answer. Postage-stamp size images of thirty of the portraits were arrayed on a
high-resolution screen (1280 x 1024 pixels) in a full-screen visual menu. Man
Ray's portraits of Hemingway, Picasso, Braque, Proust, Duchamp, Breton, Stein,
Stravinsky, Leger, Miro, Gris and others were thus transformed into a menu for
a database of information about themselves.
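The grid arithmetic behind a full-screen menu of this kind is simple. The sketch below assumes a 5 x 6 split of the thirty portraits and an illustrative margin; neither figure is specified for the original application:

```python
def menu_grid(screen_w, screen_h, rows, cols, margin=8):
    """Compute the cell size available to each postage-stamp image
    when a screen is divided into a rows x cols visual menu, with a
    fixed margin around every cell."""
    cell_w = screen_w // cols - 2 * margin
    cell_h = screen_h // rows - 2 * margin
    return cell_w, cell_h

# A 1280 x 1024 screen holding thirty portraits in 5 rows of 6:
print(menu_grid(1280, 1024, rows=5, cols=6))  # -> (197, 188)
```

Even at this density, each cell remains large enough for a recognizable face, which is what makes the portraits usable as an index.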
When a user selects a portrait -- Leger's, for example -- the screen transforms
itself into a visual and textual catalog. A full-motion video window (640 x 480
pixels) appears with on-screen buttons for slow, fast, step, and real-time
speeds. The video clip from Ballet Mechanique is displayed along with text. Various
words and phrases referring to the film are highlighted in the text and, as
readers select them, the images appear in the video window. Other options allow
users to review Leger's paintings from the period, as well as text
descriptions.
Users may return to the main screen and select other options from the
portraits. They may select Picasso and Miro, for instance, split the screen
using the X Window manager, and review the contents of the visual files
simultaneously in order to compare paintings made during the same years. Text
descriptions are also provided. Stravinsky's file contains a text on Firebird
and, by selecting highlighted phrases, users can hear sections of the work. The
main screen also allows users to explore further information on Man Ray
himself: paintings, photographs, written material, and, eventually, his films.
Clearly, the optimum interface for a multimedia database based on a work
such as Man Ray's Paris Portraits: 1921-39, is a visual menu. Such a visual
menu provides users with a rich visual introduction, and one that is
aesthetically appropriate for the information in the database.
Visual menus of the kind designed for the Man Ray application function as
more than menus: they also function as visual browsers, since the size of the
database, the size of the monitor, and the screen resolution make it possible
to index a large portion of the collection on a single screen. This will not be
true for the next generation of digital multimedia databases, however, for they
will contain larger amounts of information and will require multi-screen
browsers, accessible by scrolling or paging from one screen to another.
Why is it important to provide users with information about the contents of
the information space? One reason is to enable them to decide whether or not
they want to search through the database and, if so, what they can expect to
find. In a library a user can browse through stacks and easily find books; in a
video store, a consumer can get an overview of a film by taking a quick look at
the cover of a box. What will the equivalent be in the digital library, or
digital video library, of the future?
Consider book design, where similar design problems have been worked out
over several centuries. Before readers open a book, they see the title, the
author's name, and other information on the front or back: an illustration, a
photograph, a summary, or reviewers' comments. After they open the book, they
see information on the inside of the book jacket, in the table of contents, and
in the preface or introduction. They might skim the first few pages to find out
more about the content or about the author's point of view. They might flip
through the book, looking for photographs, diagrams, or other visual
information. Regardless of the particular order in which they look at the book
design or the amount of attention they give to any specific aspect of it, they
absorb enough information about the book to decide whether or not they want to
read it.
In contrast, when users look at a multimedia application, they typically see
a menu providing an overview of the principal content of the application on the
first or second screen. Few multimedia interfaces provide methods of skimming
text, flipping through stills, or scanning large amounts of video to get an
overview. Therefore, one or two screens provide almost all of the information
on content that users see before they start to use an application. This is
sufficient when the contents of a single-screen menu can index a large portion
of the database, as in the case of the visual menu for the application based on
Man Ray's portraits. But it is not sufficient when the information displayed on
a single screen cannot adequately represent the contents of the database.
Navigating through a database involves hard work: users have to make decisions
about what to see many times, not just once, as in the case of books. When
readers decide to read a book, they have no more choices to make: they start at
the beginning and read through to the end. Good writers guide them easily from
one thought or insight to the next, from one character or setting to the next.
In contrast, navigating through a multimedia information space involves making
multiple choices about what to read, view, or hear.
In the case of visual and multimedia databases, an interface can facilitate
navigation by providing users with visual browsers. Two kinds of
"compression" are required to design visual browsers: spatial
compression and temporal compression. The visual menu designed for the Man Ray
application consists of spatial compression: still images reduced to
postage-stamp size and arrayed on a screen. Still images are two-dimensional,
so this kind of two-dimensional spatial compression is sufficient. Video
consists of three-dimensional visual information, however, not two-dimensional
information. Therefore, in the case of video, temporal compression is required
as well as spatial compression. From a full-motion (thirty frames per second)
film that is two hours in length, for example, a single frame or a small number
of frames has to be selected to represent the entire film.
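One simple way to make such a selection is to sample the clip at evenly spaced points. The midpoint-sampling rule below is our illustration, not a description of any particular system:

```python
def representative_frames(duration_s, fps, n):
    """Temporal compression by even sampling: from duration_s seconds
    of video at fps frames per second, return n frame indices taken
    at the midpoints of n equal segments of the clip."""
    total = int(duration_s * fps)
    return [int((i + 0.5) * total / n) for i in range(n)]

# A two-hour film at thirty frames per second, reduced to five frames:
print(representative_frames(2 * 3600, 30, 5))
# -> [21600, 64800, 108000, 151200, 194400]
```

A real system would refine such candidates by hand or by scene-change detection; even sampling only bounds the search.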
The Knowledge and Collaboration Machine (KCM) contains a visual browser
consisting of a set of small stills or "picons" ("picon" =
picture icon), where each of the picons represents a video clip in the
database. The browser was designed and developed by Marks, Collins, Mack, and
Malkin at IBM Research.
The goal of KCM was to create an interactive multimedia database and
authoring environment that would enable users to perform typical research and
writing tasks.11 Users can do research, for example, by searching a multimedia
database. They can retrieve and store subsets of the database to use in
composing multimedia documents, such as term papers. The initial focus of the
project was to develop this kind of environment for undergraduate engineering
students studying a problem from the real world: the expansion of the San Diego
International Airport and its impact on the surrounding communities. Later, the
target users were changed from undergraduates to professional interface
designers, and the content was changed from expansion of the San Diego
International Airport, to IBM's Common User Access (CUA) Interface Guidelines.
It is important to note that despite the change of users and content, the
design of the system changed very little, for the utilities created for the
first application were appropriate for the second one, as well.
KCM was designed as a distributed system and includes an object-oriented
database.12 Text components of multimedia objects, including starting and
stopping frame information for video clips, are stored in the database and
retrieved on demand. Video is stored on the user's workstation, using optical
laserdiscs. Although the architecture is extensible enough to allow storage of
digitized video on the database, the available network is not currently fast
enough.
In general, the time taken to transmit motion video over a network and play
it on a local workstation makes browsing expensive. Reducing the network cost,
as well as the human cost, of browsing large multimedia databases was thus one
of the objectives of the KCM architecture. A lightweight surrogate object is
associated with each multimedia object in the database. Users can browse
through surrogate object collections, requesting the real (usually very large)
object only if the surrogate object appears interesting. Much of the
architecture in KCM is aimed at making the existence of surrogate objects and
the network transparent -- the user perceives that objects, once acquired, are
local to the workstation.12,13
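The surrogate-object pattern described above can be sketched as a lazy proxy. The class and the fetch callable here are illustrative stand-ins, since the paper does not show KCM's actual implementation:

```python
class Picon:
    """Lightweight surrogate for a large multimedia object.

    Carries only a thumbnail and descriptive text; the real clip is
    fetched (from laserdisc or the network) on first request and then
    cached, so browsing never pays the full transfer cost."""

    def __init__(self, title, description, fetch):
        self.title = title
        self.description = description
        self._fetch = fetch   # deferred retrieval of the real object
        self._clip = None     # cached once acquired

    def clip(self):
        """Retrieve the real object on demand, then keep it local."""
        if self._clip is None:
            self._clip = self._fetch()
        return self._clip

# The expensive fetch runs only when the user asks for the clip:
p = Picon("Ballet Mecanique", "Silent film excerpt by Leger",
          fetch=lambda: "<full video data>")
print(p.clip())
```

Caching after the first fetch is what makes the network transparent in the sense the text describes: once acquired, the object behaves as if local.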
Picons are one class of surrogate object in KCM: small picture icons
(typically the image of a single frame) associated with a video clip. Picons
also carry publication information for the clip and a brief verbal description,
both in the form of text. The picon browser allows users to view the picons and
related text in successive "pages". Collections are usually generated
in response to user queries, but pre-selected collections may also appear as
folders on the user's "desktop."
Each page of the picon browser is similar to a visual menu, as in the Man
Ray application, but picons are self-referential: that is, they serve as a
visual index to the specific video clip they represent, not to related
multimedia information. Picons are displayed in six rows and six columns, 36 to a
page. The user can navigate from page to page using buttons on the screen or
the PgUp and PgDn keys on the keyboard. Selection within a page involves the
mouse or the keyboard arrow keys. The browser also has a scrollable list of
titles: selecting a title pages the browser and positions the cursor on the
corresponding picon. When the user selects a particular picon, the title of its
associated clip is highlighted in the list, and a short verbal description of
the contents of the video clip appears in a text window in the lower left-hand
corner of the screen. Descriptions consist of two or three sentences of
information not evident from the still image: for example, information about
what occurs at a later point in the video clip.
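The arithmetic that lets the title list page the browser and position the cursor follows directly from the layout. A sketch under the 36-per-page, six-column grid just described (the function itself is ours):

```python
def locate_picon(index, per_page=36, cols=6):
    """Map a picon's flat index in a collection to the browser page
    it appears on and the row/column where the cursor should land."""
    page, offset = divmod(index, per_page)
    row, col = divmod(offset, cols)
    return page, row, col

print(locate_picon(40))  # -> (1, 0, 4): second page, first row, fifth column
```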
A clip is retrieved from the laserdisc (or, in the future, the network) if
and when the user expresses interest by clicking on its picon. Typically, a
picon uses at least two orders of magnitude less storage than a video clip, and
can be retrieved and displayed in tenths of a second, versus tens of seconds.
The user can view and compare 36 picons at a time, adding an additional order
of magnitude improvement in performance. Our experience is that "many are
called, but few are chosen" -- that is, users will inspect, and quickly
reject, many objects before finding one they wish to view in detail. Note that
even if networks could transmit video in zero time, at zero cost, the human
efficiency of this sort of browsing would still hold and would remain an
important consideration in the design of a visual browser.
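These savings can be made concrete with rough numbers. The absolute sizes below are assumptions; only the two-orders-of-magnitude ratio comes from the text:

```python
# Back-of-envelope cost of browsing one full page of 36 objects.
clip_bytes = 100_000_000          # one digitized clip (assumed size)
picon_bytes = clip_bytes // 100   # picon: two orders of magnitude smaller

inspected = 36   # picons examined on one browser page
chosen = 1       # clips the user actually views in detail

naive_cost = inspected * clip_bytes                         # fetch every clip
picon_cost = inspected * picon_bytes + chosen * clip_bytes  # browse, fetch one

print(naive_cost // picon_cost)  # -> 26: roughly 26x less data moved
```

The ratio grows with the number of objects inspected per object chosen, which is why "many are called, but few are chosen" browsing benefits most.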
Picons are examples of both spatial and temporal compression: spatial
compression since they are small representations of a two-dimensional image,
and temporal compression since they are single frames selected from a
three-dimensional linear sequence of multiple frames. The short verbal
description of the contents of the video clip is another kind of compression: a
conceptual compression. This kind of design makes the browser compact, yet
highly informative. It provides an information-rich environment that can
support students in the sometimes tedious, difficult task of searching through
a database for specific information and/or video clips.
To perform the same task in the future, users will most likely need even
more information than the KCM picon browser provides. Even now it takes more
work to navigate through a multimedia application than to navigate through a
book, and the multimedia database of the future will be far more complex: more
like a library than a book. Navigating through such an information space will
be a meaningless, mindless experience unless the user can construct a coherent
and meaningful representation of the information: a cognitive macrostructure.14
To illustrate the importance of macrostructures, imagine two learning
experiments. In Experiment A, a rat is in a cage. It presses on a key, opens a
door, and learns to run through a maze. In Experiment B, a student is using a
computer. S/he clicks on a mouse, opens a window, and learns to navigate
through a database. Cognitive scientists like to think that the student is
doing something very different from the rat. But if the student is moving from
one point in an information space to another without constructing a coherent
representation of the content -- mindlessly navigating through a maze -- then
the differences may be less important than the similarities. After all, rats
are competent navigators: they can learn to navigate, just as students can. But
learning to navigate through an information space is not the same as learning
about the information space: it is certainly not the same as learning enough to
construct a conceptual macrostructure of the information space. This ability,
like the ability to recognize complex visual patterns, is uniquely human.
The theoretical basis for macrostructures has been worked out by cognitive
scientists principally in terms of discourse processing, but the theory applies
to comprehension and recall of sequences of events and images, as well as
words.15 Empirical research based on the discourse processing model shows that
people use macrostructures to help in comprehension and recall of large amounts
of information presented in text.16 This research -- what we might refer to as
human-book interaction research -- suggests that skill in constructing
macrostructures is important in the comprehension and recall of text, and that
skill in designing effective representations of macrostructure is important in
the design of usable texts. The fact that macrostructures help us understand
and reduce large amounts of information to their gist makes macrostructure
theory relevant to issues in interface design, in general; the fact that the
basic principles apply to the meaning of information and not to its form
(verbal vs. visual) makes them relevant to multimedia interface design, in
particular.
Constructing macrostructures is difficult, even in the context of linear
text. It is likely to be far more difficult in the context of interactive
multimedia, which sometimes has the characteristic of being a collection of
"factoids."17 In order to ensure that users learn as they
"cybersurf", interface designers need to find ways of providing
conceptual tools that will facilitate the construction of macrostructures. This
will be particularly important in educational or instructional contexts, where
students will be expected to learn new material by using databases in
unfamiliar domains.
Representations of macrostructure are a form of conceptual compression, and
vice versa. The verbal descriptions that accompany picons in the KCM picon
browser are, therefore, a simple form of macrostructure: a condensation. Three
kinds of compressions can be used in interface design, then, to provide users
with the maximum amount of visual information in the minimum amount of space:
spatial compression (a small still image), temporal compression (a single frame
from a film); and conceptual compression (a summary of conceptual
content).
Definitions are another form of macrostructure, for they provide a summary
of the meaning of a concept or category. When definitions apply to visual
categories, they can be rendered in visual form. For example, the "visual
definition" of "red" can be rendered as the color red. When
definitions apply to visual sequences of information, they can be rendered as
"visual definitional sequences". For example, definitions for the
concept of speed or for the category of time lapse photography can be rendered
more easily in visual sequences than in still images or words. Visual
definitional sequences are one of several conceptual tools that can facilitate
the construction of macrostructures in the context of interactive multimedia.
"Understanding Visualization" begins with a set of
visual definitional sequences for a set of categories that relate to the visible
attributes of objects.18 The purpose of the categories is to provide users with
a way of parsing visual information, so that they can then intelligently search
a database of visual information. Since the categories can be used for this
purpose only if users understand what they mean, they need to be defined, and
since they are categories for dynamic visual information, they are more easily
defined in visual sequences than in still images or words. The sequences were
designed and developed by Marks, based on a set of categories developed by
Davis.
The "Understanding Visualization" prototype was developed as part
of an undergraduate curriculum on visualization in the arts and sciences. The
prototype was designed to be a front-end for a database of information about
the evolution of visual rendering techniques and technologies in the arts and
sciences. What are visual rendering techniques and technologies? Consider
perspective, which is a visible attribute of objects. Knowledge of perspective
is not innate. The principles governing the construction of one-point
perspective, for example, were not discovered until the Renaissance, when they
were introduced in the art of Brunelleschi, Donatello, Masaccio, and others,
and articulated by the Italian mathematician and painter, Alberti.19 Thus,
knowledge of perspective and visual rendering techniques related to perspective
-- how to paint a three-dimensional object on a two-dimensional canvas so that
it looks three-dimensional -- has evolved over time.
Perspective is one of the four conceptual categories of visual rendering
techniques and technologies at the core of the "Understanding
Visualization" prototype. The other categories are appearance, dimension,
and measurement. In order to use these high-level concepts to understand
visualization and to intelligently search through a database of visual
information, users need to understand not only the meaning of the categories,
but their meaning in the context of visual information. The visual definitional
sequences were designed to provide this foundation of understanding. To
illustrate and "define" perspective, for example, a visual sequence
shows simultaneous, multiple perspectives on the same object. The perspective
(camera angle) shifts from outside looking in, to inside looking out, and from
right side up to upside down.
The high-level categories are divided into several sub-categories, as well:
for appearance -- shape, color, contrast, texture, and opacity; for dimension
-- time, speed, and process; for measurement -- relative size and scale; and
for perspective -- vantage point and perspective. One of the reasons for using
visual sequences, rather than still images, is that sequences can illustrate
how concepts relate to one another. To illustrate the various aspects of
appearance, for example, visual sequences show alternately changing shapes,
colors, contrasts, textures, and opacities. One sequence shows the underwater
view of a jellyfish, with varying opacities as the relative positions of the
light source and the camera vary over time. Another shows a sequence of Landsat
photographs of Manhattan taken at different distances and illustrates how shape
changes with distance. To illustrate dimension, time lapse sequences show how
time, speed, and process can co-vary.
Like the visual menu for Man Ray and the visual browser for KCM, these
visual definitional sequences were designed to exploit the power of the human
visual system to recognize and reason about the visual environment, and to
provide the maximum amount of information possible in the minimum amount of
space. As the two-dimensional interface opens a window onto an increasingly
large and complex three-dimensional information space, the need for
highly informative, efficient visual interfaces of this kind will no doubt
increase. The current state of the art suggests that interface designers do not
yet take into account the wide spectrum of users who will need to use the
information superhighway if it is, in fact, to become a superhighway.
1. J. Sowa. "Multimedia Dialog Manager." Talk presented at IBM
Research T.J. Watson Research Center. 1993.
2. L.D. Harmon. "The Recognition of Faces." In Held, R. (ed.).
Image, Object, and Illusion. pp. 100-112. W.H. Freeman, San Francisco, CA.
1974.
3. S. Pinker. Visual Cognition. MIT Press, Cambridge, MA. 1985.
4. E.J. Farrell (ed.). IBM Journal of Research and Development 35, 1/2.
Special issue on "Visual Interpretation of Complex Data." 1991.
5. W.I. Grosky and R. Mehrotra (eds.). Computer (IEEE) 22, 12. Special issue
on "Image Database Management." 1989.
6. P. Parisi. "The Picture Exchange." Wired 2, 1. p. 32. 1994.
7. N.C. Shu. Visual Programming. Van Nostrand Reinhold, New York, NY. 1988.
8. J.R. Anderson. The Architecture of Cognition. Harvard University Press,
Cambridge, MA. 1983.
9. M.E. Hodges and R.M. Sasnett. Multimedia Computing: Case Studies from MIT
Project Athena. Addison-Wesley Publishing Company, Reading, MA. 1993.
10. Man Ray's Paris Portraits: 1921-39. Foreword by T. Baum. Middendorf
Publications, Washington, D.C. 1989.
11. R. Mack, P. Malkin, D. Collins, M. Utpat, M. Laff, J. Richards, and L.
Marks. "Smalltalk prototyping of CUA 1991 workplace and multimedia
information management concepts." Research Report, IBM Research Division,
T.J. Watson Research Center (in progress).
12. P. Malkin. "The KCM Distributed Database Management System."
Research Report, IBM Research Division, T.J. Watson Research Center. 1992.
13. D. Collins. Designing Object-Oriented User Interfaces. Benjamin/Cummings
Publishing Company, Redwood City, CA. 1995 (in press).
14. L. Marks. "Semantic macrostructures and the conceptual shape of
multimedia." (Dissertation, in progress).
15. T.A. van Dijk. Macrostructures: An Interdisciplinary Study of Global
Structures in Discourse, Interaction, and Cognition. Lawrence Erlbaum
Associates, Hillsdale, NJ. 1980.
16. T.A. van Dijk and W. Kintsch. Strategies of Discourse Comprehension.
Academic Press, NY. 1983.
17. A.C. Kay. "Computers, networks, and education." Scientific
American. 138-148. September, 1991.
18. L. Marks and B. Davis. Integrative Multimedia Design. MIT Press,
Cambridge, MA. 1995 (in press).
19. L.B. Alberti. On Painting. Yale University Press, New Haven, CT. 1966 [1436].