Usachev V.A., Minchuk M.V.
Donetsk National University of Economics and Trade named after Mikhail
Tugan-Baranovsky
Smart Media Management
A system
is called ‘smart’ if a user perceives the actions and reactions of the system
as being smart. Media is therefore managed smartly if a computer system helps a
user to perform an extensive set of operations on a large media database
quickly, efficiently, and conveniently. Such operations include searching,
browsing, manipulating, sharing, and reusing.
The more the computer knows about the media it manages, the smarter it
can be. Thus, algorithms that are capable of extracting semantic information
automatically from media are an important part of a smart media-management
system. As part of this effort, our lab at Intel Corp. (Santa Clara, CA) is
focusing on tasks such as reliable shot detection; text localization and text
segmentation in images, web pages, and videos; and automatic semantic labeling
of images.
A shot is commonly defined as an uninterrupted recording of an event or
locale. Any video sequence consists of one or more shots joined by some kind
of transition effect. Detecting shot boundaries thus means recovering these
elementary video units, which in turn provide the basis for nearly all
existing video abstraction and high-level video segmentation algorithms. In
addition, during video production each transition type is chosen carefully in
order to support the content and context of the video sequences; therefore,
automatically recovering all their positions and types may help the computer to
deduce high-level semantics. For instance, feature films often use dissolves to
convey a passage of time. Dissolves also occur much more often in feature
films, documentaries, and biographical and scenic video material than in
newscasts, sports, comedies, and other shows. The opposite is true for wipes,
in which a line moving across the screen marks the transition from one scene to
the next. Therefore, automatically detecting transitions and their types can
help identify the genre of a video.
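A detection algorithm is beyond the scope of this overview, but a common
baseline for hard-cut detection compares color histograms of consecutive
frames. The following Python sketch illustrates the idea; the frame
representation (numpy arrays) and the threshold value are illustrative
assumptions rather than the method used in our system.

    import numpy as np

    def color_histogram(frame, bins=64):
        # Per-channel intensity histogram, normalized to sum to 1.
        counts = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
                  for c in range(frame.shape[-1])]
        h = np.concatenate(counts).astype(float)
        return h / h.sum()

    def detect_hard_cuts(frames, threshold=0.5):
        # Flag a hard cut wherever the L1 distance between histograms of
        # consecutive frames exceeds the (illustrative) threshold.
        cuts = []
        prev = color_histogram(frames[0])
        for i in range(1, len(frames)):
            cur = color_histogram(frames[i])
            if np.abs(cur - prev).sum() > threshold:  # distance lies in [0, 2]
                cuts.append(i)
            prev = cur
        return cuts

Gradual transitions such as fades and dissolves change the histograms only
slowly from frame to frame and defeat this simple test, which is why
specialized detectors are needed.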
A recent review of the state of the art in automatic shot-boundary
detection emphasizes algorithms that specialize in detecting
specific types of transitions such as hard cuts, fades, and dissolves. In a
fade, the scene gradually diminishes to a black screen for several seconds;
when a scene dissolves, it fades as the next scene becomes clearer, not to
black as a true fade does. Today's cutting-edge systems can detect hard cuts
and fades with high hit rates of 99% and 82% and low false-alarm rates of 1%
and 18%, respectively. Dissolves are more difficult to detect, and the best
approaches report hit and false-alarm rates of 75% and 16% on a representative
video test set.
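One common reading of such hit and false-alarm figures is that the hit rate
counts true transitions matched by some detection, while the false-alarm rate
counts detections matching no true transition. A minimal sketch, with the
frame-matching tolerance as an assumed parameter:

    def hit_rate(detected, ground_truth, tolerance=2):
        # Fraction of true transition positions matched by some detection
        # within +/- tolerance frames.
        hits = sum(any(abs(d - t) <= tolerance for d in detected)
                   for t in ground_truth)
        return hits / len(ground_truth)

    def false_alarm_rate(detected, ground_truth, tolerance=2):
        # Fraction of detections that match no true transition.
        misses = sum(all(abs(d - t) > tolerance for t in ground_truth)
                     for d in detected)
        return misses / len(detected) if detected else 0.0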
Extracting truly high-level semantics from images and videos remains, in
most cases, an unsolved problem. One of the few exceptions is the extraction
of text in complex backgrounds and cluttered scenes. Several researchers have
recently developed novel algorithms for detecting, segmenting, and recognizing
such text occurrences [2]. These
extracted text occurrences provide a valuable source of high-level semantics
for indexing and retrieval. For instance, text extraction enables users of a
video database to query for all movies featuring John Wayne or produced by
Steven Spielberg. Text extraction can also be used to jump to news stories
about a specific topic, since captions in newscasts often condense the
underlying news story.
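As a toy illustration of this kind of retrieval, and not a component of the
system described here, recognized text strings can feed a simple inverted
index; every name below is hypothetical.

    from collections import defaultdict

    def build_text_index(extracted_text):
        # extracted_text maps a video id to the list of text strings
        # recognized in that video (credits, captions, and so on).
        index = defaultdict(set)
        for video_id, texts in extracted_text.items():
            for text in texts:
                for word in text.lower().split():
                    index[word].add(video_id)
        return index

    def query(index, phrase):
        # Return the ids of videos whose extracted text contains every
        # word of the query phrase.
        word_sets = [index.get(w, set()) for w in phrase.lower().split()]
        return set.intersection(*word_sets) if word_sets else set()

For example, query(index, "john wayne") returns every video whose recognized
credits mention both words.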
Detecting, segmenting, and recognizing text in the nontext parts of web
pages is also an important operation, since more and more web pages present
text in images. Existing document-based text segmentation and recognition
algorithms cannot extract such text occurrences because of the potentially
complex backgrounds and the wide variety of text colors used. The new
algorithms allow users to index the content of image-rich web pages properly.
Automatic text segmentation and text recognition might also help in automatic
conversion of web pages designed for large monitors to small LCD displays of
appliances, since the textual content in images can be retrieved.
Our latest text segmentation method is able not only to locate text
occurrences and segment them into large binary images, but also to label each
pixel of an image or video as text or nontext. Thus, our
text detection and text segmentation methods can be used for object-based video
encoding. Object-based video encoding is known to achieve a much better video
quality at a fixed bit rate compared with existing compression technologies. In
most cases, however, the problem of extracting objects automatically is not
solved yet. Our text localization and text segmentation algorithms solve this
problem for text occurrences in videos. Using this technique, video encoded
as multiple video object planes (VOPs) achieved a peak signal-to-noise ratio
about 1.5 dB higher than the same content encoded as a single MPEG-4 object.
Thus, encoding the text lines as rigid foreground objects and the rest of the
video separately yields much better visual quality.
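For reference, the peak signal-to-noise ratio quoted above is computed per
frame from the mean squared error between the reference frame and its decoded
version; this is the standard definition, not code from our encoder.

    import numpy as np

    def psnr(reference, decoded, peak=255.0):
        # Peak signal-to-noise ratio in dB between a reference frame and
        # its decoded counterpart (numpy arrays of equal shape).
        mse = np.mean((reference.astype(float) - decoded.astype(float)) ** 2)
        if mse == 0:
            return float("inf")  # identical frames
        return 10.0 * np.log10(peak ** 2 / mse)

A gain of 1.5 dB at a fixed bit rate therefore means a measurably lower
reconstruction error for the same number of bits.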
Although much research has been published on extraction of low-level
features from images and videos, only recently has the focus shifted to
exploiting low-level features to classify images and videos automatically into
semantically meaningful and broad categories. Examples of broad and
general-purpose semantic classes are outdoor versus indoor scenes and city
versus landscape scenes. In one of our media indexing research projects, we
crawled about 300,000 images from the web. After carefully browsing those
images, we defined a set of broad, general-purpose categories.
Although it uses only simple, low-level features, such as an image's overall
color diversity, its average noise level, and the
distribution of text line positions and sizes, our classification algorithm
achieved an accuracy of 97.3% in separating photo-like images from graphical
images on a large image database. In the subset of photo-like images, the
algorithm could separate true photos from ray-traced/rendered images with an
accuracy of 87.3%, while the subset of graphical images was successfully
partitioned into presentation slides and comics with an accuracy of 93.2%.
Sample images illustrating the chaos before and the order after
classification are shown in figure 2 [5]. We
are now working to increase the number of categories that can be classified
automatically and will have to explore how joint classification can be done
accurately and efficiently.
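The features are named above, but their computation is not; a minimal sketch
of two of them, assuming RGB images stored as numpy arrays and using
illustrative definitions, might look like this:

    import numpy as np

    def color_diversity(image, bins=32):
        # Fraction of occupied bins in a 3D color histogram: graphics tend
        # to use few distinct colors, photos many.
        hist, _ = np.histogramdd(image.reshape(-1, 3),
                                 bins=(bins, bins, bins),
                                 range=((0, 255),) * 3)
        return np.count_nonzero(hist) / hist.size

    def noise_level(image):
        # Crude noise estimate: mean absolute difference between
        # horizontally adjacent pixels of the gray-scale image.
        gray = image.mean(axis=-1)
        return float(np.abs(np.diff(gray, axis=1)).mean())

A simple threshold or a standard classifier over such features is then enough
to separate broad classes such as photos and graphics.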