Usachev V.A., Minchuk M.V.
Donetsk National University of Economics and Trade named after Mikhail
Tugan-Baranovsky
Smart Media Management
A system
is called ‘smart’ if a user perceives the actions and reactions of the system
as being smart. Media is therefore managed smartly if a computer system helps a
user to perform an extensive set of operations on a large media database
quickly, efficiently, and conveniently. Such operations include searching,
browsing, manipulating, sharing, and reusing.
The more the computer knows about the media it manages, the smarter it
can be. Thus, algorithms that are capable of extracting semantic information
automatically from media are an important part of a smart media-management
system. As part of this effort, our lab at Intel Corp. (Santa Clara, CA) is
focusing on tasks such as reliable shot detection; text localization and text
segmentation in images, web pages, and videos; and automatic semantic labeling
of images.
A shot is commonly defined as an uninterrupted recording of an event or
locale. Any video sequence consists of one or more shots joined by some kind
of transition effect. Detecting shot boundaries thus means recovering these
elementary video units, which in turn provide the basis for nearly all
existing video abstraction and high-level video segmentation algorithms. In
addition, during video production each transition type is chosen carefully in
order to support the content and context of the video sequences; therefore,
automatically recovering all their positions and types may help the computer to
deduce high-level semantics. For instance, feature films often use dissolves to
convey a passage of time. Dissolves also occur much more often in feature
films, documentaries, and biographical and scenic video material than in
newscasts, sports, comedies, and other shows. The opposite is true for wipes,
in which a line moving across the screen marks the transition from one scene to
the next. Therefore, automatically detecting transitions and their types can
help identify the genre of a video.
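A detection algorithm is beyond the scope of this overview, but a common
baseline for hard-cut detection compares color histograms of consecutive
frames. The following Python sketch illustrates the idea; the frame
representation (numpy arrays) and the threshold value are illustrative
assumptions rather than the method used in our system.

    import numpy as np

    def color_histogram(frame, bins=64):
        # Per-channel intensity histogram, normalized to sum to 1.
        counts = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
                  for c in range(frame.shape[-1])]
        h = np.concatenate(counts).astype(float)
        return h / h.sum()

    def detect_hard_cuts(frames, threshold=0.5):
        # Flag a hard cut wherever the L1 distance between histograms of
        # consecutive frames exceeds the (illustrative) threshold.
        cuts = []
        prev = color_histogram(frames[0])
        for i in range(1, len(frames)):
            cur = color_histogram(frames[i])
            if np.abs(cur - prev).sum() > threshold:  # distance lies in [0, 2]
                cuts.append(i)
            prev = cur
        return cuts

Gradual transitions such as fades and dissolves change the histograms only
slowly from frame to frame and defeat this simple test, which is why
specialized detectors are needed.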
A recent review of the state of the art in automatic shot-boundary
detection emphasizes algorithms that specialize in detecting
specific types of transitions such as hard cuts, fades, and dissolves. In a
fade, the scene gradually diminishes to a black screen for several seconds;
when a scene dissolves, it fades as the next scene becomes clearer, not to
black as a true fade does. Today's cutting-edge systems can detect hard cuts
and fades with high hit rates of 99% and 82% and low false-alarm rates of 1%
and 18%, respectively. Dissolves are more difficult to detect, and the best
approaches report hit and false-alarm rates of 75% and 16% on a representative
video test set.
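One common reading of such hit and false-alarm figures is that the hit rate
counts true transitions matched by some detection, while the false-alarm rate
counts detections matching no true transition. A minimal sketch, with the
frame-matching tolerance as an assumed parameter:

    def hit_rate(detected, ground_truth, tolerance=2):
        # Fraction of true transition positions matched by some detection
        # within +/- tolerance frames.
        hits = sum(any(abs(d - t) <= tolerance for d in detected)
                   for t in ground_truth)
        return hits / len(ground_truth)

    def false_alarm_rate(detected, ground_truth, tolerance=2):
        # Fraction of detections that match no true transition.
        misses = sum(all(abs(d - t) > tolerance for t in ground_truth)
                     for d in detected)
        return misses / len(detected) if detected else 0.0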
Extracting truly high-level semantics from images and videos remains, in
most cases, an unsolved problem. One of the few exceptions is the extraction
of text in complex backgrounds and cluttered scenes. Several researchers have
recently developed novel algorithms for detecting, segmenting, and recognizing
such text occurrences [2]. These
extracted text occurrences provide a valuable source of high-level semantics
for indexing and retrieval. For instance, text extraction enables users of a
video database to query for all movies featuring John Wayne or produced by
Steven Spielberg. Text extraction can also be used to jump to news stories
about a specific topic, since captions in newscasts often condense the
underlying news story.
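As a toy illustration of this kind of retrieval, and not a component of the
system described here, recognized text strings can feed a simple inverted
index; every name below is hypothetical.

    from collections import defaultdict

    def build_text_index(extracted_text):
        # extracted_text maps a video id to the list of text strings
        # recognized in that video (credits, captions, and so on).
        index = defaultdict(set)
        for video_id, texts in extracted_text.items():
            for text in texts:
                for word in text.lower().split():
                    index[word].add(video_id)
        return index

    def query(index, phrase):
        # Return the ids of videos whose extracted text contains every
        # word of the query phrase.
        word_sets = [index.get(w, set()) for w in phrase.lower().split()]
        return set.intersection(*word_sets) if word_sets else set()

For example, query(index, "john wayne") returns every video whose recognized
credits mention both words.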
Detecting, segmenting, and recognizing text in the nontext parts of web
pages is also an important operation, since more and more web pages present
text in images. Existing document-based text segmentation and recognition
algorithms cannot extract such text occurrences because of the potentially
complex backgrounds and the wide variety of text colors used. The new
algorithms allow users to index the content of image-rich web pages properly.
Automatic text segmentation and text recognition might also help in automatic
conversion of web pages designed for large monitors to small LCD displays of
appliances, since the textual content in images can be retrieved.
Our latest text segmentation method is able not only to locate text
occurrences and segment them into large binary images, but also to label each
pixel of an image or video as text or nontext. Thus, our
text detection and text segmentation methods can be used for object-based video
encoding. Object-based video encoding is known to achieve a much better video
quality at a fixed bit rate compared with existing compression technologies. In
most cases, however, the problem of extracting objects automatically is not
solved yet. Our text localization and text segmentation algorithms solve this
problem for text occurrences in videos. Using this technique, video encoded
as multiple video object planes (VOPs) achieved a peak signal-to-noise ratio
about 1.5 dB higher than the same content encoded as a single MPEG-4 object.
Thus, encoding the text lines as rigid foreground objects and the rest of the
video separately yields much better visual quality.
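For reference, the peak signal-to-noise ratio quoted above is computed per
frame from the mean squared error between the reference frame and its decoded
version; this is the standard definition, not code from our encoder.

    import numpy as np

    def psnr(reference, decoded, peak=255.0):
        # Peak signal-to-noise ratio in dB between a reference frame and
        # its decoded counterpart (numpy arrays of equal shape).
        mse = np.mean((reference.astype(float) - decoded.astype(float)) ** 2)
        if mse == 0:
            return float("inf")  # identical frames
        return 10.0 * np.log10(peak ** 2 / mse)

A gain of 1.5 dB at a fixed bit rate therefore means a measurably lower
reconstruction error for the same number of bits.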
Although much research has been published on extraction of low-level
features from images and videos, only recently has the focus shifted to
exploiting low-level features to classify images and videos automatically into
semantically meaningful and broad categories. Examples of broad and
general-purpose semantic classes are outdoor versus indoor scenes and city
versus landscape scenes. In one of our media indexing research projects, we
crawled about 300,000 images from the web. After carefully browsing those
images, we defined a set of broad, general-purpose categories.
Although it uses only simple, low-level features, such as an image's overall
color diversity, its average noise level, and the
distribution of text line positions and sizes, our classification algorithm
achieved an accuracy of 97.3% in separating photo-like images from graphical
images on a large image database. In the subset of photo-like images, the
algorithm could separate true photos from ray-traced/rendered images with an
accuracy of 87.3%, while the subset of graphical images was successfully
partitioned into presentation slides and comics with an accuracy of 93.2%.
Sample images illustrating the chaos before and the order after
classification are shown in figure 2 [5]. We
are now working to increase the number of categories that can be classified
automatically and will have to explore how joint classification can be done
accurately and efficiently.
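The features are named above, but their computation is not; a minimal sketch
of two of them, assuming RGB images stored as numpy arrays and using
illustrative definitions, might look like this:

    import numpy as np

    def color_diversity(image, bins=32):
        # Fraction of occupied bins in a 3D color histogram: graphics tend
        # to use few distinct colors, photos many.
        hist, _ = np.histogramdd(image.reshape(-1, 3),
                                 bins=(bins, bins, bins),
                                 range=((0, 255),) * 3)
        return np.count_nonzero(hist) / hist.size

    def noise_level(image):
        # Crude noise estimate: mean absolute difference between
        # horizontally adjacent pixels of the gray-scale image.
        gray = image.mean(axis=-1)
        return float(np.abs(np.diff(gray, axis=1)).mean())

A simple threshold or a standard classifier over such features is then enough
to separate broad classes such as photos and graphics.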