Multimodal mid-level representations for semantic analysis of broadcast video

Duan, Lingyu

Title: Multimodal mid-level representations for semantic analysis of broadcast video
Creator: Duan, Lingyu
Relation: University of Newcastle Research Higher Degree Thesis
Resource Type: thesis
Date: 2008
Description: Research Doctorate - Doctor of Philosophy (PhD)
Description: This thesis investigates the problem of seeking multimodal mid-level representations for semantic analysis of broadcast video. The problem is of interest because humans tend to use high-level semantic concepts when querying and browsing ever-increasing multimedia databases, yet the generic low-level content metadata available from automated processing represents only perceived content, not its semantics. Multimodal mid-level representations are intermediate representations of multimedia signals that make various kinds of knowledge explicit and that expose various kinds of constraints within the context and knowledge assumed by the analysis system. Semantic multimedia analysis tries to establish the links from the feature descriptors and the syntactic elements to the domain semantics. The goal of this thesis is to devise a mid-level representation framework for detecting semantics from broadcast video, using supervised and data-driven approaches to represent domain knowledge in a manner that facilitates inference, i.e., answering the questions asked by higher-level analysis. In our framework, we attempt to address three sub-problems: context-dependent feature extraction, semantic video shot classification, and integration of multimodal cues towards semantic analysis. We propose novel models for the representation of low-level multimedia features. We employ dominant modes in the feature space to characterize color and motion in a nonparametric manner. With the combined use of data-driven mode seeking and supervised learning, we are able to capture contextual information of broadcast video and yield semantically meaningful color and motion features. We present the novel concept of semantic video shot classes as an effective approach to reverse engineering the broadcast video capturing and editing processes.
Such concepts link the computational representations of low-level multimedia features with video shot size and the main subject within a shot in the broadcast video stream. The linking, subject to the domain constraints, is achieved by statistical learning. We develop solutions for detecting sports events and classifying commercial spots from broadcast video streams. This is realized by integrating multiple modalities, in particular text-based external resources. The alignment across modalities is based on semantic video shot classes. With multimodal mid-level representations, we are able to automatically extract rich semantics from sports programs and commercial spots with promising accuracy. These findings demonstrate the potential of our framework for constructing mid-level representations to narrow the semantic gap, and the framework shows good prospects for adaptation to new content domains.
Subject: semantics; broadcast video; retrieval; sports; multimodal; commercial; context; representation
Identifier: http://hdl.handle.net/1959.13/25819
Identifier: uon:744
Language: eng
Full Text

File	Description	Size	Format
ATTACHMENT01	Abstract	65 KB	Adobe Acrobat PDF
ATTACHMENT02	Thesis	3 MB	Adobe Acrobat PDF