Moving Picture Experts Group (MPEG)

The Moving Picture Experts Group (often abbreviated as MPEG) is a working group of ISO/IEC in charge of development of international standards for compression, decompression, processing, and coded representation of moving pictures, audio, and their combination, in order to satisfy a wide variety of applications.

Introduction

H. Nyquist [1] and W. R. Bennett [2] laid the foundations of digital signal processing, the former by establishing the conditions for statistical equivalence between time-continuous and sampled signals, and the latter by setting statistical bounds to errors for quantised (so-called Pulse Code Modulation or PCM) signals, i.e. converted to a form suitable for handling by digital computing machines.

If analogue signals of primary interest to humans – audio and video – are converted to digital according to Nyquist’s and Bennett’s precepts (a process that will be henceforth called “digitisation”), very high bitrate PCM signals are obtained. Although “high” is a reflection of the technological times (the CD rate of 1.41 Mbit/s for stereo audio signals was exceedingly “high” in the early days of the internet, thus prompting users to adopt the highly efficient MP3 compression format, see later), the 216 Mbit/s of digital television is unmanageable even today in most open environments. This obstacle, along with the advantages to be gained by overcoming it, led to the creation of a new field of study: reduction of the bitrate of digitised audio and video signals, if possible without distortion, otherwise with a controlled distortion.

The first target application was in the speech area because of the drive started in the 1960s to digitise the telecommunication networks and because telephone speech has from the beginning been confined to the frequency band of 0.3 to 3.4 kHz and therefore yields a rather low bitrate. Sampling at a frequency of 8 kHz with 8 bits of precision (companded, i.e. non-linearly quantised) gives a data rate of 64 kbit/s, as enshrined in International Telecommunication Union, Telecommunication Standardisation Sector (ITU-T) Recommendation G.711 [3].
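The PCM rates quoted above follow from multiplying sampling frequency, bits per sample and number of channels. The sketch below reproduces them. The CD parameters (44.1 kHz, 16 bits, two channels) and the digital television sampling (13.5 MHz luminance plus two 6.75 MHz colour-difference signals at 8 bits, as in ITU-R BT.601 4:2:2) are well-known values assumed here rather than taken from the text.

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    """Return the raw PCM bitrate in bit/s."""
    return sample_rate_hz * bits_per_sample * channels

telephone = pcm_bitrate(8_000, 8)          # G.711 telephone speech
cd_audio = pcm_bitrate(44_100, 16, 2)      # assumed CD parameters
# Assumed BT.601 4:2:2 sampling: 13.5 MHz luma + 2 x 6.75 MHz chroma, 8 bits
digital_tv = pcm_bitrate(13_500_000 + 2 * 6_750_000, 8)

print(telephone, "bit/s")
print(cd_audio, "bit/s")
print(digital_tv, "bit/s")
```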

Various algorithms have been employed to compress speech signals. The most straightforward algorithms – DPCM (i.e. differential PCM) – were not particularly successful because of their reduced capability to compress down to 32 kbit/s – generally not enough to justify adoption of the technology in the network.
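The idea behind DPCM can be sketched in a few lines: each sample is predicted from the previously reconstructed sample and only the quantised prediction error is transmitted. The previous-sample predictor and the uniform quantiser step used below are illustrative assumptions, not the parameters of any particular Recommendation.

```python
def dpcm_encode(samples, step=4):
    """Encode integers as quantised differences from the previous reconstruction."""
    codes, prediction = [], 0
    for s in samples:
        code = round((s - prediction) / step)   # quantised prediction error
        codes.append(code)
        prediction += code * step               # track what the decoder will see
    return codes

def dpcm_decode(codes, step=4):
    """Rebuild the signal by accumulating the dequantised differences."""
    out, prediction = [], 0
    for code in codes:
        prediction += code * step
        out.append(prediction)
    return out

signal = [0, 3, 9, 14, 18, 20, 19, 15]
decoded = dpcm_decode(dpcm_encode(signal))
# The encoder predicts from its own reconstruction, so the error does not
# accumulate and stays within half a quantiser step.
assert all(abs(a - b) <= 2 for a, b in zip(signal, decoded))
```

The small differences between neighbouring samples need fewer bits than the samples themselves, which is where the bitrate reduction comes from.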

Digital video took longer to surface because the bitrate resulting from digitisation was 3 orders of magnitude larger. Still, ITU-T Recommendation H.120 applied DPCM to contiguous video samples within a video frame (hence called “intraframe coding”) and achieved further reduction by exploiting correlation between contiguous frames (hence called “interframe coding”), operating on a subsampled version of TV signals for videoconferencing. Thus the input bitrate of about 40 Mbit/s could be reduced to 1.5 or 2 Mbit/s. This system, too, was not particularly successful because the bitrate was still too high and the compression/decompression equipment too expensive.

In the 1980s many were working on video and audio coding. Nippon Hoso Kyokai (NHK) developed and deployed an innovative hybrid (analogue/digital) HDTV transmission system called MUSE that led the Europeans to devise their own solution called HD-MAC; ITU-T developed a new video compression Recommendation H.261 that applied intraframe Discrete Cosine Transform (DCT) coding with motion-compensated interframe prediction; RAI-Telettra and General Instrument developed and manufactured HDTV codecs at bitrates that were thought to be unachievable until then; Philips and RCA developed and manufactured systems for interactive video on compact disc (CD) called respectively CD-i and DVI; another branch of the ITU-T called CMTT studied a so-called “contribution” (i.e. “between studios”) codec; a group of European companies and institutions developed the Digital Audio Broadcasting (DAB) system specifications within the Eureka project EU 147 DAB.

One might have thought that a buoyant competitive market should have been left free to produce its own results.

Instead MPEG was established as a working group of the International Organisation for Standardisation (ISO) with the idea that the only way for digital audio and video to succeed, in a relatively short time, was based on a reference standard without the myriad technological barriers that had been imposed on analogue audio and video. The right time for that standard was toward the end of the 1980s because video and audio compression performance and VLSI implementability were heading for their first intersection sometime in the early 1990s.

MPEG-1

Interactive audio and video on CD was thought to be the first business case for the standard that was eventually called MPEG-1 [5]. The standard is organised in the following five parts:

Part 1 Systems
Part 2 Video
Part 3 Audio
Part 4 Conformance testing
Part 5 Software simulation

Systems (defined in part 1 of the standard) is a packet-based multiplexer that can carry m video streams and n audio streams, all with the same time base. The stream carries timing information so that the receiving device can reconstruct a faithful replica – within the accuracy enabled by the standard – of the information generated at the encoder.

Video (defined in part 2 of the standard) provides a powerful compression technique based on the following assumptions:

Video is a 3D array (x,y,t) of image samples, referred to as “pixels” where x (<M) represents the horizontal direction of the screen, y (<N) the vertical direction (from top to bottom) and t represents the time.

Pixels are organised in 8x8 spatial blocks
Four luminance blocks (arranged 2x2) and 2 chrominance blocks are grouped into a macroblock
A block is mapped to sixty-four DCT coefficients
A macroblock has one motion vector
Motion vectors and DCT coefficients are Variable Length Coded
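To illustrate the block transform, the sketch below computes a two-dimensional DCT-II of an 8x8 block directly from its definition, producing the sixty-four coefficients mentioned above. It is a plain, unoptimised illustration with orthonormal scaling; practical codecs use fast factorised transforms, and the function name is hypothetical.

```python
import math

N = 8  # block size used by MPEG-1 Video

def dct_8x8(block):
    """2D DCT-II (orthonormal) of an 8x8 block given as a list of 8 rows."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = c(u) * c(v) * total
    return coeffs

flat = [[100] * N for _ in range(N)]      # a uniformly grey block
coeffs = dct_8x8(flat)
# All energy of a flat block ends up in the single DC coefficient;
# the other 63 coefficients are zero up to rounding error.
print(round(coeffs[0][0], 3))
```

Because natural images are locally smooth, most of the energy of a typical block concentrates in a few low-frequency coefficients, which is what makes the subsequent variable length coding effective.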

MPEG-1 Video is a generic algorithm that can work with any parameter set. As this does not give enough guidance to build interoperable devices, MPEG-1 defines a Constrained Parameter Set satisfying the following conditions

M ≤768
N ≤576
no. of macroblocks/picture ≤396 (352x288/256)
no. of macroblocks/second ≤9900 (396x25)
Picture rate ≤30 Hz
Interpolated pictures ≤2
Bitrate ≤1856 kbit/s
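The limits listed above can be checked mechanically. The following sketch tests whether a set of coding parameters falls within the Constrained Parameter Set. It covers only the picture size, macroblock, picture rate and bitrate limits, a macroblock is taken to cover 16x16 pixels (consistent with the division by 256 above), and the function and parameter names are hypothetical.

```python
import math

def within_cps(width, height, picture_rate_hz, bitrate_kbit_s):
    """Check the size, rate and bitrate limits of the Constrained Parameter Set."""
    mb_per_picture = math.ceil(width / 16) * math.ceil(height / 16)
    mb_per_second = mb_per_picture * picture_rate_hz
    return (width <= 768
            and height <= 576
            and mb_per_picture <= 396
            and mb_per_second <= 9900
            and picture_rate_hz <= 30
            and bitrate_kbit_s <= 1856)

print(within_cps(352, 288, 25, 1150))   # 396 macroblocks per picture at 25 Hz
print(within_cps(352, 240, 30, 1150))   # 330 macroblocks per picture at 30 Hz
print(within_cps(720, 576, 25, 1150))   # 1620 macroblocks per picture
```

The first two parameter sets sit exactly on the macroblocks/second limit, while the third exceeds the macroblocks/picture limit even though its width and height are individually allowed.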

MPEG-1 Audio (defined in part 3 of the standard) includes three compatible versions called “layers” where

Layer I, a subband coding scheme, contains the basic mapping of the digital audio input into 32 sub-bands, fixed segmentation to format the data into blocks, a psychoacoustic model to determine the adaptive bit allocation, and quantisation using block companding and formatting;
Layer II, also a sub-band coding scheme, provides additional coding of bit allocation, scalefactors, samples, different framing;
Layer III, a hybrid sub-band-DCT coding scheme, introduces increased frequency resolution based on a hybrid filterbank; a nonuniform quantiser, adaptive segmentation and entropy coding of the quantised frequency samples are also utilized.

A “layer n” decoder is capable of decoding bitstreams of lower layers but not of higher layers. A reference MPEG-1 diagram is given in Figure 1.

Figure 1: MPEG-1 reference diagram.

“MPEG-1 stream decoder” is specified by Part 1, “Video decoder” is specified by Part 2 and “Audio decoder” is specified by Part 3.

Specifically, MPEG-1 standardises the syntax and semantics of the bitstream. In addition, only the decoding process is subject to the standard, while the encoding process and the decoder’s internal data representation are non-normative.

Additionally MPEG-1 has innovated the landscape of standards by providing

The first integrated audio-visual standard with Systems, Video and Audio specification
The first audio-visual standard defining the “receiver” and not the “transmitter”
The first video coding standard independent of video format (NTSC/PAL/SECAM)
The first standard jointly developed by all industries interested in audio and video
The first standard developed entirely in software
The first standard including a software implementation.

The performance of MPEG-1 Audio, as tested in the early 1990s, is transparency at 384 kbit/s (Layer I), at 256 kbit/s (Layer II) and at 192 kbit/s (Layer III), where “transparency” was defined by MPEG as a condition in which experts (so-called golden ears) are statistically unable to distinguish the original PCM stereo sound sampled at 48 kHz with 16 bits/sample from the coded version.
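These bitrates translate into compression ratios once the source rate is known. A small worked calculation for the 48 kHz, 16 bits/sample stereo source mentioned above (the variable names are illustrative):

```python
source_kbit_s = 48_000 * 16 * 2 / 1000      # 1536 kbit/s of stereo PCM

transparent_kbit_s = {"Layer I": 384, "Layer II": 256, "Layer III": 192}
ratios = {layer: source_kbit_s / rate
          for layer, rate in transparent_kbit_s.items()}

for layer, ratio in ratios.items():
    print(f"{layer}: {ratio:.0f}:1")
```

The three layers thus correspond to compression ratios of 4:1, 6:1 and 8:1 at the quoted transparency points.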

Early on, MPEG saw the benefit of developing a software implementation of the standard. Therefore Part 4 of the MPEG-1 standard is called “Conformance”. It provides the means to check that an instance of a decoder and that an instance of a bitstream conform to the standard.

Part 5 of MPEG-1 “Reference Software” contains the C implementation of encoders and decoders. It is to be noted that the encoders are not optimised (in quality and real-time performance). However, the encoders generate, and the decoders are capable of handling, conforming bitstreams. Some commercial implementations have reportedly been derived from part 5 of MPEG-1.

MPEG-2

MPEG-2 [6] was designed to be the standard enabling the digital transformation of the analogue television system designed half a century before. It comprises the following 10 standards (part 8 was not developed):

Part 1 Systems
Part 2 Video
Part 3 Audio
Part 4 Conformance testing
Part 5 Software simulation
Part 6 System extensions - DSM-CC
Part 7 Advanced Audio Coding
Part 8 VOID
Part 9 System extension RTI
Part 10 Conformance extension - DSM-CC
Part 11 IPMP on MPEG-2 Systems

Systems defines an entity called Packetised Elementary Stream (PES). This is a compressed stream combined with system level information and packetised for use in two types of MPEG-2 Systems streams

Program Stream (PS) combines one or more PESs which have a common time base, into a single stream (analogous to MPEG-1 Systems Multiplex). PS is designed for use in relatively error-free environments and is suitable for applications which may involve software processing. Program stream packets may be of variable and relatively great length.
Transport stream (TS) combines one or more PESs with one or more independent time bases into a single stream. Elementary streams that share a common timebase form a program. TS is designed for use in error-prone environments, such as storage or transmission in lossy or noisy media. TS packets are 188 bytes long.
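As an illustration of the fixed-length packet structure, the sketch below extracts a few header fields from a single 188-byte TS packet. The field layout (a 0x47 sync byte, a 13-bit packet identifier and a 4-bit continuity counter within the first four bytes) comes from the MPEG-2 Systems specification rather than from the text above, and the function name is hypothetical.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47

def parse_ts_header(packet: bytes) -> dict:
    """Decode part of the 4-byte header of one MPEG-2 Transport Stream packet."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a TS packet must be exactly 188 bytes")
    if packet[0] != SYNC_BYTE:
        raise ValueError("missing 0x47 sync byte")
    return {
        "transport_error": bool(packet[1] & 0x80),
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],   # 13-bit packet identifier
        "continuity_counter": packet[3] & 0x0F,
    }

# A synthetic packet: PID 0x0100, payload unit start set, counter 5.
packet = bytes([0x47, 0x41, 0x00, 0x15]) + bytes(184)
header = parse_ts_header(packet)
print(header)
```

Short, fixed-size packets with a recognisable sync byte let a receiver resynchronise quickly after transmission errors, which suits the error-prone environments TS is designed for.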

Video contains

Coding tools, i.e. particular functions required to achieve defined functionalities. Many MPEG-2 tools are drawn from MPEG-1 Video tools. Indeed if input video is progressive, one can say that MPEG-2 becomes MPEG-1. However there are also new tools, particularly for efficient compression of interlaced video and for different types of scalability (SNR, Spatial).
Profiles, i.e. groups of tools designed to satisfy major application domains while maximising interoperability between domains.

MPEG-2 Systems and Video were developed jointly with the ITU-T with the acronyms H.222 and H.262, respectively.

Audio provides a multichannel-compatible extension of MPEG-1/Audio in the sense that it is

Backward compatible: an MPEG-1/Audio decoder can decode the two channel components of an MPEG-2/Audio bitstream
Forward compatible: an MPEG-2/Audio decoder can decode an MPEG-1/Audio bitstream, of course by producing a two-channel sound.

The standard also contains technology to extend the stereo compression features of MPEG-1 Audio. Unfortunately the backward compatibility of MPEG-2 Audio with MPEG-1 Audio limits its performance.

To overcome this limitation MPEG developed part 7 Advanced Audio Coding (AAC) to provide a multichannel solution without the backward compatibility of Part 3. This employs a new algorithm to encode multichannel audio, providing improved performance that materialises as transparency at 128 kbit/s for a stereo signal. The coding gain is achieved through redundancy removal by means of a high-resolution transform and entropy coding, and irrelevancy removal by using a model of the human auditory system in connection with the coefficient quantization.

In addition to Conformance and Reference Software (parts 4 and 5, respectively), MPEG-2 also includes part 6 with the title Digital Storage Media Command and Control (DSM-CC) for device-to-device and device-to-network interaction and other standards. Figure 2 illustrates the main components of the standard.

Figure 2: MPEG-2 reference diagram.

“MPEG-2 stream decoder” is specified by Part 1, “Video decoder” by Part 2, “Audio decoder” by Part 3 and “Interaction” by Part 6.

MPEG-4

MPEG-4 [7] started as a standard for very low bitrate audio-visual coding, e.g. 10 kbit/s. Eventually MPEG-4 became that and a rather long list of other digital media technologies, some of which are

Systems
Scene description
Video coding
Audio coding
3D graphics coding
Synthetic audio coding
Transport interface
File Formats
Open Font Format
Symbolic Music Representation
3D Graphics Compression Model

MPEG-4 comprises 25 parts, some of which are still under development

Part 1 Systems
Part 2 Visual
Part 3 Audio
Part 4 Conformance testing
Part 5 Reference Software
Part 6 Delivery Multimedia Integration Framework
Part 7 Optimised software for MPEG-4 tools
Part 8 4 on IP framework
Part 9 Reference Hardware Description
Part 10 Advanced Video Coding
Part 11 Scene Description and Application Engine
Part 12 ISO Base Media File Format
Part 13 IPMP Extensions
Part 14 MP4 File Format
Part 15 AVC File Format
Part 16 Animation Framework eXtension (AFX)
Part 17 Streaming Text Format
Part 18 Font compression and streaming
Part 19 Synthesized Texture Stream
Part 20 Lightweight Application Scene Representation
Part 21 MPEG-J Extension for rendering
Part 22 Open Font Format
Part 23 Symbolic Music Representation
Part 24 Audio-System interaction
Part 25 3D Graphics Compression Model

Systems (part 1) provides the architecture of the standard and roughly corresponds to the Systems parts of the MPEG-1 and MPEG-2 standards.

Visual (part 2) contains a large number of video coding tools that are employed in two very popular profiles: Simple Profile (SP) and Advanced Simple Profile (ASP).

In 2001, MPEG teamed with the Video Coding Experts Group of the ITU-T and established a Joint Video Team (JVT) which developed a new generation video codec called Advanced Video Coding (AVC) as part 10 of MPEG-4. AVC has roughly twice the compression capability of MPEG-2 Video and MPEG-4 Visual. Subsequently AVC was extended with scalability functions yielding Scalable Video Coding (SVC). Currently AVC is being further extended with Multiview Video Coding (MVC) capabilities.

Audio contains a large set of coding tools through which it is possible to construct several audio and speech coding algorithms

MPEG-4 AAC, an extension of MPEG-2 AAC
Twin Vector Quantisation (VQ)
Speech coding based on Code Excited Linear Predictive (CELP) coding and on Parametric representation
Spectral Band Replication (SBR) technology to provide high quality audio at ever reduced bitrate, as in High Efficiency AAC (HE AAC)
Various forms of audio lossless coding.

Synthetic Audio, called “Structured Audio”, is included in part 3. It provides the means to code sound using structured descriptions that are interpreted by a Structured Audio decoder to perform music and sound-effect synthesis. The Structured Audio Tools are: Structured Audio Orchestra Language (SAOL) providing synthesis methods, Structured Audio Score Language (SASL/MIDI) providing control parameters and Structured Audio Sample Bank Format (SASBF) providing the actual sample data.

In addition to the usual Conformance and Reference Software (parts 4 and 5, respectively), MPEG-4 also includes Part 7 “Optimised software for MPEG-4 tools” that provides examples of reference software that not only implements the standard correctly but does so in optimised form, and Part 9 “Reference Hardware Description” where the reference software is written in VHSIC Hardware Description Language (VHDL) for synthesis of VLSI chips.

Part 6 “Delivery Multimedia Integration Framework” (DMIF) provides a standard interface to access various transport mechanisms.

Part 8 “4 on IP framework” complements the generic MPEG-4 RTP payload defined by IETF as RFC 3640 [8].

MPEG-1 and MPEG-2 assume that information in decoded form leaves the decoder as sequences of PCM samples but the standards are silent on what is done with them. Scene Description (part 11) provides technologies for the new functionality of “composing” different information elements in a “scene”.

The original technology is called Binary Format for MPEG-4 Scenes (BIFS) of which there exists a Java powered version called MPEG-J. A newer technology with similar functionalities is provided by Part 20 “Lightweight Application Scene Representation” (LASeR).

MPEG-4 provides standard solutions for coding of synthetic visual information for 3D graphics. These tools are specified in Part 2 (Face and Body Animation and 3D Mesh Compression), Part 11 (Interpolator Compression) and Part 16, a complete framework called Animation Framework eXtension (AFX) for efficiently coding the shape, texture and animation of interactive synthetic 3D objects. AFX attempts to unify MPEG-4’s tools related to 3D graphics.

An important component of AFX is 3D Mesh Coding to provide efficient encoding of 3-D polygonal meshes with

Incremental representation: to enable a decoder to reconstruct a number of faces in a mesh proportional to the number of bits in the bit stream that have been processed.
Error resilience: to enable a decoder to partially recover a mesh when subsets of the bit stream are missing and/or corrupted.
Level of Detail (LOD) scalability: to enable a decoder to reconstruct a simplified version of the original mesh containing a reduced number of vertices from a subset of the bit stream with the advantage of reducing the rendering time of objects which are distant from the viewer (LOD management) and enabling less powerful rendering engines to render the object at a reduced quality.

AFX introduces as well an advanced animation model for articulated models, a hierarchical representation of urban environments and several modern coding tools for 3D data.

Part 25 “3D Graphics Compression Model” specifies an architectural model able to accommodate third-party eXtensible Markup Language (XML) based description of scene graphs and graphics primitives with (potential) binarisation tools and with MPEG-4 3D Graphics Compression tools.

The ISO Base Media File Format (part 12) is designed to contain timed media information for a presentation in a flexible, extensible format that facilitates interchange, management, editing, and presentation of the media. These may be ‘local’ to the system containing the presentation, or may be via a network or other stream delivery mechanism. Part 14 “MP4 File Format” extends the File Format to cover the needs of MPEG-4 scenes while part 15 “AVC File Format” supports the storage of AVC and MVC bitstreams.
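Files in this family are built from a sequence of length-prefixed units called boxes. The sketch below walks the top-level boxes of a byte buffer. The box layout (a 32-bit big-endian size followed by a four-character type, with size 1 signalling a 64-bit size and size 0 meaning "to the end of the data") is taken from the ISO Base Media File Format specification, not from the text above; the function name is hypothetical and the sample data is synthetic.

```python
import struct

def iter_boxes(data: bytes):
    """Yield (type, payload) for each top-level box in an ISO BMFF buffer."""
    offset = 0
    while offset + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:                       # 64-bit "largesize" follows the type
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:                     # box extends to the end of the data
            size = len(data) - offset
        if size < header or offset + size > len(data):
            raise ValueError("malformed box")
        yield box_type.decode("ascii"), data[offset + header:offset + size]
        offset += size

# Two synthetic boxes: a 16-byte 'ftyp' box and an empty 'free' box.
sample = (struct.pack(">I4s8s", 16, b"ftyp", b"isom\x00\x00\x00\x00")
          + struct.pack(">I4s", 8, b"free"))
boxes = list(iter_boxes(sample))
print([(box_type, len(payload)) for box_type, payload in boxes])
```

Because a reader can skip any box whose type it does not recognise, new box types can be added without breaking existing parsers, which is one way the format achieves its extensibility.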

Streaming Text Format (part 17 of MPEG-4) defines text streams that are capable of carrying Third Generation Partnership Project (3GPP) Timed Text (specified in 3GPP TS 26.245). To transport the text streams, a flexible framing structure is specified that can be adapted to the various transport layers, such as RTP/UDP/IP and MPEG-2 Transport and Program Stream, for use in media such as broadcast and optical discs.

Among the remaining MPEG-4 technologies it is worth mentioning Open Font Format (part 22). MPEG received a request from rights holders to convert the widely adopted OpenType specification to an ISO standard. As is the rule with MPEG standards, the OpenType specification was converted to a Working Draft and then balloted through the ISO-specified process of Committee Draft (CD), Final Committee Draft (FCD) and Final Draft International Standard (FDIS) stages.

The figure below provides a conceptual diagram of the structure of an MPEG-4 decoder with the role played by the main MPEG-4 technologies.

Figure 3: MPEG-4 reference diagram.

With reference to the figure the parts of the MPEG-4 standard specify the blocks as follows:

Part 1 specifies “MPEG-4 stream decoder”
Part 2 specifies “Video decoder”
Part 3 specifies “Audio decoder”
Part 6 specifies “Interaction”
Parts 8, 12, 14 and 15 specify “Transport”
Parts 11 and 20 specify “Composition decoder” and “Composition”
Part 16 specifies “3DG decoder”
Part 17 specifies “Stream text decoder”
Parts 18 and 22 specify “Font decoder”
Part 19 specifies “Synthesised texture decoder”
Part 21 specifies “Rendering”

MPEG-7

With MPEG-7 [9] MPEG made a kind of departure from its previous audio and video compression standards because it addressed the issue of “describing features of multimedia content”. MPEG-7 provides the world’s most comprehensive set of audio-visual description tools, namely

A set of descriptors (D) that represent features that include the syntax and semantics of the feature representation
A set of Description Schemes (DS) that specify the structure and semantics of the relationships between D and DS components
A Description Definition Language (DDL), based on XML Schema with extensions, that specifies DSs and can be used to extend and modify existing DSs
A textual and binary encoding of Ds
System tools for multiplexing of descriptors, synchronization, transmission mechanisms, file formats, etc.

MPEG-7 is organised in 12 parts and is still structured in a way that is reminiscent of the earlier MPEG standards.

Part 1 Systems
Part 2 Description Definition Language
Part 3 Visual
Part 4 Audio
Part 5 Multimedia Description Schemes
Part 6 Reference Software
Part 7 Conformance
Part 8 Extraction and Use of MPEG-7 Descriptions
Part 9 Profiles
Part 10 Schema definition
Part 11 Profile schemas
Part 12 Query Format

Systems (part 1) specifies the means for binarising DDL data, a methodology for carrying descriptions as streams and the means for accessing and synchronously consuming data.

Description Definition Language (part 2) standardises a language to specify Description Schemes and Descriptors derived from XML Schema to express relations, object orientation, composition, partial instantiation, etc.

Visual (part 3) offers a broad range of visual descriptors

Grid layout (spatial structure)
Colour: Colour space, dominant colour, colour layout
Texture: Homogeneous texture, texture browsing
Shape: Contour-based shape
Motion: Motion activity

Audio (part 4) offers a broad range of audio descriptors

Audio Description Framework
Spoken Content DS
Timbre DS
Audio Independent Components
Melody
Sound Effects

Part 5 “Multimedia Description Schemes” (MDS) defines elements (Ds and DSs) that are generic (neither purely visual nor purely audio). This is a summary list

Basic Elements
Schema Tools
Content Description Tools
Structure Description Tools
Content Organization Description Tools
Navigation and Access Description Tools
User Interaction Description Tools.

Part 12 “Query Format” specifies the interface between a requester and a responder for multimedia content retrieval systems (e.g. MPEG-7 databases). This enables users to describe their search criteria with a set of precise input parameters and additionally allows users to specify a set of preferred output parameters to depict the returned result sets.

MPEG-21

In 1999, well before the Web 2.0 hype, MPEG started a project driven by the vision of a future where every human on Earth is potentially an element of a network involving billions of content providers, value adders, packagers, service providers, resellers, consumers etc. While many technologies were already available, it was clear that to make this future real there was a need for an infrastructure enabling electronic commerce of digital content.

At the basis of this project, soon called MPEG-21 [10], there are two key concepts:

Digital Item, a structured digital object with a standard representation, identification and metadata within the MPEG-21 framework and
User, any entity that interacts in the MPEG-21 environment or makes use of Digital Items.

MPEG-21 is a collection of seventeen standards whose integration enables Users to perform all functions on Digital Items that enable the realisation of the vision described above.

Part 1 Vision, Technologies and Strategy
Part 2 Digital Item Declaration
Part 3 Digital Item Identification and Description
Part 4 IPMP Components
Part 5 Rights Expression Language
Part 6 Rights Data Dictionary
Part 7 Digital Item Adaptation
Part 8 Reference Software
Part 9 File Format
Part 10 Digital Item Processing
Part 11 Evaluation Tools for Persistent Association
Part 12 Test Bed for MPEG-21 Resource Delivery
Part 13 VOID
Part 14 Conformance
Part 15 Event reporting
Part 16 Binary format
Part 17 Fragment Identification
Part 18 Digital Item Streaming

Part 1 Vision, Technologies and Strategy is a Technical Report, and lays down the scope and development plan of the project.

The foundational element of MPEG-21 is the definition of a structure that can flexibly accommodate the many components of a multimedia object. This includes, of course, the resources (media), but also identifiers, metadata, encryption keys, licenses etc. The specification of this structure is provided by Part 2 Digital Item Declaration (DID).

Identification of Digital Items is a key requirement in the digital space where everything must be uniquely and unambiguously identified in order to be managed. In MPEG-21 this function is provided by Part 3 Digital Item Identification (DII), a standard to handle identifiers in Digital Items.

A Digital Item can contain resources or even portions of a Digital Item that are protected. The component technologies that are needed to process those resources (i.e. to make them available in a form that can be processed by a machine) need to be standardised. This is done by Part 4 Intellectual Property Management and Protection (IPMP) Components. IPMP is the MPEG acronym for Digital Rights Management (DRM).

In the digital space, licences play a role similar to that of licences in the real world. The difference is that real-world licences are expressed in natural language and are understood by humans, while digital licences must be expressed in a form that can be processed by a machine. Part 5 Rights Expression Language (REL) provides the technology to express rights in a rich form that is comparable to the richness of the human language.

The language mentioned above is only capable of expressing the syntax of a rights expression but says nothing of the semantics of the “verbs”, e.g. copy, store, display etc., that are employed by the language (even though the MPEG REL provides the semantics of a few key verbs). A standard semantics for verbs commonly used in the media environment in general is given by Part 6 Rights Data Dictionary (RDD).

When a Digital Item and its resources are transported over the network it may be necessary to “adapt” (e.g. reduce in bitrate) them to varying conditions. When a Digital Item and its resources reach a device, the resources may need to be “adapted” (e.g. subsampled) to match, e.g., the device capabilities. Part 7 Digital Item Adaptation (DIA) specifies the syntax and semantics of the tools that may be used to assist in the adaptation of Digital Items, metadata and resources.

As for most other MPEG standards, MPEG-21 has a reference software implementation. This is provided by Part 8 Reference Software.

A Digital Item is an XML structure that can be moved from one device to another “as is”. However, it may be convenient to use a standard file format because in this case a device knows, by virtue of the definition of the file format itself, where specific Digital Item structures can be found. This is provided by Part 9 File Format.

A Digital Item is a static XML structure that contains all elements necessary to describe the resources contained in it, e.g. description of content, DRM information, etc. However, a Digital Item does not natively provide a way for a Digital Item creator to suggest how a user can interact with the Digital Item. Providing this additional information is the scope of Part 10 Digital Item Processing (DIP).

It is possible to establish associations – called Persistent Association Technologies (PAT) in MPEG-21 – between resources and certain metadata related to the resource using such technologies as “watermarking” and “fingerprinting”. As it is probably not necessary, and certainly premature at this stage, to standardise these association methods, Part 11 Evaluation Tools for Persistent Association provides the means to evaluate the performance of a given PAT to see how well it fulfils the requirements of the intended application. This, however, is a Technical Report, i.e. it is simply a guide for users.

A software test bed has been developed to enable experimentation with different means of resource delivery. The software is provided by Part 12 Test Bed for MPEG-21 Resource Delivery. This, however, is a Technical Report, i.e. it is simply a tool to help users experiment.

Conformance of an implementation is of course needed for MPEG-21 technologies as well. The purpose of Part 14 Conformance is to provide the necessary test methodologies and suites to be used to assess the conformity of a bitstream (typically an XML document) and a decoder (typically a parser) to the relevant MPEG-21 standard.

Certain application domains require a technology that can generate an event every time an action specified in the “Event Report Request” (ERR) contained in a Digital Item is made on a resource. The technology achieving this is specified in Part 15 Event Reporting (ER).

In MPEG-7 Systems MPEG had standardised a technology that allows the lossless conversion of a typically very bulky XML document to a binary format, preserving the ability to efficiently parse the binarised XML format. That technology has now been moved to MPEG-B Part 1 “Binary MPEG format for XML” (BiM). Now MPEG-7 Part 1 Systems and MPEG-21 Part 16 Binary format essentially reference the BiM technology specified in MPEG-B Part 1.

There are cases where it is necessary to identify a specific fragment of a resource as opposed to the entire set of data. Part 17 Fragment Identification (FID) specifies a normative syntax for URI Fragment Identifiers to be used for addressing parts of a resource from a number of Internet Media Types.

While part 9 provides a solution to transport a Digital Item in a file, Digital Items may also be transported over a streaming mechanism (e.g. in broadcasting or over IP networks). Therefore part 18 Digital Item Streaming (DIS) provides the technology to achieve this when the streaming mechanism employed is MPEG-2 Transport Stream and RTP/UDP/IP.

Media Value Chain Ontology will provide a standard representation of the terms in a vocabulary and their corresponding relationships for use in media value chains. An example is personal and commercial movies that include not only the movie itself but also related information like movie producer, movie owner, rights and limitations to modify the movie, as well as personal notes available to a certain user group. The ontology will initially focus on the areas of Intellectual Property, Authorisation Models, User Role Description, Context Description, and Social Tagging.

MPEG-A

As is clear from the above, MPEG has produced many component standards. However, technology integration has been left to implementers. The result has been that, e.g. ATSC uses MPEG-2 Systems and Video but a different Audio than specified by MPEG, and DivX uses MPEG-4 Visual, MP3 and AVI. It is obviously within the scope of implementers to make such decisions; however, this has shortcomings. It may take a long time to go from an MPEG standard to a product, while gratuitous incompatibilities between different implementations, which often trouble end users, could be avoided with more careful choices.

With MPEG-A [11] MPEG has decided to engage in the area of “standard integration” considering that MPEG has (most of) the technologies needed, the internal expertise to do the integration job and the appropriate industry representation.

An interesting side-effect of the integration effort is that, while doing the integration, MPEG may discover (and actually has discovered) that not all components are there.

MPEG-A is still in full development (several parts are still to be completed). It currently comprises twelve parts.

Part 1 Purpose for Multimedia Application Formats
Part 2 Music Player Application Format
Part 3 Photo Player Application Format
Part 4 Musical Slide Show Application Format
Part 5 Media Streaming Application Format
Part 6 Professional Archival Application Format
Part 7 Open Access Application Format
Part 8 Portable Video Application Format
Part 9 Digital Multimedia Broadcasting Application Format
Part 10 Video Surveillance Application Format
Part 11 Video Stereoscopic Application Format
Part 12 Interactive Music Application Format

Part 1 Purpose for Multimedia Application Formats is a Technical Report, and lays down the scope and development plan of the project.

Part 2 Music Player Application Format has the purpose of enabling users to achieve an augmented experience of their sound resources by providing an “extended MP3 format”. This is achieved by adding more information in the now-ubiquitous MPEG File Format, namely MP3 Audio compression, MPEG-4/MPEG-21 File Format, an ID3 subset as MPEG-7 metadata and JPEG still picture compression.

Part 3 Photo Player Application Format has the purpose of enabling users to achieve an augmented experience of their photo resources by adding more information to the ubiquitous JPEG File Format, namely

MPEG-7 Visual tools to describe visual properties of the images
MPEG-7 MDS tools to carry simple generic metadata
MPEG-7 System tools to support metadata binarisation
MPEG-4 File Format
JPEG
EXIF (EXchangeable Image format)

The Music Player Application Format was designed as a simple format for enhanced MP3 players, and the Photo Player Application Format combines JPEG still images with MPEG-7 metadata. Part 4 Musical Slide Show Application Format builds on top of the Music Player and the Photo Player Application Formats and is a superset of these two MAFs.

Part 5 Media Streaming Application Format specifies how to use specific MPEG technologies to build a full-fledged media player for streaming governed content. However, in order to have a complete media streaming set-up, it is necessary to deploy a number of devices: a Content Provider Device containing the Digital Items and the actual resources; a License Provider Device containing the associated licences; an IPMP Tool Provider Device that end user devices can access to get any IPMP Tools needed to make the resources usable; a Domain Management Device that handles sets of devices and users and a Media Streaming Player. The standard specifies the data formats and the protocols exchanged between a Media Streaming Player and the other devices.

The purpose of part 6 Professional Archival Application Format is to provide a standard packaging format for carriage of digital multimedia content, metadata to describe context information related to digital multimedia content stored in the archive, metadata to describe the logical structure of how the digital multimedia content is stored in the archive, identification of processing tools that are applied to the digital multimedia content as well as data protection and integrity tools, data governance tools, and data compression tools.

Part 7 Open Access Application Format defines a format designed for users who own rights to a piece of content and have an interest in releasing it in such a way that other users can freely access it, but without making it public domain. The solution is the release of content that is governed in a “light-weight” form. The Open Access Application Format packages different contents into a single container file and provides a mechanism to attach metadata by using MPEG-7 and MPEG-21 technologies. The MPEG-21 REL is used to model the intentions of the license. MPEG-21 Event Reporting provides a feedback mechanism that can notify the author when a user wants to derive content from, or extract an item out of, the container file.

Part 8 Portable Video Application Format defines a format for the use of video files on portable devices giving users the possibility to use the content interactively.

Digital Multimedia Broadcasting (DMB) is a specification for the digital transmission of multimedia signals (especially video services) for mobile reception. Part 9 Digital Multimedia Broadcasting Application Format defines a standard file format that can be used to store DMB content and to exchange it between DMB terminals. The DMB Multimedia Application Format specifies how to combine the variety of DMB contents with associated information in a well-defined format that facilitates interchange, management, editing, and presentation of the DMB contents.

Part 10 Video Surveillance Application Format provides a lightweight wrapper around video content, built from MPEG technologies for video coding, related metadata and file format, and suited to video surveillance.

Part 11 Video Stereoscopic Application Format provides a format for a creator to take and for a service provider to distribute stereoscopic images, enabling users to have more realistic experiences (with or without special glasses) and to store the stereoscopic content for possible redistribution.

Part 12 Interactive Music Application Format defines a format to package interactive music content with audio tracks before mixing, so users can freely control the individual audio tracks. This allows the producer to create several versions (producer mixing 1, producer mixing 2, karaoke, rhythmic, and so on) with just one piece of music, using the metadata structure for mixing information.
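The idea of deriving several versions from one set of unmixed tracks can be sketched as follows. This is a minimal sketch, assuming that mixing information reduces to one gain per track; the `tracks` and `presets` dictionaries and the `mix` function are hypothetical and do not reflect the actual metadata structure of the format.

```python
def mix(tracks, gains):
    """Mix equal-length tracks sample by sample using per-track gains."""
    length = len(next(iter(tracks.values())))
    out = [0.0] * length
    for name, samples in tracks.items():
        gain = gains.get(name, 0.0)  # tracks without a gain are muted
        for i, sample in enumerate(samples):
            out[i] += gain * sample
    return out

# Hypothetical unmixed audio tracks (a few samples each).
tracks = {
    "vocals": [1.0, 0.5, -0.5, 0.0],
    "guitar": [0.25, 0.25, 0.25, 0.25],
    "drums": [0.5, -0.5, 0.5, -0.5],
}

# Hypothetical mixing presets: one piece of music, several versions.
presets = {
    "producer_mix_1": {"vocals": 1.0, "guitar": 0.5, "drums": 0.5},
    "karaoke": {"vocals": 0.0, "guitar": 1.0, "drums": 1.0},
}

karaoke = mix(tracks, presets["karaoke"])
producer = mix(tracks, presets["producer_mix_1"])
```

Because the tracks are stored before mixing, a user-defined preset is just another set of gains applied to the same data.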

MPEG-B

The maturing of multimedia technology is making less compelling the need to provide systems-video-audio “packages” as in previous MPEG standards (up to and including MPEG-7). Indeed various products and services currently available in the marketplace freely mix different technologies from the different standards and MPEG has done the same in its MPEG-A standards. To respond to the continuing need to cope with technological advances with new systems, video and audio standards, MPEG has started three new systems, video and audio standards “containers” called MPEG-B, MPEG-C and MPEG-D, respectively. MPEG-B [12] currently contains five parts.

Part 1 Binary MPEG format for XML
Part 2 Fragment Request Unit
Part 3 XML Representation of IPMP-X messages
Part 4 Codec Configuration Representation
Part 5 Bitstream Syntax Description Language

Part 1 Binary MPEG format for XML (BiM) provides a standard set of generic technologies to transmit and compress XML documents, addressing a broad spectrum of applications and requirements. It relies on schema knowledge between encoder and decoder in order to reach high compression efficiency, and provides fragmentation mechanisms for ensuring transmission and processing flexibility.
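To see why schema knowledge shared by encoder and decoder helps compression, consider the toy coder below. It is not BiM and does not produce a BiM bitstream: it simply replaces each element name with a one-byte index into a tag list that both sides are assumed to know in advance. The tag list, the byte layout and the sample document are assumptions made for this sketch, which ignores attributes and assumes short text values and few children.

```python
import xml.etree.ElementTree as ET

# Hypothetical "schema" known to both encoder and decoder.
SCHEMA_TAGS = ["Mpeg7", "Description", "Title", "Creator"]

def encode(elem, out):
    """Append: tag index, text length, text bytes, child count, children."""
    text = (elem.text or "").strip().encode("utf-8")
    out.append(SCHEMA_TAGS.index(elem.tag))
    out.append(len(text))
    out.extend(text)
    out.append(len(elem))
    for child in elem:
        encode(child, out)
    return out

def decode(data, pos=0):
    """Rebuild an element from the bytes at pos; return (element, next pos)."""
    tag = SCHEMA_TAGS[data[pos]]
    size = data[pos + 1]
    text = data[pos + 2:pos + 2 + size].decode("utf-8")
    pos += 2 + size
    count = data[pos]
    pos += 1
    elem = ET.Element(tag)
    elem.text = text
    for _ in range(count):
        child, pos = decode(data, pos)
        elem.append(child)
    return elem, pos

xml_text = ("<Mpeg7><Description><Title>Sunset</Title>"
            "<Creator>Alice</Creator></Description></Mpeg7>")
binary = bytes(encode(ET.fromstring(xml_text), bytearray()))
decoded, _ = decode(binary)
print(len(xml_text), len(binary))
```

The element names never travel in the binary form, which is where most of the saving comes from in this example; the conversion is lossless for documents that fit the stated restrictions.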

Part 2 Fragment Request Unit specifies a technology enabling a terminal to request XML fragments of immediate interest. This significantly reduces processing and storage requirements at the terminal and can enable applications on constrained devices that would not otherwise be possible.

Part 3 XML Representation of IPMP-X Messages provides an XML representation of the IPMP-X messages defined in MPEG-4 part 13 with extensions.

Part 4 Codec Configuration Representation provides a compressed digital representation of a video decoder and of the corresponding bitstream, assuming that the receiving terminal shares a library of video coding tools with the transmitter.

Part 5 Bitstream Syntax Description Language provides a normative grammar to describe, in XML, the high-level syntax of a bitstream. The resulting XML document is called a Bitstream Syntax Description (BSD). A BSD does not replace the original binary format and, in most cases, it does not describe the bitstream on a bit-by-bit basis, but rather its high-level structure, e.g., how the bitstream is organized in layers or packets of data. A BSD is itself scalable, i.e. it may describe the bitstream at different syntactic layers (e.g., finer or coarser levels of detail), depending on the application.
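The notion of an XML description of high-level bitstream structure can be illustrated with a made-up packet format. In the sketch below each packet has a one-byte type and a two-byte big-endian payload length; the `describe` function, the packet layout and the element and attribute names are all hypothetical and are not taken from BSDL.

```python
import struct
import xml.etree.ElementTree as ET

HEADER = ">BH"  # hypothetical header: 1-byte type, 2-byte payload length
HEADER_SIZE = struct.calcsize(HEADER)

def describe(bitstream):
    """Return an XML element listing where each packet payload lies."""
    root = ET.Element("bitstream", size=str(len(bitstream)))
    pos = 0
    while pos < len(bitstream):
        ptype, length = struct.unpack_from(HEADER, bitstream, pos)
        ET.SubElement(root, "packet", type=str(ptype),
                      start=str(pos + HEADER_SIZE), length=str(length))
        pos += HEADER_SIZE + length
    return root

stream = (struct.pack(HEADER, 1, 4) + b"abcd" +
          struct.pack(HEADER, 2, 2) + b"xy")
description = describe(stream)
print(ET.tostring(description, encoding="unicode"))
```

The description records only the type and position of each packet, not the payload bits, so a tool can decide which packets to keep or drop by reading the XML alone.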

MPEG-C

MPEG-C [13] currently contains four parts.

Part 1 Accuracy specification for implementation of integer-output IDCT
Part 2 Fixed point 8x8 DCT/IDCT
Part 3 Auxiliary Video Data Representation
Part 4 Video Tool Library

Part 1 Accuracy specification for implementation of integer-output IDCT specifies an IDCT accuracy that is equivalent to, or extends, that of the IEEE 1180 standard, which has been withdrawn.

Part 2 Fixed-point 8x8 inverse discrete cosine transform and discrete cosine transform specifies a particular fixed-point approximation to the ideal 8x8 IDCT and DCT function, fulfilling the 8x8 IDCT conformance requirements for the MPEG-1, MPEG-2 and MPEG-4 part 2 video coding standards.
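The relation between an ideal IDCT and an integer approximation of it can be sketched as follows. The code compares a floating-point 8x8 IDCT with a naive integer-only version whose basis is scaled by 2^16, and reports the peak difference over seeded random blocks. The scaling precision, the block count and the pixel range are assumptions of this sketch; it is neither the approximation specified in Part 2 nor the test procedure of Part 1.

```python
import math
import random

N = 8
BITS = 16  # precision of the integer basis; an assumption for this sketch

# Orthonormal DCT-II basis: A[u][i] = 0.5 * C(u) * cos((2i + 1) * u * pi / 16)
A = [[0.5 * (math.sqrt(0.5) if u == 0 else 1.0)
      * math.cos((2 * i + 1) * u * math.pi / 16)
      for i in range(N)] for u in range(N)]
A_INT = [[round(a * (1 << BITS)) for a in row] for row in A]

def forward_dct(block):
    """Ideal forward DCT, rounded to integer coefficients."""
    return [[round(sum(A[u][i] * block[i][j] * A[v][j]
                       for i in range(N) for j in range(N)))
             for v in range(N)] for u in range(N)]

def idct_float(coef):
    """Ideal IDCT in floating point, rounded to integer output."""
    return [[round(sum(A[u][i] * coef[u][v] * A[v][j]
                       for u in range(N) for v in range(N)))
             for j in range(N)] for i in range(N)]

def idct_fixed(coef):
    """Integer-only IDCT using the scaled basis A_INT."""
    half = 1 << (2 * BITS - 1)  # rounding offset before the final shift
    return [[(sum(A_INT[u][i] * coef[u][v] * A_INT[v][j]
                  for u in range(N) for v in range(N)) + half) >> (2 * BITS)
             for j in range(N)] for i in range(N)]

def peak_error(num_blocks=20, seed=0):
    """Largest absolute output difference over seeded random blocks."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(num_blocks):
        pixels = [[rng.randint(-256, 255) for _ in range(N)] for _ in range(N)]
        coef = forward_dct(pixels)
        ref, test = idct_float(coef), idct_fixed(coef)
        worst = max(worst, max(abs(ref[i][j] - test[i][j])
                               for i in range(N) for j in range(N)))
    return worst

print(peak_error())
```

Bounding the peak difference between an implementation and the ideal transform matters because encoder and decoder may use different IDCT implementations, and any mismatch accumulates through motion-compensated prediction.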

Part 3 Auxiliary Video Data Representation specifies how auxiliary data, such as pixel-related depth or parallax values, are to be represented when encoded by MPEG video standards in the same way as ordinary picture data.

Part 4 Video Tool Library contains a collection of descriptions of video coding tools, called Functional Units, as referenced in MPEG-B Part 4.

MPEG-D

MPEG-D, formally ISO/IEC 23003 MPEG Audio Technologies, currently contains three parts.

Part 1 MPEG Surround
Part 2 Spatial Audio Object Coding
Part 3 Unified speech and audio coding

Part 1 MPEG Surround provides an efficient bridge between stereo and multichannel presentations in low-bitrate applications. The MPEG Surround technology supports very efficient parametric coding of multi-channel audio signals, so as to permit transmission of such signals over channels that typically support only the transmission of stereo (or even mono) signals. Moreover, MPEG Surround provides complete backward compatibility with non-multichannel audio systems.

Part 2 Spatial Audio Object Coding represents several audio objects by first combining the object signals into a mono or stereo signal, whilst extracting parameters from the individual object signals based on knowledge of human perception of the sound stage. These parameters are coded as a low bitrate side-channel that the decoder uses to render an audio scene from the stereo or mono down-mix, such that the aspects of the output composition can be decided at the time of decoding.
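The principle of a down-mix plus low-rate parameters can be sketched with a toy example. The code below is not the Spatial Audio Object Coding algorithm: it works on whole frames of a mono down-mix rather than on frequency bands, uses each object's relative energy per frame as the only parameter, and ignores perceptual modelling. The frame size, function names and sample signals are assumptions of this sketch.

```python
import math

FRAME = 2  # samples per parameter frame; an arbitrary choice for this sketch

def encode(objects):
    """Return (mono downmix, per-frame relative energy of each object)."""
    length = len(objects[0])
    downmix = [sum(obj[n] for obj in objects) for n in range(length)]
    params = []
    for start in range(0, length, FRAME):
        energies = [sum(s * s for s in obj[start:start + FRAME])
                    for obj in objects]
        total = sum(energies)
        params.append([e / total if total else 0.0 for e in energies])
    return downmix, params

def render(downmix, params, gains):
    """Scale each downmix frame so its energy follows the requested gains."""
    out = []
    for f, ratios in enumerate(params):
        scale = math.sqrt(sum(g * g * r for g, r in zip(gains, ratios)))
        out.extend(scale * s for s in downmix[f * FRAME:(f + 1) * FRAME])
    return out

# Two hypothetical objects that are active in different frames.
voice = [1.0, 1.0, 0.0, 0.0]
music = [0.0, 0.0, 0.5, 0.5]
downmix, params = encode([voice, music])
voice_only = render(downmix, params, gains=[1.0, 0.0])
```

The gains are chosen at the decoder, after transmission, which is the sense in which the output composition is decided at decoding time. When objects overlap in time, this toy renderer can only approximate the requested mix.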

Part 3 Unified speech and audio coding, a standard still in the early phases of development, aims at defining a single technology that codes speech, music, and speech mixed with music, and that is consistently as good as the best of the state-of-the-art speech coders such as Adaptive Multi Rate – WideBand plus (AMR-WB+) and the state-of-the-art music coders (HE-AAC V2) in the 24 kbit/s stereo to 12 kbit/s mono operating range.

MPEG-E

MPEG-E, also called MPEG Multimedia Middleware (M3W) [14], is a complete set of standards defining technologies required in a multimedia device. It is organised in eight parts.

Part 1 Architecture
Part 2 Multimedia API
Part 3 Component Model
Part 4 Resource and Quality Management
Part 5 Component Download
Part 6 Fault Management
Part 7 System Integrity Management
Part 8 Reference Software and Conformance

Part 1 Architecture describes the M3W architecture and APIs.

Part 2 Multimedia API specifies access to the functionalities provided by conforming multimedia platforms such as Media Processing Services (including coding, decoding and trans-coding), Media Delivery Services (through files, streams, messages), Digital Rights Management (DRM) Services, Access to data (e.g. media content) and Access to, Edit and Search Metadata.

Part 3 Component Model specifies a technology enabling cost effective software development and an increase in productivity through software reuse and easy software integration.

Part 4 Resource and Quality Management specifies a framework for resource management aiming to optimise and guarantee the Quality of Service that is delivered to the end-user in a situation where resources are constrained.

Part 5 Component Download specifies a download framework enabling controlled download of software components to a device.

Part 6 Fault Management specifies a framework for fault management with the goal to have a dependable/reliable system in the context of faults. These can be introduced due to upgrades and extensions out of the control of the device vendor, or because it is impossible to test all traces and configurations in today’s complex software systems.

Part 7 System Integrity Management specifies a framework for integrity management with the goal to have controlled upgrading and extension, in the sense that there is a reduced chance of breaking the system during an upgrade/extension or to provide the ability to restore a consistent configuration.

Part 8 Reference Software and Conformance is the usual complement as with the other MPEG standards.

MPEG-M

MPEG-M, also called MPEG eXtensible Middleware (MXM) [16] is a standard under development whose purpose is to promote the extended use of digital media content through increased interoperability and accelerated development of components, solutions and applications. This is achieved by specifying

1. The MXM architecture
2. The MXM components (by reference)
3. The MXM components APIs
4. The MXM applications API
5. The inter-MXM communication protocols

It is organised in three parts.

Part 1 MXM Architecture and Technologies provides the reference architecture and lists the technologies that are included in the middleware.

Part 2 MXM API provides the APIs of the MXM Engines and of the MXM Orchestrator.

Part 3 MXM Reference Software and Conformance provides the MXM reference software, released as Open Source Software with a business-friendly licence.

Ongoing and future activities

In its 20 years of existence MPEG has operated very much like a company churning out new products (standards) for its customers – the multimedia industry – very often by anticipating industry needs based on industry inputs and internal assessments.

These are some of the areas under investigation, at different stages of development (listed in alphabetical order).

In 3D Video (3DV), a shorter time-scale sub-project, new types of audio-visual systems are supported that allow users to view videos of real 3D space from different user viewpoints. 3DV is expected to be possible with advanced 3D displays, where M dense views must be generated from a sparse set of K transmitted views (typically K≤3) with associated depth data. The allowable range of view synthesis will be relatively narrow (20 degrees view angle from leftmost to rightmost view).
Advanced IPTV Terminal is a standard being developed jointly by MPEG and ITU-T SG16. It is designed to enhance IPTV services by extending terminal capabilities with advanced features such as content generation, processing and distribution by a large number of users; global, seamless and transparent use (regardless of geo-location, service provider, network provider and manufacturer); and diversity of user experience through easy download and installation of applications produced by a global community of developers.
In Free-viewpoinT Video (FTV), a user can set the viewpoint to an almost arbitrary location and direction, which can be static, change abruptly, or vary continuously, within the limits that are given by the available camera setup. In tandem, the audio listening point is changed to track changes in viewpoint.
Image and Video Signature Tools will be a standard supporting ultra-fast search for and identification of images/videos and their modified/edited versions, including a range of deformations, such as coding artifacts, blurring, colour-to-monochrome conversion, noise and geometric deformations such as scaling, rotation and significant cropping.
Information Exchange between Virtual Worlds (MPEG-V) will provide a standard framework enabling the interoperability between virtual worlds (i.e. virtual spaces where people can work, interact, play, travel, learn and augment real life) and aspects of the real world (sensors, actuators, social and welfare systems, banking, insurance, travel, real estate and many others).
The Presentation of Structured Information (PSI) standard will provide the means to present Structured Information, information that can e.g. be represented in XML complying to a given Schema. Presentation of this type of information, e.g. an Electronic Program Guide (EPG) in addition to audio and video is required in most service scenarios. MPEG has native Structured Information types: eXtensible MPEG-4 Textual format (XMT), LASeR, Digital Items, etc. Other forms of Structured Information have been defined by other bodies.
The Representation of Sensory Experience (RoSE) standard will add “Sensory Effects” to an audio-visual bitstream, leading to more realistic experiences in the consumption of audiovisual content. These will include special effects such as turning on a flashbulb for a lightning effect or opening and closing window curtains to create a sensation of fear, as well as fragrance, flame, fog and motion effects produced by scent devices, flame-throwers, fog generators and shaking chairs.
Web, IP and Mobile TV (WIM TV) will be a standard enabling creation and distribution of rich media interactive content through some of the most promising delivery mechanisms thereby bringing the “create once publish anywhere” paradigm one step closer.

Who uses MPEG standards

Many products and services impacting the lives of millions of people are based on MPEG standards. This section mentions the most important ones.

Video CD is the precursor of the DVD. It uses MPEG-1 Systems, Video and Audio Layer II to store one hour of video on a Compact Disc.
Digital Audio Broadcasting uses MPEG-1 Audio Layer II to broadcast stereo audio via radio.
MPEG-1 Audio Layer II is also widely used in digital television set top boxes.
MPEG-1 Audio Layer III (MP3) is the quasi-universal choice for portable music.
MPEG-2 Systems (Transport Stream) and MPEG-2 Video are almost universally used for digital television set top boxes.
MPEG-2 Systems (Program Stream) and MPEG-2 Video are almost universally used for Digital Versatile Disc (DVD).
MPEG-2 Advanced Audio Coding is used in Japanese digital television set top boxes.
MPEG-4 Visual (Simple Profile) is used in most mobile handsets.
MPEG-4 Visual (Advanced Simple Profile) is used to compress video material on Compact Disc.
MPEG-4 Audio in various versions is used in many products (portable music players, mobile handsets etc.).
MPEG-4 Advanced Video Coding is being used in a broad range of products (set top boxes, mobile handsets, portable video players etc.).
MPEG-4 Binary Format for Scenes (BIFS) is used in Digital Multimedia Broadcasting (DMB).
MPEG-4 File Format is used in a variety of application domains, notably to store and exchange video files taken by mobile handsets.
Elements of MPEG-4 Animation Framework eXtension (AFX) are used in mobile games.
Lightweight Application Scene Representation (LASeR) is used in mobile handsets.

Elements of MPEG-7 are used in several commercial applications and referenced by the TV Anytime specifications.

MPEG-21 Digital Item Declaration (DID) is used in commercial products.
Several elements of MPEG-21 have been adopted by the Digital Media Project (DMP) for their open source Chillout® Interoperable DRM Platform.

Conclusions

MPEG is an offspring of traditional standardisation but has continuously innovated itself to cope with evolving technology and the inflow of new industries in need of multimedia standards. Some of the innovations are the definition of bitstream syntax and decoder-only standards, which allow industry to compete in encoders; the definition of profiles and levels to increase interoperability between application domains without burdening some of them with unnecessary features; the execution of subjective tests to verify the performance of the audio and video coding standards; and the release of a normative reference software implementation of a decoder and an informative software implementation of an encoder.

MPEG produces standards that are deliberately kept at a generic level so as to widen their scope of use: more industries can share the format while independently adding the elements that are specific to their application fields. This contrasts with the traditional approach of industries defining vertical standards without consideration of horizontal commonalities.

MPEG provides a unique route to convert new technology into standards because of its process of selecting technologies for introduction in new standards entirely on the basis of commonly agreed technical parameters. This has the advantage that MPEG standards are typically the best technical standards in a given field but also the disadvantage that sometimes a significant number of patents may be needed to practice the standards. Patent pools are typically established to solve this problem.

References

[1] H. Nyquist, "Certain topics in telegraph transmission theory", Trans. AIEE, vol. 47, pp. 617-644, Apr. 1928

[2] W. R. Bennett, "Spectra of Quantized Signals", Bell Syst. Tech. J., vol. 27, pp. 446-472, July 1948

[3] ITU-T Recommendation G.711, Pulse code modulation (PCM) of voice frequencies

[4] ITU-T Recommendation H.120, Codecs for videoconferencing using primary digital group transmission

[5] ISO/IEC 11172, Information Technology – Coding of moving pictures and associated audio at up to about 1.5 Mbit/s

[6] ISO/IEC 13818, Information Technology – Generic coding of moving pictures and associated audio

[7] ISO/IEC 14496, Information Technology – Coding of audio-visual objects

[8] IETF Request for Comments 3640, RTP Payload Format for Transport of MPEG-4 Elementary Streams

[9] ISO/IEC 15938, Information Technology – Multimedia content description interface

[10] ISO/IEC 21000, Information Technology – Multimedia framework

[11] ISO/IEC 23000, Information Technology – Multimedia Application Format

[12] ISO/IEC 23001, Information Technology – MPEG Systems Technologies

[13] ISO/IEC 23002, Information Technology – MPEG Video Technologies

[14] ISO/IEC 23004, Information Technology – MPEG Multimedia Middleware (M3W)

[15] ISO/IEC 23005, Information Technology – Information Exchange with Virtual Worlds

[16] ISO/IEC 23006, Information Technology – MPEG eXtensible Middleware

Internal references

Tomasz Downarowicz (2007) Entropy. Scholarpedia, 2(11):3901.

Arkady Pikovsky and Michael Rosenblum (2007) Synchronization. Scholarpedia, 2(12):1459.

Acronyms

3DV 3D Video
3GPP Third Generation Partnership Project
AAC Advanced Audio Coding
AFX Animation Framework eXtension
AMR-WB+ Adaptive Multi Rate – WideBand plus
ASP Advanced Simple Profile
AVC Advanced Video Coding
BIFS Binary Format for MPEG-4 Scenes
BiM Binary MPEG format for XML
BSD Bitstream Syntax Description
BSDL BSD Language
CD Committee Draft
CD Compact Disc
CELP Code Excited Linear Predictive coding
DAB Digital Audio Broadcasting
DCT Discrete Cosine Transform
DDL Description Definition Language
DIA Digital Item Adaptation
DID Digital Item Declaration
DII Digital Item Identification
DIP Digital Item Processing
DIS Digital Item Streaming
DMB Digital Multimedia Broadcasting
DMIF Delivery Multimedia Integration Framework
DMP Digital Media Project
DPCM Differential PCM
DRM Digital Rights Management
DS Description Schemes
DSM-CC Digital Storage Media Command and Control
EPG Electronic Program Guide
ER Event Reporting
ERR Event Report Request
EXIF EXchangeable Image Format
FCD Final Committee Draft
FDIS Final Draft International Standard
FID Fragment Identification
FTV Free-viewpoinT Video
HE-AAC High Efficiency AAC
IDCT Inverse DCT
IETF Internet Engineering Task Force
IPMP Intellectual Property Management and Protection
IPMP-X IPMP eXtensions
ISO International Organisation for Standardisation
ITU International Telecommunication Union
ITU-T ITU, Telecommunication Standardisation Sector
JVT Joint Video Team
LASeR Lightweight Application Scene Representation
LOD Level of Detail
M3W MPEG Multimedia Middleware
MAF Multimedia Application Format
MDS Multimedia Description Schemes
MP3 MPEG Audio Layer III
MPEG Moving Picture Experts Group
MVC Multiview Video Coding
PAT Persistent Association Technologies
PCM Pulse Code Modulation
PES Packetised Elementary Stream
PS Program Stream
PSI Presentation of Structured Information
RDD Rights Data Dictionary
REL Rights Expression Language
RFC Request For Comments
RoSE Representation of Sensory Experience
RTP Real-time Transport Protocol
SAOL Structured Audio Orchestra Language
SASBF Structured Audio Sample Bank Format
SASL Structured Audio Score Language
SBR Spectral Band Replication
SP Simple Profile
SVC Scalable Video Coding
TS Transport Stream
VHDL VHSIC Hardware Description Language
WIM TV Web, IP and Mobile TV
XML eXtensible Markup Language
XMT eXtensible MPEG-4 Textual format