Yao Wang, Debargha Mukherjee
Compression is essential for efficient storage and transmission of signals. One powerful approach to compression is the application of orthogonal transforms, which convert a group of ${N}$ data samples into a group of ${N}$ transform coefficients. In transform coding, the ${N}$ samples are first transformed, and the coefficients are then individually quantized and entropy coded into binary bits. The transform serves two purposes: one is to compact the energy of the original ${N}$ samples into coefficients with increasingly smaller variances, so that discarding the smaller coefficients introduces only negligible reconstruction error; the other is to decorrelate the original samples, so that the coefficients can be quantized and entropy coded individually without losing compression performance. The Karhunen–Loève transform (KLT) is an optimal transform for a source signal with a stationary covariance matrix, in the sense that it completely decorrelates the original samples and maximizes energy compaction (i.e., it requires the fewest coefficients to reach a target reconstruction error). However, the KLT is signal dependent and cannot be computed with a fast algorithm.
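To make the role of the KLT concrete, the following minimal Python/NumPy sketch (an illustration added here, not code from the original article; the AR(1) parameters are ours) computes a KLT as the eigenvectors of the covariance matrix of a first-order Markov source and verifies that the resulting transform coefficients are uncorrelated:

```python
import numpy as np

N, rho = 8, 0.95
# Covariance of a stationary first-order Markov (AR(1)) source: R[i, j] = rho**|i - j|
R = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))

# KLT basis = eigenvectors of R, ordered by decreasing eigenvalue (coefficient variance)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
klt = eigvecs[:, order].T            # rows are the KLT basis vectors

# The covariance of the transform coefficients is (numerically) diagonal:
# the KLT fully decorrelates the samples, and the diagonal entries (the
# eigenvalues) decay quickly, which is the energy compaction property.
coeff_cov = klt @ R @ klt.T
print(np.round(coeff_cov, 6))
```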
In January 1974, Ahmed et al. published an article titled “The Discrete Cosine Transform” (DCT) [1]. This seminal article introduced a signal-independent transform, called the DCT, which uses real basis functions derived from the family of discrete Chebyshev polynomials. The DCT was shown, via numerical examples, to have energy compaction performance almost as good as the KLT and superior to other well-known signal-independent transforms, including the discrete Fourier transform (DFT), Haar transform, and Walsh–Hadamard transform, for signals that can be modeled as a first-order Markov process with a correlation coefficient close to one. Furthermore, if the source can be modeled as a Gaussian process, the DCT leads to a rate-distortion bound close to that of the KLT and lower than that of the DFT. The article also showed that the ${N}$-point DCT can be obtained from the real part of a modified ${2N}$-point DFT of the zero-extended signal, and thus can be computed efficiently using the fast Fourier transform (FFT) algorithm. The basic research work and events that led to the development of the DCT were summarized in an article titled “How I Came Up With the Discrete Cosine Transform,” by Ahmed in 1991 [2], which reveals that the DCT was first conceived by him in 1972. The DCT was also introduced in a book coauthored by Ahmed and Rao [3].
In a subsequent article in 1982 [4], Flickner and Ahmed proved rigorously that the DCT can be derived as the limiting case of the KLT of a first-order Markov process as the correlation coefficient approaches one. Fortunately, most real-world 1D signals, as well as each dimension of multidimensional signals, can be well modeled by a first-order Markov process, making the DCT superior to other signal-independent orthogonal transforms for signal compression. Ahmed and his collaborators also explored applications of the DCT to the compression of real-world signals: first electrocardiogram signals using a 1D DCT in 1975 [5], then video using a 3D DCT in 1977 [6], and images using a 2D DCT in 1978 [7]. The article by Hein and Ahmed [7] also introduced a faster algorithm for computing the DCT through the Walsh–Hadamard transform, significantly reducing the number of multiplications.
The DCT is a special case of orthonormal transforms for finite-length sequences. Let ${\boldsymbol{s}}$ denote the signal vector, consisting of ${N}$ signal samples ${s}{\left({n}\right)},\,{n} = {0},{1},{\ldots},{N}{-}{1}$, and ${\boldsymbol{t}}$ denote the transform vector, consisting of ${N}$ transform coefficients ${t}{\left({k}\right)},\,{k} = {0},{1},{\ldots},{N}{-}{1}$. The inverse DCT represents ${\boldsymbol{s}}$ as a weighted sum of ${N}$ basis vectors ${\boldsymbol{b}}_{k}$ \[{\boldsymbol{s}} = \mathop{\sum}\limits_{{k} = {0}}\limits^{{N}{-}{1}}{t}{\left({k}\right)}{\boldsymbol{b}}_{k}{.}\]
The elements of the basis vectors are defined by \begin{align*}{b}_{k}{\left({n}\right)} & = {\alpha}_{k}{\cos}{\left(\frac{{\left({2}{n} + {1}\right)}{k}{\pi}}{2N}\right)}, \\ {n} & = {0},{1},{\ldots},{N}{-}{1} \end{align*} \[{\text{with }} \qquad \begin{array}{l}{\alpha}_{0} = {\sqrt{\frac{1}{N}}},\,{\alpha}_{k} = {\sqrt{\frac{2}{N}}}{,} \\ \,\,\,{k} = {1}{,}{2},{\ldots},{N}{-}{1}{.} \end{array} \]
(Note that the DCT introduced in [1] has a different scaling factor ${\alpha}_{k}$. Here we choose to define ${\alpha}_{k}$ so that the basis vectors are orthonormal.) These basis vectors form an orthonormal set, and hence, the coefficient ${t}{\left({k}\right)}$ can be simply obtained by the inner product of ${\boldsymbol{s}}$ and ${\boldsymbol{b}}_{k}$ \[{t}{\left({k}\right)} = \mathop{\sum}\limits_{{n} = {0}}\limits^{{N}{-}{1}}{s}{\left({n}\right)}{b}_{k}{\left({n}\right)}{.}\]
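To make the definitions above concrete, the following Python/NumPy sketch (an illustration added here, not code from the original article) builds the orthonormal DCT basis vectors, verifies their orthonormality, and computes the forward and inverse transforms exactly as in the equations above:

```python
import numpy as np

def dct_basis(N):
    """Rows are the orthonormal DCT basis vectors b_k(n) defined above."""
    n = np.arange(N)
    k = np.arange(N).reshape(-1, 1)
    alpha = np.full(N, np.sqrt(2.0 / N))
    alpha[0] = np.sqrt(1.0 / N)
    return alpha[:, None] * np.cos((2 * n + 1) * k * np.pi / (2 * N))

N = 8
B = dct_basis(N)
assert np.allclose(B @ B.T, np.eye(N))    # the basis vectors are orthonormal

s = np.random.randn(N)
t = B @ s            # forward DCT: inner products of s with the basis vectors
s_rec = B.T @ t      # inverse DCT: weighted sum of the basis vectors
assert np.allclose(s, s_rec)
# The result should match, e.g., scipy.fft.dct(s, type=2, norm='ortho').
```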
Although the DCT is introduced for 1D signals of length ${N}$, it can be easily extended to 2D signals of dimension ${N}\,{\times}\,{N}$ by forming the basis images ${\boldsymbol{B}}_{k,l}$ using the outer product of the DCT basis vectors, i.e., ${\boldsymbol{B}}_{k,l} = {\boldsymbol{b}}_{k}{\boldsymbol{b}}_{l}^{T}$, ${k},{l} = {0},{1},{\ldots},{N}{-}{1}$, and representing an image ${\boldsymbol{S}}$ by \[{\boldsymbol{S}} = \mathop{\sum}\limits_{{k} = {0}}\limits^{{N}{-}{1}} \mathop{\sum}\limits_{{l} = {0}}\limits^{{N}{-}{1}}{t}{\left({k,l}\right)}{\boldsymbol{B}}_{k,l}{.}\]
The transform coefficients ${t}{\left({k,l}\right)}$ can be obtained by first applying the length-${N}$ 1D DCT to each row of the image and then applying the length-${N}$ 1D DCT to each column of the intermediate result, leading to the final transform image ${\boldsymbol{T}}$, also of dimension ${N}\,{\times}\,{N}$. The reverse operation can recover the original 2D image from the transform image. Similarly, the DCT can be extended to 3D or even higher dimensional signals.
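A minimal sketch of this separable computation, again using NumPy and the orthonormal 1D DCT matrix defined above (the block size of 8 is illustrative):

```python
import numpy as np

N = 8
n, k = np.arange(N), np.arange(N).reshape(-1, 1)
alpha = np.full(N, np.sqrt(2.0 / N)); alpha[0] = np.sqrt(1.0 / N)
B = alpha[:, None] * np.cos((2 * n + 1) * k * np.pi / (2 * N))   # 1D DCT matrix (rows = b_k)

S = np.random.randn(N, N)    # an N x N image block
T = B @ S @ B.T              # 1D DCT of every column, then of every row: t(k, l)
S_rec = B.T @ T @ B          # inverse: sum over k, l of t(k, l) * outer(b_k, b_l)
assert np.allclose(S, S_rec)
```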
The DCT can also be derived by first symmetrically extending the original length-${N}$ sequence into a length-${2N}$ sequence and then applying the ${2N}$-point DFT. Depending on the type of symmetric extension, various versions of the DCT arise, which are commonly numbered from DCT-I to DCT-VIII; the original version proposed in [1] corresponds to DCT-II [8]. This relation of the DCT to the DFT allows a fast hardware or software implementation of the DCT using the butterfly principle known from the FFT, which initially helped to pave the way for the DCT’s use in practical compression algorithms.
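The following sketch illustrates this route for the DCT-II: the length-N sequence is symmetrically extended to length 2N, a 2N-point FFT is applied, and the DCT coefficients are recovered from the phase-compensated real part; the result is cross-checked against the direct basis-vector definition. The function name is ours, and a production implementation would use a dedicated fast DCT rather than this didactic version:

```python
import numpy as np

def dct_via_fft(x):
    """Orthonormal DCT-II of x via a 2N-point FFT of its symmetric extension."""
    N = len(x)
    y = np.concatenate([x, x[::-1]])       # symmetric extension to length 2N
    Y = np.fft.fft(y)
    k = np.arange(N)
    t = 0.5 * np.real(np.exp(-1j * np.pi * k / (2 * N)) * Y[:N])
    alpha = np.full(N, np.sqrt(2.0 / N)); alpha[0] = np.sqrt(1.0 / N)
    return alpha * t

# Cross-check against the direct definition from the basis vectors.
N = 8
x = np.random.randn(N)
n, k = np.arange(N), np.arange(N).reshape(-1, 1)
alpha = np.full(N, np.sqrt(2.0 / N)); alpha[0] = np.sqrt(1.0 / N)
B = alpha[:, None] * np.cos((2 * n + 1) * k * np.pi / (2 * N))
assert np.allclose(dct_via_fft(x), B @ x)
```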
To illustrate that the DCT has energy compaction performance close to that of the KLT for natural images, we compare the average ${K}$-term approximation errors of different transforms when each 8 × 8 block of an image is represented with the ${K}$ coefficients that have the largest variances. This experiment is conducted using the 24 images in the Kodak image dataset [9]. To derive the KLT, we determine the empirical covariance matrix from all nonoverlapping 8 × 8 image blocks (each ordered into a 1D vector of dimension 64) of all images in the dataset and order the resulting eigenvectors by decreasing eigenvalue. For each of the other transforms, including the DCT, Hadamard transform, and the DFT, we evaluate empirically the variance of each transform coefficient. Note that for the DFT, which has complex coefficients, the variance of a coefficient is the sum of the variance of its real part and that of its imaginary part. The ${K}$ coefficients with the largest variances correspond to ${K}$ distinct real numbers, because half of the DFT coefficients are complex conjugates of the other half.
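The following sketch outlines this experiment under simplifying assumptions (a synthetic grayscale image stands in for the Kodak set, only the empirical KLT and the DCT are compared, and all names are ours); it is meant to show the mechanics rather than reproduce Figure 1 exactly:

```python
import numpy as np

def blocks_from_image(img, b=8):
    """Nonoverlapping b x b blocks of a 2D array, each flattened to a length b*b vector."""
    H, W = img.shape
    img = img[: H - H % b, : W - W % b]
    return img.reshape(H // b, b, W // b, b).swapaxes(1, 2).reshape(-1, b * b)

def kterm_mse(blks, basis, K):
    """Per-pixel MSE when each block keeps only the K coefficients whose empirical
    variances are largest; the rows of `basis` are orthonormal, so the error equals
    the energy of the dropped coefficients."""
    coeffs = blks @ basis.T
    keep = np.argsort(coeffs.var(axis=0))[::-1][:K]
    dropped = np.delete(coeffs, keep, axis=1)
    return (dropped ** 2).sum(axis=1).mean() / blks.shape[1]

# Synthetic stand-in image; for the real experiment, stack blocks from all Kodak images.
img = np.random.rand(256, 256)
blks = blocks_from_image(img)

# Empirical KLT: eigenvectors of the block covariance, ordered by decreasing eigenvalue.
R = np.cov(blks, rowvar=False)
w, V = np.linalg.eigh(R)
klt = V[:, np.argsort(w)[::-1]].T

# 2D DCT basis images, flattened: outer products of the 1D DCT basis vectors.
N = 8
n, k = np.arange(N), np.arange(N).reshape(-1, 1)
alpha = np.full(N, np.sqrt(2.0 / N)); alpha[0] = np.sqrt(1.0 / N)
B1 = alpha[:, None] * np.cos((2 * n + 1) * k * np.pi / (2 * N))
dct = np.stack([np.outer(B1[i], B1[j]).ravel() for i in range(N) for j in range(N)])

for K in (4, 8, 16):
    print(K, kterm_mse(blks, klt, K), kterm_mse(blks, dct, K))
```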
Figure 1 shows that the DCT is almost as good as the KLT in energy compaction, yielding a lower approximation error than the Hadamard transform and the DFT for the same ${K}$. Although this was demonstrated for first-order Markov processes with high correlation in the original DCT article [1], it is reassuring that it also holds true for natural images. Figure 2 compares the basis images of the DCT and the KLT. Like the DCT, the KLT includes basis images with horizontal and vertical patterns of different frequencies, as well as some basis images with mixed directions. Interestingly, the KLT also includes other directional patterns that the DCT does not (because the 2D DCT bases are obtained from the outer product of 1D bases).
Figure 1. K-term approximation error versus K using different transforms applied to 8 × 8 image blocks. The results shown are obtained by averaging over all nonoverlapping 8 × 8 image blocks in the Kodak image dataset. The coefficients are ordered based on their empirical variances. MSE: mean-square error.
Figure 2. The basis images of (a) the DCT and (b) KLT, derived from the images in the Kodak image dataset.
Since its invention in the 1970s, the DCT has been the bedrock of image and video compression. The DCT is the basis for JPEG, the first lossy image compression format, introduced by the Joint Photographic Experts Group (JPEG) of the International Organization for Standardization (ISO) in 1992. In a nutshell, the JPEG encoder partitions an image into small blocks and applies a 2D DCT to each block. The coefficients are then quantized and entropy coded. DCT-based image coding exploits both the energy compaction and decorrelation properties of the DCT and the insensitivity of the human visual system to high-frequency components, allowing high-frequency DCT coefficients to be quantized more heavily without introducing visible distortion. JPEG greatly reduces the amount of storage required to represent an image at the cost of a relatively small reduction in image quality and has become the most widely used image file format. It also greatly reduces the bandwidth needed to upload or download an image to or from the Internet, through either a wired or wireless connection. The highly efficient JPEG compression algorithm made possible the wide proliferation of digital images and photos. The JPEG codec is embedded in virtually all consumer cameras, smartphones, and other photographic capture devices, and it enabled the interchange and sharing of photographic images on the Internet. Although a later image coding standard, JPEG 2000, used the more powerful full-frame wavelet transform, the original JPEG encoder is more hardware friendly and is likely to coexist with JPEG 2000 for decades to come.
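The following minimal sketch illustrates the block-DCT-plus-quantization idea at the heart of a JPEG-style encoder. Entropy coding and chroma handling are omitted, the quantization table values follow a widely used example luminance table, and the code is an illustration rather than a compliant JPEG implementation:

```python
import numpy as np

# Orthonormal 1D DCT matrix, as defined earlier in the article.
N = 8
n, k = np.arange(N), np.arange(N).reshape(-1, 1)
alpha = np.full(N, np.sqrt(2.0 / N)); alpha[0] = np.sqrt(1.0 / N)
B = alpha[:, None] * np.cos((2 * n + 1) * k * np.pi / (2 * N))

# Example luminance quantization table: larger steps at high frequencies exploit
# the eye's reduced sensitivity there.
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=float)

def encode_block(block):
    """Level shift, 2D DCT, and uniform quantization of each coefficient."""
    T = B @ (block - 128.0) @ B.T
    return np.round(T / Q).astype(int)     # integer indices to be entropy coded

def decode_block(q):
    """Dequantization followed by the inverse 2D DCT."""
    return B.T @ (q * Q) @ B + 128.0

block = np.random.randint(0, 256, (8, 8)).astype(float)
rec = decode_block(encode_block(block))
print(np.abs(block - rec).mean())          # mean absolute reconstruction error
```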
Figure 3 shows sample JPEG-compressed images at different bit rates. We can see that a color image of 24 bits/pixel (bpp) can be compressed to less than 1 bpp without noticeable artifacts. At even lower bit rates, JPEG tends to produce visible blocking artifacts because the DCT is applied to each image block separately and the DCT coefficients are quantized independently. Today, the intracoding methods in the latest video coding standards, such as high-efficiency video coding (HEVC), VP9, AV1, VVC, and EVC, use the DCT as one of several possible block transforms. The intra-only frame coding method in HEVC is one of the best-performing image coding methods today, standardized by the ISO/International Electrotechnical Commission as part of the High Efficiency Image File Format and available freely as the Better Portable Graphics codec [10]. More recently, the intra-only frame coding method in AV1 was released as the AVIF image coding format by the Alliance for Open Media.
Figure 3. (a) The original image represented using 24 bits/pixel (bpp). (Source: https://unsplash.com/photos/r-nJDGpjRic.) (b) and (c) JPEG compressed images at 0.371 bpp and 0.247 bpp, respectively. In (b), the image is compressed by a factor of 64 and yet is almost indistinguishable from the original image. In (c) (with a compression factor of 97), the image has noticeable artifacts in the cloud and water regions due to the heavy quantization of DCT coefficients, but is otherwise acceptable.
The DCT has also been the basis of all video coding standards from 1988 to the present. In these standards, a video frame is either coded standalone (as an intraframe) using the DCT, optionally together with a few other transforms, or coded in predictive mode, where the frame is first predicted from the previous and possibly following frames, and the prediction error is then coded as an image, again predominantly using the DCT and optionally a few other transforms. The DCT is the core technology behind all the video coding standards established by the International Telecommunication Union (ITU) and the Moving Picture Experts Group (MPEG) of ISO, including H.261 (1988), MPEG-1 (1993), MPEG-2 (1995), H.263 (1998), MPEG-4 (1998), H.264/AVC (2003), H.265/HEVC (2013), and H.266/VVC (2020). The MPEG-1 video standard enabled wide distribution of low-quality movies on CDs and over the Internet in the early 1990s. The MPEG-2 video standard enabled digital video broadcasting over terrestrial, cable, and satellite channels, replacing analog TV in the late 1990s, as well as the distribution of high-quality digital movies on DVD. The H.264/AVC video coding standard facilitated the explosion of video content on the Internet and on smartphones in the early 2000s. The H.265/HEVC video standard has enabled even higher-quality video applications since 2013 and, for example, is used for terrestrial broadcasting within Digital Video Broadcasting-Second Generation Terrestrial. The latest video standard, H.266/VVC, focuses on higher-resolution, higher-dynamic-range, and 360° video. Although the compression methodologies have become increasingly complex with each new standard, they all adopt the basic block-based hybrid prediction-plus-transform coding framework and utilize the DCT for coding the prediction error blocks.
In the earlier standards (H.261, MPEG-1, MPEG-2, H.263, and MPEG-4), the DCT was used exclusively for coding the intraframes or prediction residual frames. In later standards, the DCT and several other transforms are chosen adaptively to optimize coding efficiency. Specifically, the standards established in the last decade, starting with H.265/HEVC (2013), have added other discrete trigonometric transforms, such as certain forms of the discrete sine transform (DST), for predictive intracoding and, to a lesser extent, intercoding, but the DCT remains the most widely used transform in all these codecs.
In parallel with the ITU/MPEG effort, the WebM project developed an open, royalty-free format, VP9 (2012), that also predominantly uses the DCT for intercoding, along with the DST for intracoding. Later, the Alliance for Open Media, an industry consortium, developed AV1 (2018), a more advanced open, royalty-free video format, which also adopted the DCT as a core technology, along with certain forms of the DST. VP9 and AV1 are used in popular video streaming applications such as YouTube and Netflix.
The modified DCT (MDCT), a DCT variant, was developed by Princen et al. in 1987 [11]. The MDCT is based on DCT-IV, with the additional property of being lapped; that is, it is designed to be applied to overlapped blocks of samples in a long sequence so that the last half of one block coincides with the first half of the next block. This overlapping helps suppress artifacts stemming from the block boundaries. The MDCT is employed in most of the modern audio coding standards, including MP3, Dolby Digital (AC-3), Advanced Audio Coding, Dolby AC-4, and MPEG-H 3D Audio.
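A minimal sketch of this lapped structure, under one common normalization convention (block length, names, and scaling are ours): 50%-overlapped blocks are windowed with a sine window satisfying the Princen–Bradley condition, transformed, inverse transformed, windowed again, and overlap-added, so that the time-domain aliasing cancels for all interior samples:

```python
import numpy as np

def mdct_matrix(N):
    """MDCT basis mapping 2N windowed samples to N coefficients (one common scaling)."""
    n = np.arange(2 * N)
    k = np.arange(N).reshape(-1, 1)
    return np.sqrt(2.0 / N) * np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))

N = 64
M = mdct_matrix(N)
w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # sine window: w[n]**2 + w[n+N]**2 = 1

x = np.random.randn(16 * N)
rec = np.zeros_like(x)
for start in range(0, len(x) - 2 * N + 1, N):            # 50%-overlapped blocks
    blk = x[start:start + 2 * N]
    coeffs = M @ (w * blk)                                # forward MDCT of the windowed block
    rec[start:start + 2 * N] += w * (M.T @ coeffs)        # windowed inverse MDCT, overlap-added

# Time-domain aliasing cancels wherever two blocks overlap; only the first and
# last N samples, covered by a single block, are not reconstructed exactly.
assert np.allclose(x[N:-N], rec[N:-N])
```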
Apart from its main application in compression, the DCT has also found its way into other areas of digital signal processing, leveraging its frequency-based signal representation. One example is image denoising with the BM3D algorithm [12], which applies a 3D DCT to the 3D stack formed by grouping similar blocks within an image and thresholds the resulting transform coefficients. The DCT has also played an important role in advancing automatic speech recognition and automatic speaker recognition, where it is used to compute mel-frequency cepstral coefficients, the features used by the recognition algorithms [13].
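As a much-simplified illustration of transform-domain denoising (BM3D additionally groups similar blocks and filters them jointly, which is omitted here), the following sketch shrinks the small DCT coefficients of a noisy block and inverts the transform. SciPy's dctn/idctn routines are assumed available, and the test signal and threshold are illustrative:

```python
import numpy as np
from scipy.fft import dctn, idctn   # multidimensional orthonormal DCT and its inverse

rng = np.random.default_rng(0)
clean = np.outer(np.sin(np.linspace(0, np.pi, 16)), np.cos(np.linspace(0, np.pi, 16)))
sigma = 0.1
noisy = clean + sigma * rng.standard_normal(clean.shape)

# The smooth block is represented by a few large DCT coefficients, while the noise
# spreads evenly over all coefficients; shrinking the small coefficients therefore
# removes mostly noise.
T = dctn(noisy, norm="ortho")
T_hat = np.sign(T) * np.maximum(np.abs(T) - 2 * sigma, 0.0)   # soft thresholding
denoised = idctn(T_hat, norm="ortho")

print("noisy MSE:   ", np.mean((noisy - clean) ** 2))
print("denoised MSE:", np.mean((denoised - clean) ** 2))
```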
Today, people take for granted that they can capture and edit high-quality images and videos and freely share them with their friends and families, and that they can watch high-quality images and movies almost anywhere. During the COVID-19 pandemic, video conferencing platforms served as essential tools for family members and friends to connect with each other and for corporations and educational institutions to conduct remote work and learning. Behind all of these applications lie video codecs that utilize the DCT. The societal and economic impact of the DCT is immeasurable! The impact of the DCT is beautifully explained in a February 2021 episode of the TV series “This Is Us,” which showed that families were able to share the important milestones of their lives during the pandemic thanks to the invention of the DCT by Ahmed [14].
The authors would like to thank Prof. Sanjit Mitra (University of California, Santa Barbara) for encouraging and guiding them in preparing this article. They also thank Zhongzheng Yuan (New York University Tandon School of Engineering) for generating the experimental results in Figures 1–3.
Yao Wang (yaowang@nyu.edu) received her Ph.D. degree in electrical and computer engineering from the University of California, Santa Barbara, in 1990. She is a professor in the Department of Electrical and Computer Engineering and the Department of Biomedical Engineering, New York University Tandon School of Engineering, New York, NY 11201 USA. Her research interests include video compression and delivery, and computer vision and medical image analysis. She is a Fellow of IEEE.
Debargha Mukherjee (debargha@google.com) received his Ph.D. degree in electrical and computer engineering from the University of California, Santa Barbara, in 1999. He is a principal engineer at Google, Mountain View, CA 94043 USA, where he leads open next-generation video codec development. His research interests include conventional and learning-based image and video compression, and related areas. He is a Fellow of IEEE.
[1] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Trans. Comput., vol. C-23, no. 1, pp. 90–93, Jan. 1974, doi: 10.1109/T-C.1974.223784.
[2] N. Ahmed, “How I came up with the discrete cosine transform,” Digit. Signal Process., vol. 1, no. 1, pp. 4–5, Jan. 1991, doi: 10.1016/1051-2004(91)90086-Z.
[3] N. Ahmed and K. R. Rao, Orthogonal Transforms for Digital Signal Processing. Berlin, Germany: Springer-Verlag, 1975.
[4] D. Flickner and N. Ahmed, “A derivation for the discrete cosine transform,” Proc. IEEE, vol. 70, no. 9, pp. 1132–1134, Sep. 1982, doi: 10.1109/PROC.1982.12439.
[5] N. Ahmed, P. J. Milne, and S. G. Harris, “Electrocardiographic data compression via orthogonal transforms,” IEEE Trans. Biomed. Eng., vol. BME-22, no. 6, pp. 484–487, Nov. 1975, doi: 10.1109/TBME.1975.324469.
[6] T. Natarajan and N. Ahmed, “On interframe transform coding,” IEEE Trans. Commun., vol. 25, no. 11, pp. 1323–1329, Nov. 1977, doi: 10.1109/TCOM.1977.1093769.
[7] D. Hein and N. Ahmed, “On a real-time Walsh-Hadamard/Cosine Transform image processor,” IEEE Trans. Electromagn. Compat., vol. EMC-20, no. 3, pp. 453–457, Aug. 1978, doi: 10.1109/TEMC.1978.303679.
[8] S. A. Martucci, “Symmetric convolution and the discrete sine and cosine transforms,” IEEE Trans. Signal Process., vol. 42, no. 5, pp. 1038–1051, May 1994, doi: 10.1109/78.295213.
[9] Kodak image dataset, Jan. 27, 2013. [Online]. Available: https://r0k.us/graphics/kodak/
[10] S. Anthony. “BPG: A new, superior image format that really ought to kill off JPEG.” ExtremeTech. Accessed: Dec. 12, 2014. [Online]. Available: https://www.extremetech.com/computing/195856-bpg-a-new-superior-image-format-that-really-ought-to-kill-off-jpeg
[11] J. P. Princen, A. W. Johnson, and A. B. Bradley, “Subband/Transform coding using filter bank designs based on time domain aliasing cancellation,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1987, vol. 12, pp. 2161–2164, doi: 10.1109/ICASSP.1987.1169405.
[12] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Trans. Image Process., vol. 16, no. 8, pp. 2080–2095, Aug. 2007, doi: 10.1109/TIP.2007.901238.
[13] M. Sahidullah and G. Saha, “Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition,” Speech Commun., vol. 54, no. 4, pp. 543–565, May 2012, doi: 10.1016/j.specom.2011.11.004.
[14] “‘This Is Us’ honored a real-life ‘Genius’ who made it possible for the Pearsons to connect amid COVID,” People, Feb. 16, 2021. [Online]. Available: https://people.com/tv/this-is-us-nasir-ahmed-pearsons-births-covid/
Digital Object Identifier 10.1109/MSP.2023.3282775