Pixel Conversion with ARM NEON Intrinsics

 

Many embedded video cameras and webcams produce a packed pixel format known as YUYV (sometimes called YUY2). However, video encoders, including many H.264 implementations, often accept only the planar YUV 4:2:0 pixel format (sometimes called I420), so a converter between the two formats is required. Because this converter processes every frame of video, it needs to be efficient. In this post, I’ll discuss the two formats and show how to convert between them efficiently using ARM® NEON intrinsics, a technique we developed at Lanikai Labs to achieve real-time 25 fps 720p H.264 encoding on a Google Coral EdgeTPU.

Background

First, what are Y, U, and V? Y stands for “luminance”, or light intensity: think grayscale television. The name is a holdover from the early days of television, when a 3D (X, Y, Z) coordinate system was used and the Y axis represented luminance. U and V are orthogonal color channels:

[Figure: the U-V color plane. Source: Tonyle, CC BY-SA 3.0]

Now let’s take a 4x4 pixel YUV 4:2:0 image as an example:

$(Y_{00}, U_{00}, V_{00})$ $(Y_{01}, U_{00}, V_{00})$ $(Y_{02}, U_{01}, V_{01})$ $(Y_{03}, U_{01}, V_{01})$
$(Y_{10}, U_{00}, V_{00})$ $(Y_{11}, U_{00}, V_{00})$ $(Y_{12}, U_{01}, V_{01})$ $(Y_{13}, U_{01}, V_{01})$
$(Y_{20}, U_{10}, V_{10})$ $(Y_{21}, U_{10}, V_{10})$ $(Y_{22}, U_{11}, V_{11})$ $(Y_{23}, U_{11}, V_{11})$
$(Y_{30}, U_{10}, V_{10})$ $(Y_{31}, U_{10}, V_{10})$ $(Y_{32}, U_{11}, V_{11})$ $(Y_{33}, U_{11}, V_{11})$

Notice how each of the 16 pixels in the image gets its own Y value, but each U and V value is shared by a 2x2 group of 4 pixels. In effect, U and V have half the resolution of Y in each dimension.

Now, for a hardware device like a CMOS image sensor, the most straightforward way to output this data is to interleave it:

$Y_{00}$ $U_{00}$ $Y_{01}$ $V_{00}$ $Y_{02}$ $U_{01}$ $Y_{03}$ $V_{01}$

But for compression, it’s best to keep similar values together in a block, so a planar format is more desirable:

$Y_{00}$ $Y_{01}$ $Y_{02}$ $Y_{03}$ $U_{00}$ $U_{01}$ $V_{00}$ $V_{01}$
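To make the two layouts concrete, here is a minimal sketch of the buffer arithmetic. The helper names and the `i420_layout` struct are hypothetical (they do not come from our codebase), and the sketch assumes an even width and height with no row padding.

```c
#include <stddef.h>

// Hypothetical helpers for illustration; width and height are assumed even.

// Bytes in a packed YUYV frame: every 2 pixels share 4 bytes (Y0 U Y1 V).
static size_t yuyv_frame_size(size_t width, size_t height) {
    return width * height * 2;
}

// Byte sizes of the three planes of a planar YUV 4:2:0 (I420) frame.
typedef struct {
    size_t y_size;  // one luma byte per pixel
    size_t u_size;  // one U byte per 2x2 block of pixels
    size_t v_size;  // one V byte per 2x2 block of pixels
} i420_layout;

static i420_layout i420_plane_sizes(size_t width, size_t height) {
    i420_layout l;
    l.y_size = width * height;
    l.u_size = (width / 2) * (height / 2);
    l.v_size = l.u_size;
    return l;
}
```

For a 1280x720 frame this works out to 1,843,200 bytes of YUYV against 1,382,400 bytes of I420 (921,600 for Y plus 230,400 each for U and V).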

So some reordering is required. And because this reordering has to happen for every frame from the sensor, it had better be fast. We’ll use ARM NEON SIMD instructions to get the speed we need.

Implementation

Since code is worth a thousand words:

//////////////////////////////////////////////////////////////////////////////
//
// Convert YUYV to YUV420P
//
// YUYV is a packed format, where luma and chroma are interleaved. Each sample
// is 8 bits, and every pair of pixels occupies 4 bytes (16 bits per pixel):
//
//     YUYVYUYV...
//     YUYVYUYV...
//     ...
//
// Color is subsampled horizontally.
//
//
// YUV420 is a planar format, and the most common H.264 pixel format. For each
// 2x2 square of pixels, there are 4 luma values and 2 chroma values. Each
// value is 8 bits; however, there are 4*8 + 8 + 8 = 48 bits total for 4
// pixels, so on average there are effectively 12 bits per pixel:
//
// YYYY...  U.U..   V.V..
// YYYY...  .....   .....
// YYYY...  U.U..   V.V..
// YYYY...  .....   .....
// .......  .....   .....
//
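// Arguments:
// y:      Pointer to planar destination buffer for luma.
// u:      Pointer to planar destination buffer for U chroma.
// v:      Pointer to planar destination buffer for V chroma.
// yuyv:   Pointer to packed source buffer.
// stride: Stride (in bytes) of source buffer. Assumed to equal 2 * width,
//         that is, rows carry no padding.
// height: Height (in rows) of the frame. Assumed to be even.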
//
// Author: Chris Hiszpanski <chris@lanikailabs.com>
// Copyright 2019 Lanikai Labs LLC. All rights reserved.
//
//////////////////////////////////////////////////////////////////////////////

void yuyv_to_yuv420p(
    uint8_t *y, uint8_t *u, uint8_t *v,
    uint8_t *yuyv,
    int stride, int height
) {
    for (int row = 0; row < height; row += 2) {
        unpack_even(y, u, v, yuyv, stride);
        y    += stride / 2;
        u    += stride / 4;
        v    += stride / 4;
        yuyv += stride;

        unpack_odd(y, yuyv, stride);
        y    += stride / 2;
        yuyv += stride;
    }
}

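Notice how odd and even rows are treated differently: even rows contribute both luma and chroma, while odd rows contribute only luma and their chroma samples are dropped. Since our eyes have more rods than cones, and thus have more luminance than color resolution, video frames allocate twice as much space to luma as to chroma.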
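Before reaching for SIMD, it can help to have a plain scalar version of the whole conversion to check results against. The following is a minimal sketch, not the code we shipped: the name `yuyv_to_yuv420p_ref` is hypothetical, it takes a width in pixels rather than a stride, and it assumes an even width and height with no row padding.

```c
#include <stddef.h>
#include <stdint.h>

// Scalar reference converter (hypothetical name, for illustration only).
// Assumes width and height are even and that each source row is exactly
// 2 * width bytes long.
static void yuyv_to_yuv420p_ref(
    uint8_t *y, uint8_t *u, uint8_t *v,
    const uint8_t *yuyv,
    int width, int height
) {
    for (int row = 0; row < height; row++) {
        const uint8_t *src = yuyv + (size_t)row * (size_t)width * 2;
        for (int col = 0; col < width; col += 2) {
            // Each 4-byte group holds Y0 U Y1 V for two adjacent pixels.
            *y++ = src[0];
            *y++ = src[2];
            if (row % 2 == 0) {
                // Chroma is kept from even rows only.
                *u++ = src[1];
                *v++ = src[3];
            }
            src += 4;
        }
    }
}
```

Comparing the planes produced by this function with those from an optimized path on a few test frames is a cheap way to catch indexing mistakes.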

Now for the good stuff. Here is an inline function that uses ARM NEON intrinsics to unpack odd-numbered rows. Briefly, an intrinsic is a function that compiles to one or more specialized SIMD assembly instructions:

//////////////////////////////////////////////////////////////////////////////
//
// Unpack an odd row. Odd rows contain only luma.
//
// Arguments:
// y:      Pointer to planar destination buffer for luma.
// yuyv:   Pointer to packed source buffer.
// stride: Stride (in bytes) of source buffer.
//
// Author: Chris Hiszpanski <chris@lanikailabs.com>
// Copyright 2019 Lanikai Labs LLC. All rights reserved.
//
//////////////////////////////////////////////////////////////////////////////

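#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

static inline void unpack_odd(uint8_t *y, uint8_t *yuyv, int stride) {
#if defined(__ARM_NEON)
    // Each iteration consumes 32 source bytes (16 pixels), so stride is
    // assumed to be a multiple of 32.
    for (int i = 0; i < stride; i += 32) {
        // vld2q_u8 de-interleaves even bytes (luma) into val[0] and odd
        // bytes (chroma) into val[1]. Only the luma is stored.
        vst1q_u8(y, vld2q_u8(yuyv).val[0]);

        yuyv += 32;
        y    += 16;
    }
#else
    // Portable fallback: copy every other byte (the luma samples).
    for (int i = 0; i < stride; i += 4) {
        *y++ = *yuyv++;
        yuyv++;
        *y++ = *yuyv++;
        yuyv++;
    }
#endif
}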

I’ll leave the even rows as an exercise for the reader.
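For readers who want a starting point, here is one possible approach. It is a sketch written for this post rather than the implementation we shipped. It keeps the `unpack_even` signature used by `yuyv_to_yuv420p` above, and on NEON it assumes the stride is a multiple of 64 bytes (true for 720p, where the stride is 2560). The idea is to let `vld4q_u8` split each Y0 U Y1 V group into four vectors, then use `vst2q_u8` to weave the two luma vectors back into pixel order.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Sketch of an even-row unpacker: writes luma to y and chroma to u and v.
// With NEON, stride is assumed to be a multiple of 64 bytes.
static inline void unpack_even(
    uint8_t *y, uint8_t *u, uint8_t *v, uint8_t *yuyv, int stride
) {
#if defined(__ARM_NEON)
    for (int i = 0; i < stride; i += 64) {
        // De-interleave 64 bytes into four 16-byte vectors:
        //   val[0] = Y of even pixels, val[1] = U,
        //   val[2] = Y of odd pixels,  val[3] = V.
        uint8x16x4_t in = vld4q_u8(yuyv);

        // Re-interleave the two luma vectors to restore pixel order.
        uint8x16x2_t luma;
        luma.val[0] = in.val[0];
        luma.val[1] = in.val[2];
        vst2q_u8(y, luma);

        vst1q_u8(u, in.val[1]);
        vst1q_u8(v, in.val[3]);

        yuyv += 64;
        y    += 32;
        u    += 16;
        v    += 16;
    }
#else
    // Portable fallback: route each byte of a Y0 U Y1 V group to its plane.
    for (int i = 0; i < stride; i += 4) {
        *y++ = *yuyv++;
        *u++ = *yuyv++;
        *y++ = *yuyv++;
        *v++ = *yuyv++;
    }
#endif
}
```

Processing 64 bytes per iteration keeps the loop to one load and three stores. If your stride is only a multiple of 32, two `vld2q_u8` loads followed by a second de-interleave of the chroma vector would be an alternative.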

Results

The results are encouraging. I ran a test on the Google Coral EdgeTPU using libx264 for H.264 baseline encoding. Without NEON intrinsics, the frame rate topped out at ~5 frames/sec, on par with the frame rate achieved by ffmpeg. With NEON intrinsics, the frame rate topped out at over 27 frames/sec, which is real-time given that the sensor outputs 25 frames/sec. That is nearly a 6x speedup, and the difference between a choppy video call and a smooth one.