
38 New Horizon in Mobile Development Android Audio and Video Development #

Hi, I’m Zhang Shaowen. Junjie is currently responsible for audio and video development at WeChat; both the short videos we use every day and the newly launched “In a Moment” videos are his work. Junjie has rich experience in audio and video development, and today we are fortunate to have him share his personal insights and experience in Android audio and video development.

In our daily lives, video applications are taking up more and more of our time, and major companies are also entering this battlefield one after another. Whether it’s short-video apps like Douyin and Kuaishou, live-streaming apps like Huya and Douyu, long-video apps like Tencent Video, iQiyi, and Youku, or video editing and beauty apps like Vue and Meitu, there’s always one that suits you.

In the future, with the popularity of 5G and the decrease in network fees, the prospects for audio-video are very promising. However, on the other hand, whether it’s video encoding and decoding, player development, video editing and beauty algorithms, or the integration of video and artificial intelligence (AI video editing, video restoration, high-definition enhancement, etc.), they all involve various underlying knowledge. The learning curve is steep and the barrier to entry is relatively high, which has led to a shortage of audio-video talent in major companies. If you are interested in audio-video development, I highly recommend you give it a try. Personally, I have very high expectations for the audio-video development field.

Of course, expertise in audio and video development is built up through countless rounds of “troubleshooting.” Now let’s take a look at Junjie’s understanding and thoughts on audio and video.

Hello everyone, I’m Zhou Junjie, currently working on Android multimedia development at WeChat. At the invitation of Shaowen, I would like to share some of my insights and experiences in Android audio-video development.

Whether as developers or users, we encounter all kinds of short-video and live-streaming apps every day, which makes audio and video development increasingly important. For most Android developers, however, it is still a relatively niche direction. Although the number of developers who go deep into this field is still small, the knowledge it involves is by no means trivial.

Basics of Audio and Video #

1. Concepts Related to Audio and Video

When it comes to audio and video, let’s start with video formats that are both familiar and unfamiliar to us.

The most common video format for us is MP4, a universal container format. A container format means that it only carries the actual data streams; a video file therefore contains at least a video track and an audio track, and each track has its own encoding format. Common video track and audio track formats include:

  • Video track: H.265 (HEVC) and H.264. Most Android phones support direct hardware encoding and decoding of H.264; for H.265, devices running Android 5.0 and above support hardware decoding, but only some high-end chips, such as Qualcomm’s 8xx series and Huawei’s 98x series, support hardware encoding. For the video track, the larger the resolution, the higher the performance cost and the longer the encoding takes.

  • Audio track: AAC. This is an audio encoding format with a long history; most Android phones can hardware encode and decode it directly, with almost no compatibility issues. It is fair to say that AAC is already very mature as an audio track format for video.

As for the encoding itself, all the formats mentioned above are lossy, so we need a measure of how much data remains after compression: the bit rate. Under the same compression format, the higher the bit rate, the better the quality. For example, one minute of video encoded at 4 Mbps takes roughly 30 MB, regardless of resolution. For more information about the encoding and decoding formats supported by Android itself, you can refer to the official documentation.

In summary, to shoot an MP4 video, we need to encode the video track and the audio track separately, and then mux them into an MP4 file as its data streams.

2. Process of Audio and Video Encoding

Next, let’s take a look at how a video is captured. First of all, since it is a capture, it involves interacting with a camera and a microphone. In terms of the process, taking H.264/AAC encoding as an example, the overall process of recording a video is as follows:

We collect data from the camera and the recording device, feed it into the encoders to encode the video track and the audio track separately, and then feed the encoded streams into a muxer (such as MediaMuxer, or processing libraries like mp4v2 and FFmpeg), which outputs the final MP4 file. Next, I will focus mainly on the video track and walk through the encoding process.

The simplest way to record a whole video is to use the system’s MediaRecorder directly, which outputs an MP4 file by itself. However, this interface is not very customizable. For example, if we want to record a square video, we can only achieve that through post-processing or various hacks, unless the camera itself supports resolutions with equal width and height. In practice, unless the quality requirements are not very high, MediaRecorder is generally not used directly.
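For reference, here is a minimal MediaRecorder sketch in Kotlin. Permission handling and camera/preview setup are omitted, and outputPath is a placeholder; on some devices you also need to set a preview display and, if you manage the Camera object yourself, call camera.unlock() plus setCamera().

```kotlin
import android.media.MediaRecorder

// Minimal sketch: record an MP4 (H.264 + AAC) with the system MediaRecorder.
// Assumes camera/audio permissions are already granted.
fun startSimpleRecording(outputPath: String): MediaRecorder {
    val recorder = MediaRecorder()
    recorder.setAudioSource(MediaRecorder.AudioSource.MIC)
    recorder.setVideoSource(MediaRecorder.VideoSource.CAMERA)
    recorder.setOutputFormat(MediaRecorder.OutputFormat.MPEG_4)
    recorder.setAudioEncoder(MediaRecorder.AudioEncoder.AAC)
    recorder.setVideoEncoder(MediaRecorder.VideoEncoder.H264)
    recorder.setVideoSize(1280, 720)             // must be a size the camera supports
    recorder.setVideoEncodingBitRate(4_000_000)  // 4 Mbps
    recorder.setVideoFrameRate(30)
    recorder.setOutputFile(outputPath)
    recorder.prepare()
    recorder.start()
    return recorder
}

// Later: recorder.stop(); recorder.release()
```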

The processing of video tracks is a relatively complex part of video recording. The input source is the data from the camera, and the output is encoded H.264/H.265 data. I will now introduce two processing models.

The first method is to use the Camera’s API to obtain the raw data output from the camera (e.g., onPreviewFrame). After preprocessing, such as scaling and cropping, the data is sent to the encoder to output H.264/H.265.
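A minimal sketch of the capture side of this model, using the legacy android.hardware.Camera API (deprecated, but the one this flow is usually described with); encodeFrame is a hypothetical callback standing in for the preprocessing and encoding steps.

```kotlin
import android.hardware.Camera

// Legacy Camera API: receive raw preview frames and hand them to the encoder pipeline.
// Assumes a preview target (setPreviewTexture/Display) has already been set.
@Suppress("DEPRECATION")
fun startPreviewCapture(camera: Camera, encodeFrame: (ByteArray, Int, Int) -> Unit) {
    val size = camera.parameters.previewSize
    camera.setPreviewCallback { data, _ ->
        // data is in the camera's preview format (NV21 by default);
        // scale/crop/convert it (e.g. with libyuv) before feeding the encoder.
        encodeFrame(data, size.width, size.height)
    }
    camera.startPreview()
}
```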

The raw data output by the camera is in NV21, one of the YUV color formats. Compared with RGB, YUV data takes up less space and is widely used in video encoding.

Because the NV21 frames output by the camera usually do not match the final video size, and the encoder often requires a different YUV format as input (generally YUV420P), various operations such as scaling and cropping are needed after obtaining the NV21 data. Libraries such as FFmpeg and libyuv are commonly used for this YUV processing.

Finally, the data is sent to the encoder. For the video encoder, we can choose the system’s MediaCodec to take advantage of the phone’s hardware encoding. However, when the output size must be kept small, the bit rate used may be low, and at low bit rates the quality produced by most hardware encoders is relatively poor. The other common option is x264, which produces better quality at the same bit rate but is much slower than hardware encoders. It is therefore advisable to choose according to the specific scenario.
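A hedged sketch of configuring MediaCodec as an H.264 hardware encoder for this raw-buffer model; the input color format a given codec actually accepts varies by device, which is one of the compatibility pitfalls of this path.

```kotlin
import android.media.MediaCodec
import android.media.MediaCodecInfo
import android.media.MediaFormat

// Configure a hardware H.264 encoder that takes YUV buffers as input.
fun createAvcEncoder(width: Int, height: Int, bitRate: Int): MediaCodec {
    val format = MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_AVC, width, height).apply {
        setInteger(MediaFormat.KEY_COLOR_FORMAT,
            MediaCodecInfo.CodecCapabilities.COLOR_FormatYUV420Flexible)
        setInteger(MediaFormat.KEY_BIT_RATE, bitRate)     // higher bit rate -> better quality
        setInteger(MediaFormat.KEY_FRAME_RATE, 30)
        setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 1)   // one key frame per second (GOP length)
    }
    return MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_VIDEO_AVC).apply {
        configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
        start()
    }
}
```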

In addition to directly processing the raw data from the camera, there is another common processing model that utilizes a Surface as the input source for the encoder.

To preview the Android camera, you need to set a Surface/SurfaceTexture as the output of the camera’s preview data. Starting from API 18, MediaCodec allows creating an input Surface through createInputSurface to serve as the encoder’s input; the camera preview content is then rendered onto this InputSurface of MediaCodec.
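A minimal sketch of the Surface-input setup; the format is the same as in the buffer-input case except that KEY_COLOR_FORMAT is COLOR_FormatSurface, and createInputSurface must be called after configure and before start.

```kotlin
import android.media.MediaCodec
import android.media.MediaFormat
import android.view.Surface

// Surface-input model: the encoder consumes whatever is rendered onto the Surface
// returned by createInputSurface(). format is assumed to use COLOR_FormatSurface.
fun createSurfaceInputEncoder(format: MediaFormat): Pair<MediaCodec, Surface> {
    val codec = MediaCodec.createEncoderByType(format.getString(MediaFormat.KEY_MIME)!!)
    codec.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
    val inputSurface = codec.createInputSurface()  // must sit between configure() and start()
    codec.start()
    return codec to inputSurface
}
```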

When it comes to encoder selection, even though the InputSurface is created through MediaCodec and it looks as if only MediaCodec can be used for encoding, x264 is still an option: create an OpenGL context on the preview Surface, read back everything that is drawn with glReadPixels, convert it to YUV, and feed it to x264 (and in a GLES 3.0 environment, a PBO can be used to speed up glReadPixels).
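For reference, reading the drawn frame back on the GL thread looks roughly like this (the GLES 3.0 PBO variant is omitted):

```kotlin
import android.opengl.GLES20
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Read back the current framebuffer as RGBA; must be called on the GL thread after drawing.
// The RGBA buffer would then be converted to YUV (e.g. via libyuv) and fed to x264.
// On GLES 3.0, a PBO can make this read-back asynchronous.
fun readFrame(width: Int, height: Int): ByteBuffer {
    val buffer = ByteBuffer.allocateDirect(width * height * 4).order(ByteOrder.nativeOrder())
    GLES20.glReadPixels(0, 0, width, height, GLES20.GL_RGBA, GLES20.GL_UNSIGNED_BYTE, buffer)
    buffer.rewind()
    return buffer
}
```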

Since we create an OpenGL context here, for current video apps, various filters and beauty effects can also be implemented based on OpenGL.

Regarding the specific implementation code for recording videos using this approach, you can refer to the example in Grafika.

Video Processing #

1. Video Editing

In current video apps, you can find various video cropping and editing features, such as:

  • Cropping a part of the video.

  • Concatenating multiple videos.

For video cropping and concatenation, Android provides MediaExtractor. Combined with its seekTo interface and the frame-reading interface readSampleData, it lets us obtain the frame data at a given timestamp. The data obtained this way is already encoded, so there is no need to re-encode it; it can be fed directly into the muxer and re-assembled into an MP4.

We only need to seek to the timestamp in the original video where the crop should start, then keep reading sample data and feeding it into MediaMuxer. This is the simplest implementation of video cropping.
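A sketch of this key-frame-aligned cropping, handling only the video track for brevity (a real implementation treats the audio track the same way, and sizes the read buffer from the track’s max input size):

```kotlin
import android.media.MediaCodec
import android.media.MediaExtractor
import android.media.MediaFormat
import android.media.MediaMuxer
import java.nio.ByteBuffer

// Crop [startUs, endUs] out of inPath without re-encoding (accuracy limited to key frames).
fun cropWithoutReencode(inPath: String, outPath: String, startUs: Long, endUs: Long) {
    val extractor = MediaExtractor().apply { setDataSource(inPath) }
    val videoTrack = (0 until extractor.trackCount).first {
        extractor.getTrackFormat(it).getString(MediaFormat.KEY_MIME)!!.startsWith("video/")
    }
    extractor.selectTrack(videoTrack)
    extractor.seekTo(startUs, MediaExtractor.SEEK_TO_PREVIOUS_SYNC)  // lands on the key frame before startUs

    val muxer = MediaMuxer(outPath, MediaMuxer.OutputFormat.MUXER_OUTPUT_MPEG_4)
    val outTrack = muxer.addTrack(extractor.getTrackFormat(videoTrack))
    muxer.start()

    val buffer = ByteBuffer.allocate(1 shl 20)   // 1 MB; adjust to the track's max input size
    val info = MediaCodec.BufferInfo()
    val baseUs = extractor.sampleTime            // the crop actually starts at this key frame
    while (true) {
        val size = extractor.readSampleData(buffer, 0)
        if (size < 0 || extractor.sampleTime > endUs) break
        info.set(0, size, extractor.sampleTime - baseUs, extractor.sampleFlags)
        muxer.writeSampleData(outTrack, buffer, info)
        extractor.advance()
    }
    muxer.stop(); muxer.release(); extractor.release()
}
```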

However, in practice we find that seekTo does not land exactly on every timestamp. Say we have a video about 4 minutes long and want to seek to around the 2-minute mark and read data from there. When we actually call seekTo with the 2-minute position and then read from MediaExtractor, the data we get may start slightly before or after 2 minutes. This is because MediaExtractor can only seek to the position of a key frame, and there is not necessarily a key frame exactly where we want one. The problem comes back to video encoding: there is a certain interval between two key frames.

Key frames are called I-frames; they can be thought of as self-contained frames that do not depend on other frames to be decoded. Between two key frames there are compressed frames such as B-frames and P-frames, which need other frames to reconstruct a complete picture. The stretch from one key frame to the next is called a GOP (Group of Pictures). MediaExtractor cannot seek to a frame inside a GOP, because it does not decode and can only jump to the key frames before or after; and if the GOP is large, editing becomes very imprecise (in fact, some phone ROMs modify the implementation so that their MediaExtractor can seek accurately).

In this case, to achieve precise cropping, we have to rely on the decoder. The decoder itself can decode the contents of all frames, so after introducing frame decoding, the entire cropping process becomes as follows.

We need to first seek to the key frame before the desired position and then input it into the decoder. After decoding a frame, we check whether the PTS (Presentation Timestamp) of the current frame is within the desired timestamp range. If so, we input the data into the encoder, re-encode it to obtain H.264 video track data again, and then merge it into an MP4 file.
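A condensed sketch of that PTS check; decoder and encoder setup, the input feed loop, and robust end-of-stream handling are omitted, and feedToEncoder is a hypothetical callback that copies or renders the kept frame into the encoder.

```kotlin
import android.media.MediaCodec

// Drain decoded frames and keep only those whose PTS falls inside the desired range.
// The decoder is assumed to have been fed starting from the preceding key frame.
fun drainDecodedFrames(
    decoder: MediaCodec,
    startUs: Long,
    endUs: Long,
    feedToEncoder: (MediaCodec.BufferInfo, Int) -> Unit
) {
    val info = MediaCodec.BufferInfo()
    while (true) {
        val index = decoder.dequeueOutputBuffer(info, 10_000)
        if (index < 0) break  // no output yet / format changed; a real loop keeps draining
        val inRange = info.presentationTimeUs in startUs..endUs
        if (inRange) {
            feedToEncoder(info, index)  // hypothetical: copy or render this frame before release
        }
        decoder.releaseOutputBuffer(index, /* render = */ inRange)
        if (info.flags and MediaCodec.BUFFER_FLAG_END_OF_STREAM != 0) break
    }
}
```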

The above is the basic video cropping process. For video concatenation, we similarly obtain multiple segments of H.264 data and feed them into the muxer together.

In addition, real video editing adds many effects and filters. In the shooting scenario above we used a Surface as the input source of MediaCodec and created an OpenGL context on it. When MediaCodec is used as a decoder, a Surface can likewise be specified as the decoding output during configure. Most video effects can be implemented with OpenGL, so the general pipeline for applying effects is as follows.

We hand the decoded frames over to OpenGL for rendering, and then output the result to the encoder’s InputSurface to complete the encoding pipeline.
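A minimal sketch of that wiring, assuming oesTextureId is a GL_TEXTURE_EXTERNAL_OES texture created on our own OpenGL context; each decoded frame arriving on the SurfaceTexture can then be redrawn with effect shaders and pushed to the encoder’s InputSurface via EGL.

```kotlin
import android.graphics.SurfaceTexture
import android.media.MediaCodec
import android.media.MediaFormat
import android.view.Surface

// Configure a decoder to render into a SurfaceTexture we own, so decoded frames
// land in our GL context and can be re-rendered with filter shaders.
fun createSurfaceDecoder(format: MediaFormat, oesTextureId: Int): Pair<MediaCodec, SurfaceTexture> {
    val surfaceTexture = SurfaceTexture(oesTextureId)
    val outputSurface = Surface(surfaceTexture)
    val decoder = MediaCodec.createDecoderByType(format.getString(MediaFormat.KEY_MIME)!!)
    decoder.configure(format, outputSurface, null, 0)  // decode directly onto the Surface
    decoder.start()
    return decoder to surfaceTexture
}
```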

2. Video Playback

Every video app involves playback; recording, editing, and playback together make up the complete video experience. The simplest way to play an MP4 file is to use the system’s MediaPlayer directly: with just a few lines of code you can play the video (a minimal sketch follows the list below). For local video playback this is the simplest implementation. In reality, however, we often have more complex requirements:

  • The videos that need to be played may not be stored locally. Many of them may be online videos, requiring downloading and playing at the same time.

  • The videos to be played may be part of video editing, requiring real-time preview of video effects during editing.
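Before looking at these two scenarios, here is the “few lines of code” local playback mentioned above, as a Kotlin sketch; holder is assumed to be the SurfaceHolder of the view the video is drawn into.

```kotlin
import android.media.MediaPlayer
import android.view.SurfaceHolder

// Simplest local MP4 playback with the system MediaPlayer.
fun playLocalVideo(path: String, holder: SurfaceHolder): MediaPlayer =
    MediaPlayer().apply {
        setDataSource(path)
        setDisplay(holder)   // the Surface must already be created
        prepare()            // or prepareAsync() with an OnPreparedListener
        start()
    }
```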

For the second scenario, we can simply configure the view for playing videos as a GLSurfaceView. With the OpenGL environment, we can implement various effects and filters on it. For common playback configurations in video editing, such as fast-forward and rewind, MediaPlayer also provides direct interfaces to set them.

The first scenario is more common, such as a video streaming interface, where most videos are online videos. Although MediaPlayer can also play online videos, there are two problems in actual use:

  • When MediaPlayer plays a video from a URL, the downloaded data is stored in a location the app cannot access directly. This means we cannot preload videos, nor reuse the data that has already been played and buffered.

  • As with using MediaExtractor for video editing, MediaPlayer also cannot accurately seek, and can only seek to key frames.

For the first problem, we can solve it by proxying the video URL: a local HTTP server downloads the video to a location we control. Mature open-source projects in the community already do this, such as AndroidVideoCache.
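A hedged sketch of how such a proxy is typically wired up, using AndroidVideoCache’s HttpProxyCacheServer (class and method names as documented by that project):

```kotlin
import android.content.Context
import android.media.MediaPlayer
import com.danikula.videocache.HttpProxyCacheServer

// Play an online video through a local caching proxy, so the downloaded data
// can be reused for pre-loading and repeated playback.
fun playThroughCache(context: Context, videoUrl: String): MediaPlayer {
    val proxy = HttpProxyCacheServer(context.applicationContext)  // usually kept as an app-wide singleton
    val proxyUrl = proxy.getProxyUrl(videoUrl)                    // e.g. http://127.0.0.1:port/...
    return MediaPlayer().apply {
        setDataSource(proxyUrl)
        setOnPreparedListener { it.start() }
        prepareAsync()
    }
}
```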

As for the second problem, the inability to seek accurately can be fatal for some apps, and the product may not accept such a user experience. In that case we have to implement the player ourselves on top of MediaCodec, just as in video editing, which is more complex. Of course, you can also use Google’s open-source ExoPlayer directly, which is simple and fast and also supports online video URLs.
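A minimal ExoPlayer sketch for the online case; note that the MediaItem-based calls below assume a relatively recent ExoPlayer 2.x release (older versions build a MediaSource and pass it to prepare instead).

```kotlin
import android.content.Context
import com.google.android.exoplayer2.MediaItem
import com.google.android.exoplayer2.SimpleExoPlayer

// Online playback with ExoPlayer; attach the player to a PlayerView
// in the layout via playerView.player = player.
fun createOnlinePlayer(context: Context, videoUrl: String): SimpleExoPlayer {
    val player = SimpleExoPlayer.Builder(context).build()
    player.setMediaItem(MediaItem.fromUri(videoUrl))
    player.prepare()
    player.playWhenReady = true
    return player
}
```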

It seems that there are solutions to all problems, so everything should be fine, right?

The most common format for streaming and downloading is still MP4. However, for some videos uploaded directly to a server, both MediaPlayer and ExoPlayer seem able to start playback only after the whole file has been downloaded, so the download-while-playing experience is lost. The cause lies in the MP4 format itself, specifically in the “moov” box of the MP4 file.

In the MP4 format, there is a box called “moov” that stores the metadata of the file, including the formats of the audio and video tracks, the video duration, the playback rate, the offsets of the key frames in the video track, and other important information. When an MP4 file is played online, decoding the audio and video tracks requires the information in “moov”.

The reason for the aforementioned problem is that when “moov” is at the end of the MP4 file, the player does not have enough information for decoding, so the video can only be decoded and played after it is completely downloaded. Therefore, to achieve downloading and playing at the same time for MP4 files, “moov” needs to be placed at the beginning of the file. Currently, there are mature tools in the industry that can do this, such as FFmpeg and mp4v2. For example, using FFmpeg, the command is as follows:

ffmpeg -i input.mp4 -movflags faststart -acodec copy -vcodec copy output.mp4

With the -movflags faststart option, we can move the “moov” in the video file to the beginning.

In addition, if you want to check whether the “moov” of an MP4 is placed at the beginning, you can use tools similar to AtomicParsley to do so.

In the practice of video playback, besides MP4, there are many other formats used for downloading and playing at the same time, such as m3u8 and FLV. Common implementations in the client side include ijkplayer and ExoPlayer. Interested students can refer to their implementations.

The Journey of Learning Audio and Video Development #

Audio and video development covers a wide range of areas, and today I have only sketched the basics. If you want to go further in this field, then based on my own learning experience, in addition to fundamental Android development knowledge, you will need to dig into the following technology stack.

Languages

  • C/C++: As audio and video development often involves dealing with underlying code, mastering C/C++ is a necessary skill. There are many resources available on this topic, and I believe we can all find them.

  • ARM NEON Assembly: This is an advanced skill used in video encoding/decoding and various frame processing. Many processes are accelerated using NEON Assembly, such as FFmpeg/libyuv. Although it is not a required skill, it is worth getting acquainted with. You can refer to this tutorial in the ARM Community for more information.

Frameworks

  • FFmpeg: It can be said that FFmpeg is the most famous audio and video processing framework in the industry, encompassing almost all processes of audio and video development. It is a must-have skill.

  • libyuv: This is a YUV frame processing library open-sourced by Google. Since camera output and encoder/decoder input/output are based on YUV formats, this library is often used to manipulate that data. FFmpeg provides a similar implementation in libswscale, but libyuv generally performs better thanks to its NEON assembly acceleration.

  • libx264/libx265: These are currently the most widely used H.264/H.265 software encoders in the industry. Although hardware encoding is available on mobile platforms, software encoding is still preferred in many cases for compatibility or image-quality reasons, especially on low-end Android devices and in low-bit-rate scenarios.

  • OpenGL ES: Currently, most video effects and beauty algorithms are implemented using GLES for rendering. Therefore, to delve into audio and video development, knowledge of GLES is essential. In addition to GLES, Vulkan is a higher-performance graphics API that has emerged in recent years, but its usage is not yet widespread.

  • ExoPlayer/ijkplayer: For a complete video app, video playback is essential. These two libraries are currently the most commonly used video players in the industry, supporting various formats and protocols. If you want to delve into video playback processing, they are almost essential skills.

Depending on practical needs, we can go deeper along the following two paths on top of the technology stack above.

1. Development of Video Effects

There are more and more live-streaming and short-video apps, and almost all of their effects are implemented with OpenGL. For simple effects, a color lookup table (LUT) can be used: a LUT texture is supplied to the shader, which remaps each pixel’s color through it. If you want to learn more complex filters, I recommend studying and referring to shadertoy, where you can find numerous shader examples.
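As a concrete illustration of the LUT idea, here is a sketch of the fragment shader at the core of such a filter, assuming a 512×512 LUT image laid out as an 8×8 grid of 64×64 tiles (the layout used by many open-source filter projects); for simplicity it samples only the nearest tile instead of interpolating between the two adjacent ones.

```kotlin
// Fragment shader for a color lookup table (LUT) filter.
// uTexture is the video frame; uLut is a 512x512 LUT: an 8x8 grid of 64x64 tiles,
// where the blue channel picks the tile and (r, g) index within that tile.
const val LUT_FRAGMENT_SHADER = """
    precision mediump float;
    varying vec2 vTexCoord;
    uniform sampler2D uTexture;
    uniform sampler2D uLut;

    void main() {
        vec4 color = texture2D(uTexture, vTexCoord);
        float blue = floor(color.b * 63.0);                   // which of the 64 tiles
        vec2 tile = vec2(mod(blue, 8.0), floor(blue / 8.0));  // tile position in the 8x8 grid
        vec2 inTile = color.rg * (63.0 / 64.0) + 0.5 / 64.0;  // stay inside the tile's texels
        gl_FragColor = vec4(texture2D(uLut, (tile + inTile) / 8.0).rgb, color.a);
    }
"""
```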

For beauty and makeup effects, especially makeup, facial landmark detection is needed to obtain key points; the face region is then triangulated, and in the shader the areas around the corresponding key points are enlarged or shifted. If you want to go deep into video effects development, I recommend learning OpenGL thoroughly, as there are many optimization points involved.

2. Video Encoding and Compression Algorithms

H.264/H.265 are very mature video encoding standards. To minimize video size while maintaining video quality in order to save bandwidth, a deep understanding of these video encoding standards is required. This may be a relatively high barrier. I am also in the learning stage, so interested students can read documentation related to encoding standards.

You are welcome to click on “Share with a friend” to share today’s content with your friends and invite them to learn together. I have also prepared a generous “Learning Booster Package” for students who think deeply and share actively. I look forward to learning and making progress together with you.