04 Creating Container Images: How to Write an Effective Dockerfile #

Hello, this is Chrono.

In the last lesson, we learned about containerized applications, which are packaged as images, and how to run and manage them using various Docker commands.

This raises a question: How are these images created? Can we create our own images?

So today, I will explain the internal mechanism of images and teach you the proper and efficient way to write a Dockerfile to create container images.

What is the internal mechanism of an image? #

As you may know now, an image is a package file that contains an application and the environment it depends on to run, such as the file system, environment variables, configuration parameters, and so on.

Environment variables and configuration parameters are relatively simple to manage with a manifest file, but the file system is the real challenge. To ensure consistency in the container runtime environment, the image must include the root file system, also known as rootfs, of the operating system where the application resides.

Although these files do not include the system kernel (containers share the host machine's kernel), if every image repeated this packaging, there would still be a great deal of redundancy. Imagine a thousand images all based on Ubuntu: the Ubuntu root directory would be duplicated a thousand times, which is wasteful in both disk storage and network transmission.

Naturally, we would think about extracting the repeated parts and only storing one copy of the Ubuntu root directory file, then allowing these thousand images to share this data in some way.

This idea is a major innovation of container images: layering.

The internal structure of a container image is not flat; it is composed of many image layers. Each layer is a read-only set of files that cannot be modified. Identical layers can be shared between images, and multiple layers are stacked together like building blocks. A technique called "Union FS (union file system)" then merges them into the file system that the container ultimately sees.

(Figure: read-only image layers stacked and merged by Union FS into the container's file system)

Let me use the familiar example of a thousand-layer cake (mille-feuille) to illustrate this.

A thousand-layer cake is also made up of many layers stacked together. Looking down from the top, you can see the raisins, walnuts, and almonds embedded in each layer. Each layer of the cake is like a layer in an image, and the dried fruits are like the files in each layer. However, if the same position in two layers holds the same kind of dried fruit, that is, files with the same name, you can only see the file in the upper layer; the one in the lower layer is hidden.
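This "upper layer hides the lower layer" behavior can be sketched with ordinary directories. The following is only a toy illustration, not a real union mount: two directories stand in for layers, and a lookup checks the upper layer before the lower one:

```shell
#!/bin/sh
# Two "layers" as plain directories (a toy model, not a real Union FS)
mkdir -p /tmp/layers/lower /tmp/layers/upper
echo "from lower" > /tmp/layers/lower/a.txt
echo "from upper" > /tmp/layers/upper/a.txt
echo "only lower" > /tmp/layers/lower/b.txt

# Resolve a file name the way a union mount would:
# the first (upper) layer that has the file wins
lookup() {
    for layer in /tmp/layers/upper /tmp/layers/lower; do
        if [ -f "$layer/$1" ]; then
            cat "$layer/$1"
            return
        fi
    done
}

lookup a.txt   # the upper layer's copy hides the lower one
lookup b.txt   # falls through to the lower layer
```

Real container runtimes implement this with kernel drivers such as overlay2, but the lookup rule is the same idea.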

You can use the command docker inspect to view the layering information of an image. For example, the nginx:alpine image:

docker inspect nginx:alpine

The layering information is in the "RootFS" section of the output:

(Screenshot: docker inspect output showing the "RootFS" layer digests)

From this output, you can see that the nginx:alpine image has a total of 6 layers.

I believe you now understand those "strange" output messages from commands like docker pull and docker rmi: they were the individual layers of the image. Docker checks for duplicate layers; if a layer already exists locally, it is not downloaded again, and a layer shared by other images is not deleted. This saves both disk space and network bandwidth.

What is a Dockerfile? #

Now that we know the internal structure and basic principles of container images, we can learn how to create our own container images, which means packaging our own applications.

As we mentioned when we talked about containers before, containers are like “small houses” and images are like “showrooms”. To create this “showroom”, we need a “blueprint” that specifies how to build the foundation, install utilities, and perform other actions. This “blueprint” is called a “Dockerfile”.

Compared to containers and images, a Dockerfile is very ordinary: it is a plain text file that contains a series of build instructions, such as selecting a base image, copying files, and running scripts. Each instruction generates a layer, and Docker executes all the steps in the file sequentially to create a new image.

Let’s take a look at a simple Dockerfile example:

# Dockerfile.busybox
# Select a base image
FROM busybox
# Default command to run when starting the container
CMD echo "hello world"

This file contains only two instructions.

The first instruction is FROM, which begins every Dockerfile (only ARG declarations may come before it), and it indicates the base image to build on, like "laying the foundation". Here, we are using busybox as the base image.

The second instruction is CMD, which specifies the default command to run when starting the container with docker run. Here, we are using the echo command to output the string “hello world”.
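A side note on CMD: the example uses the shell form, where Docker wraps the command in /bin/sh -c. A Dockerfile also supports an exec form, written as a JSON array, which runs the program directly without a shell; official images generally prefer it. Both variants below do the same thing (if a Dockerfile contains several CMD instructions, only the last one takes effect):

```dockerfile
# Shell form: Docker actually runs /bin/sh -c 'echo "hello world"'
CMD echo "hello world"

# Exec form (JSON array): runs the echo program directly, no shell
CMD ["echo", "hello world"]
```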

Now that we have this “blueprint” called Dockerfile, we can bring in the “construction team” and use the docker build command to create the image:

docker build -f Dockerfile.busybox .

You need to pay attention to the command format. Use the -f parameter to specify the Dockerfile's filename; the final argument is a directory path called the "build context". Here, we simply use a dot to represent the current directory.

Next, you will see Docker reading and executing the instructions in the Dockerfile line by line, creating image layers one by one, and then generating the complete image.

The new image does not have a name yet (you will see <none> when using docker images), but we can use the “IMAGE ID” directly to view or run it:

docker inspect b61
docker run b61

How to write correct and efficient Dockerfiles #

After getting a general understanding of Dockerfile, let me talk about some common instructions and best practices for writing Dockerfiles, which will help you write and use them effectively in your future work.

First, the choice of base image is crucial, because the build must start from one: the first instruction in a Dockerfile must be FROM (only ARG may appear before it). If you are concerned about image size and security, you generally choose Alpine; if you prioritize application stability, you might choose Ubuntu, Debian, or CentOS.

# Choose an Alpine image
FROM alpine:3.15

# Choose an Ubuntu image
FROM ubuntu:bionic

During local development and testing, some source code, configuration files, and so on need to be packaged into the image. For this you can use the COPY instruction, which is similar to the Linux cp command. However, the source files must be within the "build context" path and cannot be specified arbitrarily. This means that if you want to copy local files into the image, you must place them in a dedicated directory and point the "build context" of docker build at that directory.

Here are two examples of the COPY command for your reference:

# Copy a.txt from the build context to /tmp/a.txt in the image
COPY ./a.txt  /tmp/a.txt

# Error! Cannot use files outside the build context
COPY /etc/hosts  /tmp
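One COPY detail that often surprises people, per Docker's documented behavior: when the source is a directory, only its contents are copied, not the directory itself. The ./src directory and config.json below are hypothetical names for illustration:

```dockerfile
# Copies the *contents* of ./src into /app/src; the src directory
# itself is not nested inside the destination
COPY ./src/ /app/src/

# A trailing slash on the destination marks it as a directory
COPY ./config.json /app/
```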

Next, let’s talk about the most important instruction in a Dockerfile, which is RUN. It can execute any shell command, such as updating the system, installing applications, downloading files, creating directories, compiling programs, etc., enabling arbitrary steps in the image building process, making it very flexible.

RUN is usually the most complex instruction in a Dockerfile and may contain many shell commands. However, each Dockerfile instruction must logically be a single line, so long RUN instructions place the line-continuation character \ at the end of each line and join the commands with && so that they form a single logical line. Here is an example:

RUN apt-get update \
    && apt-get install -y \
        build-essential \
        curl \
        make \
        unzip \
    && cd /tmp \
    && curl -fSL xxx.tar.gz -o xxx.tar.gz \
    && tar xzf xxx.tar.gz \
    && cd xxx \
    && ./config \
    && make \
    && make clean

Sometimes, writing such long RUN instructions in a Dockerfile is not visually appealing, and it can be cumbersome to rebuild the image every time there is an error during debugging. In this case, you can use a workaround: consolidate these shell commands into a script file, copy it into the image using the COPY command, and then execute it using RUN:

# Copy the script into the image's /tmp directory,
# add execute permission, run it, and then delete it
COPY setup.sh  /tmp/

RUN cd /tmp && chmod +x setup.sh \
    && ./setup.sh && rm setup.sh
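For completeness, here is what a minimal setup.sh might look like. The script name, paths, and commands are purely illustrative; the point is that the messy steps live in one file that can be tested outside the image:

```shell
#!/bin/sh
# Hypothetical setup.sh: bundles steps that would otherwise bloat
# a single RUN instruction
set -e                                # abort on the first failing command

APP_DIR=/tmp/demo-app                 # illustrative install directory
mkdir -p "$APP_DIR"
echo "installed" > "$APP_DIR/status"  # stand-in for real install work
cat "$APP_DIR/status"
```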

The RUN instruction is essentially shell programming. If you are familiar with shell scripting, you know about variables, and the same concept applies in a Dockerfile through two instructions: ARG and ENV.

The difference between them is that the variable created using ARG is only visible during the image building process and not in the running container. On the other hand, the variable created using ENV can be used both during image building and in the running container as an environment variable.

Here is a simple example using ARG to define the base image name (which can be used in the “FROM” instruction) and using ENV to define two environment variables:

ARG IMAGE_BASE="node"
ARG IMAGE_TAG="alpine"

ENV PATH=$PATH:/tmp
ENV DEBUG=OFF
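To make the build-time versus run-time difference concrete, here is a small sketch with made-up variable names. During docker build both variables have values, but when the container starts, BUILD_MSG expands to an empty string while RUN_MSG survives as an environment variable:

```dockerfile
FROM alpine:3.15

# Visible only while the image is being built
ARG BUILD_MSG="from-build"

# Visible during the build and inside the running container
ENV RUN_MSG="from-env"

# Both variables have values at this point
RUN echo "build: $BUILD_MSG, run: $RUN_MSG"

# At container startup, only RUN_MSG still has a value
CMD echo "build: $BUILD_MSG, run: $RUN_MSG"
```

An ARG value can also be overridden at build time with docker build --build-arg BUILD_MSG=other-value.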

Another important instruction is EXPOSE, which declares the ports on which the container's service listens. Note that EXPOSE is documentation only; it does not actually publish the ports (that is done with the -p option of docker run). It is particularly useful for microservice systems based on Node.js, Tomcat, Nginx, Go, etc.:

# TCP by default
EXPOSE 443

# UDP can also be specified
EXPOSE 53/udp

After discussing these Dockerfile instructions, I want to emphasize that you should avoid overusing instructions in Dockerfile because each instruction creates a layer in the image. It’s better to keep them concise and merge them whenever possible, as too many layers can make the image bloated.

How does docker build work #

The Dockerfile needs to be built using docker build in order to take effect, so let’s take a closer look at how docker build is used.

Were you confused by the term “build context” when building the image just now? What does it actually mean?

I think it will be clearer to understand using the official Docker architecture diagram (note the dotted line associated with “docker build” in the diagram).

Because the docker command line is just a thin client, the actual image-building work is done by the server-side Docker daemon. The client therefore packages and uploads the "build context" directory (you will see the message "Sending build context to Docker daemon"), so that the server can access those local files.

(Figure: Docker architecture diagram; the docker client sends the build context to the Docker daemon)

Once you understand this, you will see that the "build context" is not directly tied to the Dockerfile; it simply specifies the dependency files to be packaged into the image. The COPY instruction can only use relative paths within the "build context" because the Docker daemon cannot see your local environment; it sees only the files that were packaged and uploaded.

However, this mechanism can also cause trouble: Docker packages and uploads every file in the directory (readme, .git, .svn, and so on), even files that will never be copied into the image, which is inefficient.

To avoid this problem, you can create a .dockerignore file in the “build context” directory, which is similar in syntax to .gitignore and excludes the files that are not needed.

Below is a simple example that excludes files with the extensions “swp” and “sh” from being packaged and uploaded:

# .dockerignore
*.swp
*.sh
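.dockerignore also supports comment lines and ! negation patterns, much like .gitignore. The entries below are illustrative:

```text
# .dockerignore
.git
node_modules
*.swp
*.md
# re-include README.md even though *.md above excludes it
!README.md
```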

Also, regarding the Dockerfile itself, it is generally recommended to specify it explicitly with the -f flag. If this parameter is omitted, docker build looks for a file named Dockerfile in the current directory, so if there is only one build target, it is easiest to simply name the file "Dockerfile".

Now there shouldn’t be any difficulties using docker build, but the resulting image only has an “IMAGE ID” and no name, which is not very convenient.

To solve this, you can add the -t parameter, which specifies the image's name and tag. Docker will then automatically name the image after the build completes. Of course, the name must comply with the naming conventions mentioned in the previous lesson, with the name and tag separated by a colon (:). If no tag is provided, the default is "latest".

Summary #

Alright, today we learned about the internal structure of container images, with the key understanding that container images are composed of multiple read-only layers, and the same layer can be shared by different images, reducing storage and transmission costs.

The part about how to write a Dockerfile had a bit more content, so let me summarize it briefly:

  1. Creating an image requires writing a Dockerfile, clearly specifying the steps to create the image, with each instruction generating a layer.
  2. In the Dockerfile, the first instruction must be FROM to select the base image, commonly used ones include Alpine, Ubuntu, etc. Other commonly used instructions are: COPY, RUN, EXPOSE, which are respectively used to copy files, execute shell commands, and declare the service port number.
  3. docker build needs to use -f to specify the Dockerfile. If not specified, it will use the file named “Dockerfile” in the current directory.
  4. docker build needs to specify the “build context”, and the files in it will be packed and uploaded to the Docker daemon, so try not to put unnecessary files in the “build context”.
  5. When creating an image, it is recommended to use the -t parameter and give the image a meaningful name for easy management.

We covered quite a bit today, but there are still many advanced techniques waiting for you to explore in image creation, such as using caching, multi-stage builds, etc. You can refer to the Docker official documentation (https://docs.docker.com/engine/reference/builder/) or study the images of well-known applications (such as Nginx, Redis, Node.js, etc.) for further learning.

Homework #

Finally, it’s time for homework. Here is a complete Dockerfile example. You can try to explain its meaning and then build it yourself:

# Dockerfile
# docker build -t ngx-app .
# docker build -t ngx-app:1.0 .

ARG IMAGE_BASE="nginx"
ARG IMAGE_TAG="1.21-alpine"

FROM ${IMAGE_BASE}:${IMAGE_TAG}

COPY ./default.conf /etc/nginx/conf.d/

RUN cd /usr/share/nginx/html \
    && echo "hello nginx" > a.txt

EXPOSE 8081 8082 8083

And here are two questions for you to think about:

  1. The layers in an image are read-only and cannot be modified. However, data is often written during container runtime. How is this conflict resolved? (Answered in the next lesson)
  2. Can you list some benefits that the layered structure of images brings?

Feel free to leave comments. If you find it helpful, you are also welcome to share it with your friends and colleagues for further discussion and learning.