02 Essential Concepts You Need to Master in Programming Development

Hi, I’m Chen Tian.

In the previous lesson we learned about the basic workings of memory - to briefly recap: data on the stack is static, fixed size, fixed lifetime; data on the heap is dynamic, variable size, variable lifetime.

Today we’ll continue by combing through other basic concepts frequently encountered in programming development. There are many small conceptual points that need to be grasped, so for ease of learning I’ve organized them into four categories: data (values and types, pointers and references), code (functions, methods, closures, interfaces and vtables), execution (concurrency and parallelism, synchronization and asynchrony, Promise/async/await), and programming paradigms (generic programming).

[Figure: the four categories covered in this lesson: data, code, execution, and programming paradigms]

I hope that by reviewing these concepts, you can solidify the fundamentals of software development. This is crucial for understanding many difficult points in Rust later on, like ownership, dynamic dispatch, concurrency handling, etc.

Alright, without further ado, let’s dive in.

Data #

Data are the objects that programs operate on. Programs without data processing are meaningless, so first we’ll review concepts related to data, including values and types, pointers and references.

Values and Types #

Strictly speaking, a type is a distinction drawn over values: it carries information such as how long the value is in memory, how it is aligned, and what operations can be performed on it. A value is a concrete entity conforming to some type. For example, 64u8 has type u8, an integer occupying 1 byte with the value range 0~255; the concrete entity here is 64.

Values are stored as byte streams and are accessed using the representation their type specifies. For example, 64 is stored in memory as 0x40, or 0b0100_0000.

Note that a value cannot be discussed separately from its concrete type. If the byte 0x40 in memory is typed as an ASCII character, it no longer means 64 but the @ symbol.
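A minimal Rust sketch of this point (Rust being the language this course builds toward):

```rust
// A value's meaning depends on its type: the byte 0x40 is the integer 64
// as a u8, but the character '@' when read as ASCII.
fn as_ascii(b: u8) -> char {
    b as char
}

fn main() {
    let v: u8 = 64;
    assert_eq!(v, 0x40);          // stored as 0x40, i.e. 0b0100_0000
    assert_eq!(as_ascii(v), '@'); // the same byte, read as an ASCII character
    println!("{} read as ASCII is {:?}", v, as_ascii(v));
}
```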

Whether in strongly typed or weakly typed languages, concrete representations of types exist internally. Generally, types in programming languages can be divided into two categories:

Primitive types are the most basic data types provided by the language. For example, chars, integers, floats, booleans, arrays, tuples, pointers, references, functions, closures, etc. The sizes of all primitive types are fixed, so they can be allocated on the stack.

Composite types, also called compound types, refer to types formed by combining multiple primitive types and other types. Composite types can also be divided into two subcategories:

  • Structure types: Multiple types combined together to jointly express the complex data structure of a value. For example, a Person struct containing name, age, email, etc. Using the terminology of algebraic data types, structs are product types.

  • Tagged unions: Also called disjoint unions. Can store an object of one of multiple different but fixed types, the specific type is determined by a tag. For example, Haskell’s Maybe type or Swift’s Optional are tagged unions. In algebraic data type terminology, tagged unions are sum types.

Additionally, many languages do not support full tagged unions and instead provide only the tag portion as an enum type. Such an enum is a weaker form of tagged union with less functionality.
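A sketch of the two categories in Rust, whose enum is a full tagged union (the Person and MaybeInt names are illustrative):

```rust
// Product type: a struct holds all of its fields at once.
struct Person {
    name: String,
    age: u8,
}

// Sum type: a tagged union holds exactly one variant at a time; the tag
// that selects the variant is managed by the compiler. This mirrors
// Haskell's Maybe and Swift's Optional.
enum MaybeInt {
    Nothing,
    Just(i32),
}

fn or_default(m: MaybeInt, default: i32) -> i32 {
    match m {
        MaybeInt::Nothing => default,
        MaybeInt::Just(v) => v,
    }
}

fn main() {
    let p = Person { name: "Alice".to_string(), age: 30 };
    println!("{} is {}", p.name, p.age);
    assert_eq!(or_default(MaybeInt::Just(7), 0), 7);
    assert_eq!(or_default(MaybeInt::Nothing, 0), 0);
}
```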

If the definitions alone are not entirely clear, this diagram may help:

[Figure: primitive types vs composite types, with structs as product types and tagged unions as sum types]

Pointers and References #

In memory, a value is stored at a particular location corresponding to a memory address. A pointer is a value holding a memory address that can be dereferenced to access the memory it points to, theoretically able to be dereferenced to any data type.

References are very similar to pointers; the difference is that dereferencing a reference is restricted: it can only be dereferenced as the type of data it refers to and cannot be repurposed. For example, a reference to the value 42u8 can only be treated as a u8 when dereferenced.

So pointers are less restricted in use, but can also cause more harm. If a pointer is dereferenced with the wrong type, it will trigger all kinds of memory issues leading to system crashes or potential security vulnerabilities.
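A small Rust sketch of the difference; the raw-pointer cast below is exactly the kind of reinterpretation a reference forbids:

```rust
// A reference is typed: &u8 can only be read as u8. A raw pointer can be
// cast to point at another type, which is why dereferencing one is unsafe:
// the compiler can no longer guarantee the interpretation is valid.
fn reinterpret_as_i8(v: &u8) -> i8 {
    let p = v as *const u8 as *const i8;
    unsafe { *p } // same byte, different type: correctness is on the programmer
}

fn main() {
    let v: u8 = 42;
    let r: &u8 = &v;
    assert_eq!(*r, 42); // a reference dereferences as u8, and only u8
    assert_eq!(reinterpret_as_i8(&v), 42);
}
```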

As mentioned earlier, pointers and references are primitive types that can be allocated on the stack.

Depending on the data they point to, some references must carry, besides the memory address itself, extra information such as a length.

As mentioned in the last lesson, besides a pointer to the data's memory address, a pointer to the string "hello world" also contains the string's length and capacity, three words in total, occupying 24 bytes on a 64-bit CPU. Pointers like this, which carry extra information beyond a plain address, are called fat pointers. Many data structures' references are implemented internally with fat pointers.
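These sizes can be verified with std::mem::size_of; the sketch below assumes a 64-bit target where one word is 8 bytes:

```rust
use std::mem::size_of;

// How many machine words a type occupies (assumes a 64-bit target,
// where one word is 8 bytes).
fn words<T>() -> usize {
    size_of::<T>() / size_of::<usize>()
}

fn main() {
    // A String holds pointer + length + capacity: 3 words, 24 bytes.
    assert_eq!(words::<String>(), 3);
    // A &str is a fat pointer: pointer + length, 2 words.
    assert_eq!(words::<&str>(), 2);
    // A reference to a sized type is a thin pointer: just the address.
    assert_eq!(words::<&u64>(), 1);
}
```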

Code #

Data are the objects programs operate on, while code is what a program executes: it is how we developers translate requirements in the physical world into logic in the digital world. We'll discuss functions, methods, and closures, then interfaces and vtables.

Functions, Methods, and Closures #

Functions are a basic element of programming languages, encapsulating a set of related statements and expressions to accomplish some functionality. Functions also abstract repeated behaviors in code. In modern programming languages, functions are often first-class citizens, meaning they can be passed as parameters, returned as values, and be part of composite types.

In object-oriented languages, functions defined within classes or objects are called methods. Methods are often related to object pointers, such as Python’s self reference or Java’s this reference.

Closures are a data structure that stores functions, or rather code and its environment together. Free variables from the context referenced in the function body will be captured into the closure structure and become part of the closure type.

Generally speaking, if a language has first-class functions, it must support closures, because a function returned from another function usually needs to capture variables from its defining environment, which makes it a closure.

The diagram below shows a closure capturing its surrounding context:

[Figure: a closure capturing free variables from its surrounding context]
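The capture described above can be sketched in Rust, including a function that returns a closure:

```rust
// The closure returned here captures the free variable `n` from its
// environment; the captured data becomes part of the closure's own structure.
fn adder(n: i32) -> impl Fn(i32) -> i32 {
    // `move` transfers ownership of `n` into the closure,
    // so the closure can outlive this function call.
    move |x| x + n
}

fn main() {
    let name = "world".to_string();
    // `greet` captures `name`; the String is now part of the closure.
    let greet = move || format!("hello, {}", name);
    assert_eq!(greet(), "hello, world");

    // First-class functions returning closures: the classic adder.
    let add3 = adder(3);
    assert_eq!(add3(4), 7);
}
```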

Interfaces and Vtables #

Interfaces are a core part of software system development, reflecting the system designer’s abstract understanding of the system. As an abstraction layer, interfaces separate users and implementors so they have no direct dependencies, greatly improving reusability and extensibility.

Many programming languages have the concept of interfaces, allowing developers to design to interfaces, such as Java’s interface, Elixir’s behaviour, Swift’s protocol, and Rust’s trait.

For example, in HTTP, the Request/Response service handling model is a typical interface. We just need to define how to map from Request to Response for different inputs according to the interface definition. Through this interface, the system can dispatch Requests meeting the requirements to our service in suitable scenarios.

Interface-oriented design is an important capability in software development, and Rust places particular emphasis on interface capabilities. When we get to the Trait section later, we’ll introduce how to use Traits for interface design in detail.

When we use interfaces at runtime to reference concrete types, code gains runtime polymorphism. However, once an interface reference is used, the variable's original type is erased, and we can no longer tell from static analysis alone what capabilities the pointer has.

Therefore, when generating such a reference, we need to build a fat pointer that not only points to the data itself, but also points to a list covering the methods supported by the interface. This list is the familiar virtual table (vtable).

The diagram below shows how a Vec's concrete type is erased at runtime to produce a reference pointing to the Write interface:

[Figure: a Vec's type erased into a Write interface reference, a fat pointer holding a data pointer and a vtable pointer]
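This erasure can be observed directly; the sketch below uses Rust's std::io::Write as the interface and assumes a 64-bit target:

```rust
use std::io::Write;
use std::mem::size_of;

// A trait-object reference is a fat pointer: one word for the data
// pointer, one word for the vtable pointer.
fn dyn_ref_words() -> usize {
    size_of::<&mut dyn Write>() / size_of::<usize>()
}

fn main() {
    let mut buf: Vec<u8> = Vec::new();
    // Erase the concrete type Vec<u8> into a Write interface reference.
    let w: &mut dyn Write = &mut buf;
    w.write_all(b"hello").unwrap();

    assert_eq!(buf, b"hello");
    assert_eq!(dyn_ref_words(), 2);
}
```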

Since the vtable records the executable interface methods of the data, at runtime we can dynamically dispatch based on context if we want different implementations for an interface.

For example, I want to implement formatters for different languages for a text editor’s Formatter interface. When the editor loads, we can put all supported languages and their formatter tools into a hash map, with language type as key and a reference to each formatter tool implementing the Formatter interface as value. This way, when the user opens a file in the editor, we can find the corresponding Formatter reference based on file type to format it.
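A minimal sketch of this registry idea; the Formatter trait and the formatter names here are hypothetical, not taken from a real editor:

```rust
use std::collections::HashMap;

// Hypothetical interface for the editor example.
trait Formatter {
    fn format(&self, input: &str) -> String;
}

struct RustFormatter;
struct HtmlFormatter;

impl Formatter for RustFormatter {
    fn format(&self, input: &str) -> String {
        format!("rustfmt: {}", input)
    }
}

impl Formatter for HtmlFormatter {
    fn format(&self, input: &str) -> String {
        format!("tidy: {}", input)
    }
}

// Map file type -> interface reference; each Box<dyn Formatter> is a
// fat pointer whose vtable enables dynamic dispatch.
fn build_registry() -> HashMap<&'static str, Box<dyn Formatter>> {
    let mut m: HashMap<&'static str, Box<dyn Formatter>> = HashMap::new();
    m.insert("rs", Box::new(RustFormatter));
    m.insert("html", Box::new(HtmlFormatter));
    m
}

fn main() {
    let registry = build_registry();
    // Dispatch by file type at runtime, through the vtable.
    println!("{}", registry["rs"].format("fn main() {}"));
}
```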

Execution #

How code runs after being loaded often determines the execution efficiency of the program. Next we’ll discuss concurrency, parallelism, synchronization, asynchrony, and some important concepts in asynchrony like Promise/async/await.

Concurrency vs Parallelism #

Concurrency and parallelism are concepts frequently encountered in software development.

Concurrency is the ability to deal with multiple things at once. For example, the system can suspend and switch to task 2 after task 1 makes some progress, saving task 1’s context. Then after some time it switches back to task 1 again.

Parallelism is the means to handle multiple things simultaneously. That is, task 1 and task 2 can work within the same time slice without needing context switches. The diagram below illustrates the difference well:

[Figure: concurrency interleaves tasks with context switches; parallelism runs tasks simultaneously]

Concurrency is an ability, while parallelism is a means. When our system gains concurrency capabilities, code can run in parallel when spread across multiple CPU cores. So we normally discuss high concurrency handling rather than high parallelism.

Many programming languages with high concurrency handling embed an M:N scheduler in user programs, reasonably spreading M concurrent tasks across N CPU cores to maximize throughput.

Synchronization and Asynchrony #

Synchronization means subsequent operations block after a task starts executing until the task finishes. In software, most of our code executes synchronously - for example, the CPU only executes the next instruction after the previous pipeline stage completes. A function A calling functions B and C will also execute B fully before executing C.

Synchronous execution guarantees causality in code, ensuring correctness.

However, when it comes to I/O, the huge gap between fast CPU instructions and slow I/O becomes a performance killer. The following diagram comparing CPU, memory, I/O device, and network latency illustrates the gap:

[Figure: latency numbers comparing CPU, memory, I/O devices, and the network]

We can see that I/O operations are slower than memory access by at least two orders of magnitude. Once an I/O operation starts, the CPU can only idle while waiting for it to complete. Therefore, the operating system provides asynchronous I/O, allowing applications to use the CPU for other tasks while the current I/O is in flight.

So asynchrony means subsequent operations unrelated causally to a task can execute normally without waiting for the first to finish.

In asynchronous operations, the result is generally held by a Promise: an object describing a value that will become available at some point in the future. A Promise generally has three states:

  1. Initial state, Promise not yet run
  2. Pending state, Promise started but not finished
  3. Settled state, the Promise has either resolved successfully with a value or failed

If you’re not too familiar with the term Promise, in many languages supporting asynchrony it’s also called Future/Delay/Deferred. In addition to this term, we also often see the async/await keywords.

In general, async defines a task that can execute concurrently, while await triggers the task's execution and waits for it to complete. In most languages, async/await is syntactic sugar that wraps a Promise in a state machine, making asynchronous calls read much like synchronous calls and keeping the code easy to follow.
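As an analogy only (not Rust's actual async/await machinery, which additionally needs an executor), a thread's JoinHandle behaves like a promise: spawning starts the task, joining waits for it to settle:

```rust
use std::thread;

// JoinHandle plays the promise role here: `spawn` starts the task,
// the result stays pending until `join` blocks and waits for it.
fn start_task(n: u64) -> thread::JoinHandle<u64> {
    thread::spawn(move || n * 2)
}

fn main() {
    let pending = start_task(21);          // task running concurrently
    // ... causally unrelated work can proceed here ...
    let settled = pending.join().unwrap(); // wait until the value settles
    assert_eq!(settled, 42);
}
```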

Programming Paradigms #

To better maintain code during constant iteration, we also introduce all kinds of programming paradigms to improve code quality. So finally let’s talk about generic programming.

If you come from a weakly typed language like C/Python/JavaScript, generic programming is a concept and skill you need to focus on mastering. Generic programming has two levels - generics in data structures, and genericizing code using generic structures.

Generic Data Structures #

First is generic data structures, also often called parameterized types or parametric polymorphism, such as this data structure:

```rust
struct Connection<S> {
  io: S,
  state: State,
}
```

It has a parameter S, and its field io is of type S. The concrete type of S is bound only in the context where Connection is used. You can think of a parameterized data structure as a function that generates types: when "called", it accepts concrete types as arguments and returns a type carrying those types. For example, if we provide TcpStream for S, we get a Connection type whose io field is a TcpStream.

Here you may wonder, if S can be any type, how do we know what behaviors S has? If we want to call io.send() to send data, how does the compiler know S contains this method?

This is a good question. We need to constrain S using an interface. So you’ll often see that languages supporting generic programming also provide powerful interface programming capabilities. When covering Rust’s traits later, I’ll explore this issue further.
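A minimal Rust sketch of such a constraint, using the standard Write trait as the interface and omitting the state field from the earlier snippet:

```rust
use std::io::Write;

// Constrain S with an interface so the compiler knows every concrete S
// supports the operations we call on `io`.
struct Connection<S: Write> {
    io: S,
}

impl<S: Write> Connection<S> {
    // Hypothetical send method; internally it uses Write's write_all.
    fn send(&mut self, data: &[u8]) -> std::io::Result<()> {
        self.io.write_all(data)
    }
}

fn main() {
    // Bind S to a concrete type: Vec<u8> implements Write.
    let mut conn = Connection { io: Vec::new() };
    conn.send(b"ping").unwrap();
    assert_eq!(conn.io, b"ping");
}
```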

Generic data structures provide advanced abstraction, like how we humans use numbers to abstract specific quantities, then invented algebra to further abstract specific numbers. It brings the benefit of delayed binding, making data structures more generic and applicable to more scenarios, also greatly reducing code duplication and improving maintainability.

Genericizing Code #

The other level of generic programming is writing generic code using generic structures. When we write code using generic structures, related code also needs additional abstraction.

The binary-search example we are all familiar with makes this clear:

[Figure: binary search written in C (left) and as a generic C++ implementation over iterators (right)]

In the C version on the left, the highlighted operations are tied to the int[] type, so binary searching a different data type requires changing the implementation accordingly. The C++ version on the right abstracts those parts away, so the same code can binary search any data type that provides iterators.

Similarly, such code can be used in more generalized scenarios and is more concise and maintainable.
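The same abstraction can be sketched in Rust, generic over any ordered element type:

```rust
use std::cmp::Ordering;

// Generic binary search over any slice of ordered elements; returns the
// index of `target` if present.
fn binary_search<T: Ord>(data: &[T], target: &T) -> Option<usize> {
    let (mut lo, mut hi) = (0, data.len());
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        match data[mid].cmp(target) {
            Ordering::Less => lo = mid + 1,    // target is in the upper half
            Ordering::Greater => hi = mid,     // target is in the lower half
            Ordering::Equal => return Some(mid),
        }
    }
    None
}

fn main() {
    assert_eq!(binary_search(&[1, 3, 5, 7], &5), Some(2));
    assert_eq!(binary_search(&["a", "c", "e"], &"b"), None);
}
```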

Summary #

Today we discussed four categories of basic concepts: data, code, execution, and programming paradigms.

[Figure: summary of the concepts in data, code, execution, and programming paradigms]

Values cannot be discussed separately from types. Types generally divide into primitive and composite types. Pointers and references both hold a value's memory address; they differ in dereferencing behavior. A reference can only be dereferenced as the original data type, while a pointer has no such restriction; however, unconstrained pointer dereferencing can introduce memory-safety issues.

Functions abstract repeated behaviors in code, while methods are functions defined internally in objects. Closures are a special kind of function that captures free variables from the context used in the function body as part of the closure.

Interfaces isolate callers and implementors, greatly promoting code reuse and extensibility. Programming to interfaces allows the system to be flexible. When using interfaces to reference concrete types, we need vtables to aid execution of runtime code. Vtables allow easy dynamic dispatch, forming the basis of runtime polymorphism.

For code execution, concurrency is the basis of parallelism, the ability to deal with multiple tasks concurrently. Parallelism is the realization of concurrency, the means to handle multiple tasks simultaneously. Synchronization blocks subsequent operations while asynchrony allows them to proceed. Promise, widely used in asynchronous operations, represents the result available at some future point in time. async/await wraps Promises, generally implemented using state machines.

Generic programming uses parameterization to make data structures delay binding like functions, improving their generality. Type parameters can be constrained by interfaces to make types satisfy certain behaviors. Also, when using generic structures, our code needs higher abstraction.

These basic concepts are crucial for understanding many Rust concepts later. Please comment if some concepts are still unclear to you so we can discuss them further.

Review Questions #

(Since we haven’t covered specific Rust syntax yet, you can use your usual programming language to think about these questions and solidify your understanding of basic concepts)

  1. If a pointer to a function is dereferenced as a list, and an element is then inserted into that list, what will happen? (Compare across languages: is this operation allowed, and if it is, what happens?)

  2. We want to construct a Shape data structure that can be a Rectangle, a Circle, or a Triangle, as defined below. What data structure should Shape use, and how would you implement it?

```rust
struct Rectangle {
  a: f64,
  b: f64,
}

struct Circle {
  r: f64,
}

struct Triangle {
  a: f64,
  b: f64,
  c: f64,
}
```
  3. For the three structures above, if we want to define an interface that calculates perimeter and area, how would you implement it?

Please feel free to share your thoughts in the comments! This is the second lesson in our journey - if you found it helpful, also feel free to share with your friends and invite them to discuss together.

Reference Materials #

Latency numbers every programmer should know, comparing CPU, memory, I/O device, and network latency