05 Dive In and Out of Strings #

Hello, I’m Jingxiao.

Strings are one of the most common data types in Python, and you will encounter them constantly when reading code. They are used for printing logs, writing function docstrings, accessing databases, performing basic operations on variables, and more.

Of course, I believe you already have some understanding of strings. In today’s lesson, I will mainly review the common operations of strings and provide detailed explanations for some of the tricks used.

String Basics #

So, what is a string? A string is a sequence made up of individual characters, usually enclosed in single quotes (''), double quotes (""), or triple quotes (''' ''' or """ """, both are the same), as shown in the examples below.

name = 'jason'
city = 'beijing'
text = "welcome to jike shijian"

Here, we define three variables, name, city, and text, all of which are strings. In Python, single, double, and triple quotes all produce the same kind of string, so in the following example s1, s2, and s3 are equal.

s1 = 'hello'
s2 = "hello"
s3 = """hello"""
s1 == s2 == s3
True

Python supports all three expression styles mentioned above for strings. One important reason for this is that it makes it convenient for you to embed strings with quotes inside other strings. For example:

"I'm a student"

Triple quotes are mainly used for strings that span multiple lines, such as function docstrings.

def calculate_similarity(item1, item2):
    """
    Calculate similarity between two items
    Args:
        item1: 1st item
        item2: 2nd item
    Returns:
      similarity score between item1 and item2
    """

In addition, Python also supports escape characters. An escape sequence starts with a backslash and represents a single character with a special meaning. Common ones include \n (newline), \t (tab), \\ (backslash), \' (single quote), and \" (double quote).

To help you understand, I will give an example.

s = 'a\nb\tc'
print(s)
a
b	c

In this code snippet, '\n' represents a character - the newline character, and '\t' also represents a character - the tab character. So, the final output printed is the character “a”, a newline, the character “b”, and then a tab character, followed by the character “c”. However, it is important to note that although the final output spans two lines, the entire string s still consists of only 5 elements.

len(s)
5
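To make the character count easier to see, here is a small sketch that inspects the same string with repr() and list(). The variable name backslash is only an illustrative choice.

```python
s = 'a\nb\tc'

print(repr(s))  # shows the escape sequences instead of rendering them
print(list(s))  # ['a', '\n', 'b', '\t', 'c']
print(len(s))   # 5

# An escaped backslash is also a single character
backslash = '\\'
print(len(backslash))  # 1
```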

Among the applications of escape characters, the most common one is the use of the newline character '\n'. For example, when reading files line by line, each line of the string will include a trailing newline character '\n'. However, when doing data processing, we often discard the newline character at the end of each line.
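As a minimal sketch of this pattern, the snippet below reads lines and drops the trailing newline from each one. It uses io.StringIO as a stand-in for a real file so that it runs on its own; the name fake_file and the sample text are assumptions made for illustration.

```python
import io

# io.StringIO stands in for a file object returned by open(...)
fake_file = io.StringIO('first line\nsecond line\nthird line\n')

lines = []
for line in fake_file:
    # Each line still ends with '\n'; rstrip('\n') removes only that newline
    lines.append(line.rstrip('\n'))

print(lines)  # ['first line', 'second line', 'third line']
```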

Common Operations on Strings #

Now that we have covered the basic principles of strings, let’s take a look at some common operations that can be performed on strings. You can think of a string as an array composed of individual characters, so Python strings support indexing, slicing, and iteration, just like other data structures such as lists and tuples.

name = 'jason'
name[0]
'j'
name[1:3]
'as'

As with lists and tuples, string indexing starts from 0. index=0 refers to the first element (character), and [index:index+2] is the substring from the index-th element up to and including the (index+1)-th element, because the end of a slice is exclusive.
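The following sketch spells out the indexing rule with a concrete index. The variable names are illustrative only.

```python
name = 'jason'
index = 1

first = name[0]               # 'j', the first character
pair = name[index:index + 2]  # 'as', characters at positions 1 and 2
prefix = name[:3]             # 'jas', an omitted start means "from 0"
last = name[-1]               # 'n', negative indexes count from the end

print(first, pair, prefix, last)
```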

Iterating over a string is also easy and is equivalent to iterating over each character in the string.

for char in name:
    print(char)
j
a
s
o
n

It is important to note that Python strings are immutable. Therefore, it is incorrect and not allowed to change a character within a string using the following operation:

s = 'hello'
s[0] = 'H'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment

In Python, the modification of strings usually requires creating a new string. For example, to change the first character 'h' in the string 'hello' to uppercase 'H', we can use the following approaches:

s = 'H' + s[1:]
s = s.replace('h', 'H')

The first method directly uses the uppercase 'H' to concatenate with the substring obtained through slicing of the original string to create a new string.

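The second method scans the original string and replaces the lowercase 'h' with the uppercase 'H' to obtain a new string. Note that replace() substitutes every occurrence of 'h', not just the first one, unless you pass a count as the third argument.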
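The difference between the two approaches shows up when the character appears more than once. The sketch below uses a sample string chosen for illustration.

```python
s = 'hello hello'

capitalized_first = 'H' + s[1:]         # only the first character changes
replaced_all = s.replace('h', 'H')      # every 'h' is replaced
replaced_once = s.replace('h', 'H', 1)  # the count argument limits replacements

print(capitalized_first)  # Hello hello
print(replaced_all)       # Hello Hello
print(replaced_once)      # Hello hello
print(s)                  # hello hello (the original string is unchanged)
```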

You may be familiar with mutable string types in other languages, such as Java's StringBuilder, which lets you add, modify, or delete characters in place without creating a new string each time. Appending, for example, takes amortized O(1) time, which can greatly improve the efficiency of programs.

Unfortunately, Python's str type offers no such in-place modification, so we still need to create new strings. Therefore, each modification often costs O(n) time, where n is the length of the new string.

You may have noticed that in the explanation above I used words like “often” and “usually” instead of “definitely”. Why is that? Because as Python has evolved, its implementation has gained optimizations that avoid some of this copying.

Here, I want to emphasize string concatenation with the '+=' operator, because it is the case where such an optimization applies. Note that it does not make strings mutable: the result always behaves like a new string, and only the cost of building it changes.

The operations are as follows:

str1 += str2  # equivalent to str1 = str1 + str2

Let’s take a look at the following example:

s = ''
for n in range(0, 100000):
    s += str(n)

What do you think is the time complexity of this example?

Every time we loop, it seems that a new string needs to be created. And creating a new string each time requires a time complexity of O(n). Therefore, the overall time complexity becomes O(1) + O(2) + … + O(n) = O(n^2). Is this correct?

At first glance, this analysis seems reasonable, but this conclusion only applies to older versions of Python. Since Python 2.5, when performing string concatenation (str1 += str2), the standard CPython interpreter first checks whether str1 has any other references. If not, it attempts to grow the string's buffer in place, rather than allocating a new block of memory and copying the contents. In this case, the time complexity of the example above is typically reduced to O(n). Keep in mind that this is an optimization of the CPython implementation rather than a guarantee of the language, so other interpreters may not provide it.

Therefore, when you encounter string concatenation in your code and find that using ‘+=’ is more convenient, you can generally use it without worrying too much about efficiency.
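Whatever happens internally, '+=' never changes a string that another name still refers to. The short sketch below illustrates this; the names original and alias are illustrative.

```python
original = 'hello'
alias = original       # a second reference to the same string

original += ' world'   # rebinds original to the concatenated result

print(original)  # hello world
print(alias)     # hello (the string seen through alias is untouched)
```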

In addition to the addition operator, we can also use the string method join() for concatenation. separator.join(iterable) concatenates the strings in the iterable, inserting the separator string between adjacent elements.

l = []
for n in range(0, 100000):
    l.append(str(n))
s = ' '.join(l)  # bind the result to a new name instead of reusing l

Appending to a list is amortized O(1), so the loop costs n*O(1) = O(n). The final join makes one pass over the elements and builds the result in time proportional to its total length, so the example as a whole is linear.
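Here is a small sketch of how the separator affects the result of join(). It uses a short range so the output is easy to read, and the variable names are illustrative.

```python
parts = [str(n) for n in range(0, 10)]

joined_plain = ''.join(parts)    # empty separator: plain concatenation
joined_spaced = ' '.join(parts)  # a space between adjacent elements
joined_csv = ','.join(str(n) for n in range(3))  # any iterable of strings works

print(joined_plain)   # 0123456789
print(joined_spaced)  # 0 1 2 3 4 5 6 7 8 9
print(joined_csv)     # 0,1,2
```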

Next, let’s take a look at the split() function for strings. string.split(separator) splits a string into sub-strings based on the separator and returns a list of the resulting sub-strings. It is often used for parsing data. For example, if we have a file path and want to call a database API to read the corresponding data, we could write the following code:

def query_data(namespace, table):
    """
    Given namespace and table, query the database to get the
    corresponding data.
    """

path = 'hive://ads/training_table'
namespace = path.split('//')[1].split('/')[0] # returns 'ads'
table = path.split('//')[1].split('/')[1] # returns 'training_table'
data = query_data(namespace, table)
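The same parsing can be done without splitting the path twice. The sketch below assumes the path always has the form 'scheme://namespace/table'; parse_path is a hypothetical helper name, not part of any library.

```python
def parse_path(path):
    """Hypothetical helper: split 'scheme://namespace/table' into its parts."""
    # maxsplit=1 splits only at the first separator, so the rest stays intact
    rest = path.split('//', 1)[1]
    namespace, table = rest.split('/', 1)
    return namespace, table


namespace, table = parse_path('hive://ads/training_table')
print(namespace)  # ads
print(table)      # training_table
```

Passing maxsplit keeps the table part in one piece even if it contains further slashes.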

Furthermore, there are other common functions:

  • string.strip(chars) removes any leading and trailing characters that appear in chars (whitespace if chars is omitted). The argument is treated as a set of characters, not as a single prefix or suffix string.
  • string.lstrip(chars) does the same, but only at the beginning of the string.
  • string.rstrip(chars) does the same, but only at the end of the string.

These functions are commonly used for parsing data. For example, when we read a string from a file, it may contain leading and trailing whitespace characters that we want to remove. In such cases, we can use the strip() function:

s = ' my name is jason '
s.strip()
'my name is jason'
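A common source of mistakes is how the argument to strip() is interpreted. The sketch below, with sample strings chosen for illustration, shows that the argument acts as a set of characters to remove from the ends.

```python
stripped_ws = '  hi  '.strip()        # no argument: whitespace is removed
stripped_x = 'xxhixx'.strip('x')      # removes 'x' from both ends
left_only = 'xxhixx'.lstrip('x')      # removes 'x' only at the beginning
right_only = 'xxhixx'.rstrip('x')     # removes 'x' only at the end

# 'xy' is a set of characters, so 'y' and 'x' are stripped in any order
stripped_set = 'yxhiyx'.strip('xy')

print(stripped_ws)   # hi
print(stripped_x)    # hi
print(left_only)     # hixx
print(right_only)    # xxhi
print(stripped_set)  # hi
```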

Of course, there are many other common operations on strings in Python, such as string.find(sub, start, end), which returns the position of the first occurrence of the substring sub within the range from start to end, or -1 if it is not found. Here, I have only emphasized a few of the most commonly used functions that are prone to mistakes. For other details, you can refer to the relevant documentation and examples or explore them on your own.
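For instance, find() can be sketched as follows. The sample text and variable names are illustrative.

```python
text = 'hello world'

first_o = text.find('o')         # 4, index of the first 'o'
second_o = text.find('o', 5)     # 7, search starts at index 5
in_range = text.find('o', 5, 7)  # -1, no 'o' within text[5:7]
missing = text.find('z')         # -1, substring not present

print(first_o, second_o, in_range, missing)
```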

String Formatting #

Finally, let’s take a look at string formatting. What exactly is string formatting?

Usually, we use a string as a template, which contains format placeholders. These placeholders reserve positions for the actual values, to display the values in the desired format. String formatting is commonly used in scenarios such as program outputs and logging.

Here is a common example. Suppose we have a task where we are given a user’s userid, and we need to query some information about that user from the database and return it. If there is no data available for that person in the database, we usually want to log this, which can be helpful for future log analytics or debugging online bugs, among other things.

We commonly express this as follows:

print('no data available for person with id: {}, name: {}'.format(id, name))

The string.format() here is the formatting function. The curly braces {} are format placeholders that reserve positions for the real values supplied later, here the id and name variables. If id='123' and name='jason', then the output will be:

no data available for person with id: 123, name: jason

It seems pretty simple, doesn’t it?

However, please note that string.format() is the more recent of the two formatting methods discussed here (Python 3.6 and later also offer f-strings). The older way is the % operator, which still works today. So, the example above could also be written as:

print('no data available for person with id: %s, name: %s' % (id, name))

Here, %s represents a string type, %d represents an integer type, and so on. These are common knowledge, and you should be familiar with them.

Of course, when writing programs today, I still recommend the format function over the % operator, since it is the more recent and more flexible of the two.

Some may wonder why we need to use a formatting function when string concatenation can also achieve the same result. Yes, in many cases, string concatenation can indeed meet the requirements of the formatting function. However, using the formatting function is clearer, easier to read, and more standard, making it less prone to errors.
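To compare the styles side by side, here is a small sketch that produces the same message in several ways. The variable user_id is used instead of id to avoid shadowing the built-in id() function, and the f-string line assumes Python 3.6 or later.

```python
user_id = '123'
name = 'jason'

with_concat = 'no data available for person with id: ' + user_id + ', name: ' + name
with_percent = 'no data available for person with id: %s, name: %s' % (user_id, name)
with_format = 'no data available for person with id: {}, name: {}'.format(user_id, name)
with_fstring = f'no data available for person with id: {user_id}, name: {name}'

print(with_format)

# Placeholders can also carry type or format information
count_msg = '%d rows' % 3               # %d expects an integer
price = '{:.2f}'.format(3.14159)        # two digits after the decimal point
named = '{name} has id {uid}'.format(name=name, uid=user_id)

print(count_msg)  # 3 rows
print(price)      # 3.14
print(named)      # jason has id 123
```

All four variants build the same string; the template-based ones keep the sentence readable in one piece.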

Summary #

In this lesson, we mainly learned about some basic knowledge and common operations of Python strings, and explained them with specific examples and scenarios. There are a few points to pay special attention to:

  • Python strings can be represented using single quotes, double quotes, or triple quotes, and there is no difference in meaning between them. Triple quotes are usually used in scenarios that involve multiple lines of strings.

  • In Python, strings are immutable, so it is not allowed to change individual characters of a string in place. The ‘+=’ concatenation we discussed earlier is not an exception: it is only optimized so that building the new string is cheaper.

  • In newer versions of Python (2.5+), the CPython interpreter optimizes string concatenation with ‘+=’, making it much more efficient, so you can generally use it with confidence.

  • String formatting in Python (using string.format) is often used in scenarios such as output and logging.

Thought Question #

Lastly, I leave you with a thought question. In the newer versions of Python (2.5+), which one of the following string concatenation operations do you think is more efficient? Feel free to leave a comment and share your thoughts with me. You are also welcome to share this article with your colleagues and friends.

s = ''
for n in range(0, 100000):
    s += str(n)



l = []
for n in range(0, 100000):
    l.append(str(n))
    
s = ' '.join(l)