Characters & Strings
Another type of data is textual data, which can be either single characters or sequences of characters, called strings. Strings are often used for human-readable data such as messages or output, but they may also model more general data. For example, DNA is usually encoded using strings consisting of the characters C, G, A, and T (corresponding to the nucleobases cytosine, guanine, adenine, and thymine). Numerical characters and
punctuation can also be used in strings, in which case they do not represent numbers but instead may represent textual versions of numerical data.
Different programming languages implement characters and strings in different ways (or may even treat them the same). Some languages implement strings by defining arrays
of characters. Other languages may treat strings as dynamic data types. However, all languages use some form of character encoding to represent strings. Recall that computers only speak in binary: 0s and 1s. To represent a character like the capital letter “A”, the
binary sequence 0b1000001 is used. In fact, the most common alphanumeric characters are encoded according to the American Standard Code for Information Interchange (ASCII) text standard. The basic ASCII standard assigns characters to the decimal values 0–127, using 7 bits to encode each character as a number. The figure below shows the complete listing of the standard ASCII character set.
The ASCII table was designed to give a natural lexicographic ordering: letters are in alphabetic order, uppercase letters precede their lowercase versions, and numbers precede both. This design allows for an easy and natural comparison among strings. “alpha” would come before “beta” because they differ in the first letter: the characters “a” and “b” have numerical values 97 and 98 respectively, and since 97 < 98, the order follows. Likewise, “Alpha” would come before “alpha” (since 65 < 97), and “alpha” would come before “alphanumeric”: the sixth character is empty in the first string (usually treated as the null character with value 0) while it is “n” in the second (value 110). This is the ordering that we would expect in a dictionary.
There are several other nice design features built into the ASCII table. For example, to convert between uppercase and lowercase versions of a letter, you only need to “flip” the second
of the 7 bits (0 for uppercase, 1 for lowercase); this bit has the value 32, which is exactly the difference between 97 and 65. There are also several special characters that need to be escaped to be defined. For example, though your keyboard has a tab and an enter key, if you wanted to code those characters, you would need to specify them in some way other than typing those keys (since typing those keys will affect what you are typing rather than specifying a character). The standard way to escape characters is to use a backslash along with another, single character. The three most common are the (horizontal) tab, \t, the endline character, \n, and the null terminating character, \0. The tab and endline characters are used to specify their corresponding whitespace characters.
The null character is used in some languages to denote the end of a string and is not printable. ASCII is quite old, originally developed in the early 1960s. President Johnson first
mandated that all computers purchased by the federal government support ASCII in 1968. However, it is quite limited, with only 128 possible characters. Since then, additional
extensions have been developed. The Extended ASCII character set adds support for 128 additional characters (numbered 128 through 255) by adding 1 more bit (8 total). Included in the extension is support for common international characters with diacritics such as ü and ñ, as well as symbols like £ (which are characters 129, 164, and 156 respectively). Even 256 possible characters are not enough to represent the wide array of international characters when you consider languages like Chinese, Japanese, and Korean (CJK). Unicode
was developed to solve this problem by establishing a standard encoding that supports 1,112,064 possible characters, though only a fraction of these are currently assigned. Unicode is backward compatible, so it works with plain ASCII characters. In fact, the most common encoding for Unicode, UTF-8, uses a variable number of bytes to encode characters: 1-byte encodings correspond to plain ASCII, and there are also 2-, 3-, and 4-byte encodings.
In most programming languages, string literals are defined by using either single or double quotes to indicate where the string begins and ends. For example, one may be able to define the string “Hello World”. The double quotes are not part of the string, but instead specify where the string begins and ends. Some languages allow you to use either single or double quotes; PHP, for example, would allow you to also define the same string as ‘Hello World’. Yet other languages, such as C, distinguish the usage of single and double quotes: single quotes are for single characters such as ‘A’ or ‘\n’ while double quotes are used for full strings such as “Hello World”.
In any case, if you want a single or double quote to appear in your string, you need to escape it, similar to how the tab and endline characters are escaped. For example, in C, ‘\’’ would refer to the single quote character, and “Dwayne \”The Rock\” Johnson” would allow you to use double quotes within a string. In our pseudocode we’ll use stylized double quotes, “Hello World”, in any strings that we define.