Character Data Type and Operations

Paul Ngugi - May 6 - Dev Community

In addition to processing numeric values, you can process characters in Java. The character data type, char, represents a single character. A character literal is enclosed in single quotation marks. Consider the following code:



char letter = 'A';
char numChar = '4';



The first statement assigns character A to the char variable letter. The second statement assigns digit character 4 to the char variable numChar.

A string literal must be enclosed in double quotation marks (" "). A character literal is a single character enclosed in single quotation marks (' '). Therefore, "A" is a string, but 'A' is a character.
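The difference between the two kinds of literal can be checked with a short sketch. The class name CharVersusString and the helper typeName are hypothetical names chosen for this illustration only.

```java
class CharVersusString {
    // Hypothetical helper: reports the runtime type name of a value.
    static String typeName(Object value) {
        return value.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        char letter = 'A';   // single quotes: one char value
        String text = "A";   // double quotes: a String of length 1

        System.out.println(typeName(letter));         // Character (the char is boxed)
        System.out.println(typeName(text));           // String
        System.out.println(text.length());            // 1
        System.out.println(text.charAt(0) == letter); // true
    }
}
```

The char is boxed into a Character object when it is passed to typeName, which is why the first line prints Character rather than char.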

Unicode and ASCII code

Computers use binary numbers internally. A character is stored in a computer as a sequence of 0s and 1s. Mapping a character to its binary representation is called encoding. There are different ways to encode a character. How characters are encoded is defined by an encoding scheme.

Java supports Unicode, an encoding scheme established by the Unicode Consortium to support the interchange, processing, and display of written texts in the world’s diverse languages. Unicode was originally designed as a 16-bit character encoding. The primitive data type char was intended to take advantage of this design by providing a simple data type that could hold any character. However, it turned out that the 65,536 characters possible in a 16-bit encoding are not sufficient to represent all the characters in the world. The Unicode standard therefore has been extended to allow up to 1,112,064 characters. Those characters that go beyond the original 16-bit limit are called supplementary characters. Java supports the supplementary characters. The processing and representing of supplementary characters are beyond the scope of this book. For simplicity, this book considers only the original 16-bit Unicode characters. These characters can be stored in a char type variable.
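Although supplementary characters are not covered further here, a minimal sketch can show why one of them does not fit in a single char. It relies on the standard methods Character.charCount, Character.toChars, and String.codePointCount. The class name SupplementaryDemo and the helper charsNeeded are hypothetical, and the code point 0x1F600 is just one example of a supplementary character.

```java
class SupplementaryDemo {
    // Hypothetical helper: number of char values needed to hold one code point.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int letterA = 0x0041; // within the original 16-bit range
        int emoji = 0x1F600;  // beyond the 16-bit range (a supplementary character)

        System.out.println(charsNeeded(letterA)); // 1
        System.out.println(charsNeeded(emoji));   // 2

        String s = new String(Character.toChars(emoji));
        System.out.println(s.length());                      // 2 char values
        System.out.println(s.codePointCount(0, s.length())); // 1 character
    }
}
```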

A 16-bit Unicode character takes two bytes and is written as \u followed by four hexadecimal digits, running from \u0000 to \uFFFF.
Most computers use ASCII (American Standard Code for Information Interchange), a 7-bit encoding scheme (usually stored in 8-bit bytes) for representing all uppercase and lowercase letters, digits, punctuation marks, and control characters. Unicode includes ASCII code, with \u0000 to \u007F corresponding to the 128 ASCII characters. The table below shows the ASCII code for some commonly used characters.

[Table: ASCII codes for some commonly used characters]
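A few ASCII codes can also be printed directly from Java. The sketch below is only an illustration: the class name AsciiCodes and the helper row are hypothetical, and the characters listed are an arbitrary sample.

```java
class AsciiCodes {
    // Hypothetical helper: formats a character, its decimal code,
    // and its four-digit hexadecimal Unicode escape.
    static String row(char c) {
        return String.format("%c %d \\u%04X", c, (int) c, (int) c);
    }

    public static void main(String[] args) {
        char[] samples = {'0', '9', 'A', 'Z', 'a', 'z'};
        for (char c : samples) {
            System.out.println(row(c)); // for example: A 65 followed by the escape for 0041
        }
    }
}
```

Casting the char to int is what exposes the numeric code; the same value is printed once in decimal and once in hexadecimal.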

You can use ASCII characters such as 'X', '1', and '$' in a Java program as well as Unicode escapes. Thus, for example, the following statements are equivalent:



char letter = 'A';
char letter = '\u0041'; // Character A's Unicode is 0041



Both statements assign character A to the char variable letter.

The increment and decrement operators can also be used on char variables to get the next or preceding Unicode character. For example, the following statements display character b.



char ch = 'a';
System.out.println(++ch);
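The following sketch extends the example slightly. It assumes two hypothetical helpers, next and previous, and also shows that ch + 1 is an int expression, so it needs a cast to be treated as a character again.

```java
class CharIncrement {
    // Hypothetical helpers wrapping the increment and decrement operators.
    static char next(char ch) {
        return ++ch; // ++ keeps the type char
    }

    static char previous(char ch) {
        return --ch;
    }

    public static void main(String[] args) {
        char ch = 'a';
        System.out.println(next(ch));        // b
        System.out.println(previous('b'));   // a
        System.out.println(ch + 1);          // 98, because ch + 1 is an int
        System.out.println((char) (ch + 1)); // b
    }
}
```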



Escape Sequences for Special Characters

Suppose you want to print a message with quotation marks in the output. Can you write a statement like this?



System.out.println("He said "Java is fun"");



No, this statement has a compile error. The compiler thinks the second quotation character is the end of the string and does not know what to do with the rest of the characters. To overcome this problem, Java uses a special notation to represent special characters, as shown below.

[Table: escape sequences for special characters]

This special notation, called an escape sequence, consists of a backslash (\) followed by a character or a combination of digits. For example, \t is an escape sequence for the Tab character, and an escape sequence such as \u03b1 represents a Unicode character. The symbols in an escape sequence are interpreted as a whole rather than individually. An escape sequence is considered a single character.
So, now you can print the quoted message using the following statement:



System.out.println("He said \"Java is fun\"");



The output is

He said "Java is fun"

Note that the symbols \ and " together represent one character. The backslash \ is called an escape character. It is a special character. To display this character, you have to use the escape sequence \\. For example, the following code



System.out.println("\\t is a tab character");



displays

\t is a tab character
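A few escape sequences can be tried together in one small program. This is a sketch only; the class name EscapeDemo and the helper quoted are hypothetical.

```java
class EscapeDemo {
    // Hypothetical helper: wraps a string in double quotation marks.
    static String quoted(String s) {
        return "\"" + s + "\"";
    }

    public static void main(String[] args) {
        System.out.println("He said " + quoted("Java is fun")); // He said "Java is fun"
        System.out.println("Name\tScore");         // a Tab separates the two words
        System.out.println("Line one\nLine two");  // printed on two lines
        System.out.println("C:\\temp");            // one backslash is displayed
        System.out.println("\t".length());         // 1: an escape sequence is one character
        System.out.println("\\t".length());        // 2: a backslash followed by the letter t
    }
}
```

The last two lines illustrate the point made above: an escape sequence counts as a single character, while an escaped backslash followed by t is two characters.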

Casting between char and Numeric Types

A char can be cast into any numeric type, and vice versa. When an integer is cast into a char, only its lower 16 bits of data are used; the other part is ignored. For example:



char ch = (char)0xAB0041; // The lower 16 bits (hex code 0041) are
// assigned to ch
System.out.println(ch); // ch is character A
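To make the "lower 16 bits" rule concrete, the sketch below compares the cast with an explicit bit mask. The class name IntToChar and the helper lower16 are hypothetical names for this illustration.

```java
class IntToChar {
    // Hypothetical helper: the cast keeps only the lower 16 bits of the int.
    static char lower16(int value) {
        return (char) value;
    }

    public static void main(String[] args) {
        int value = 0xAB0041;
        System.out.println(lower16(value));                     // A
        System.out.println((value & 0xFFFF) == lower16(value)); // true: same as masking
        System.out.println((int) lower16(65536 + 66));          // 66: the bit above the lower 16 is dropped
    }
}
```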



When a floating-point value is cast into a char, the floating-point value is first cast into an int, which is then cast into a char.



char ch = (char)65.25; // Decimal 65 is assigned to ch
System.out.println(ch); // ch is character A



When a char is cast into a numeric type, the character’s Unicode is cast into the specified numeric type.



int i = (int)'A'; // The Unicode of character A is assigned to i
System.out.println(i); // i is 65
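The two directions of casting can be put side by side in a short sketch. The class name CastRoundTrip and the helpers fromDouble and toCode are hypothetical. Note that the fractional part is truncated, not rounded.

```java
class CastRoundTrip {
    // Hypothetical helper: double -> int (truncation) -> char.
    static char fromDouble(double d) {
        return (char) d;
    }

    // Hypothetical helper: char -> int is a widening conversion, so no cast is needed.
    static int toCode(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(fromDouble(65.25)); // A
        System.out.println(fromDouble(65.99)); // A, the fraction is dropped
        System.out.println(toCode('A'));       // 65
        System.out.println((double) 'A');      // 65.0
    }
}
```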



Implicit casting can be used if the result of a casting fits into the target variable. Otherwise, explicit casting must be used. For example, since the Unicode of 'a' is 97, which is within the range of a byte, these implicit castings are fine:



byte b = 'a';
int i = 'a';



But the following casting is incorrect, because the Unicode \uFFF4 cannot fit into a byte:



byte b = '\uFFF4';



To force this assignment, use explicit casting, as follows:



byte b = (byte)'\uFFF4';
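It may help to see what value the forced assignment actually stores. In the sketch below (the class name ByteCast and the helper toByte are hypothetical), only the lowest 8 bits of the character's code survive, and they are then read as a signed byte.

```java
class ByteCast {
    // Hypothetical helper: the cast keeps the lowest 8 bits of the char.
    static byte toByte(char c) {
        return (byte) c;
    }

    public static void main(String[] args) {
        byte fits = 'a';                      // the constant 97 fits in a byte
        System.out.println(fits);             // 97
        System.out.println(toByte('a'));      // 97
        System.out.println(toByte('\uFFF4')); // -12: hex F4 is 244, read as a signed byte
    }
}
```

The explicit cast compiles, but the stored value no longer identifies the original character, so this kind of narrowing should be used with care.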



Any integer literal between 0 and FFFF in hexadecimal can be assigned to a char implicitly. A literal outside this range, or a value stored in an ordinary (non-constant) int variable, must be cast into a char explicitly.
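The sketch below illustrates the rule, with the understanding that the implicit form applies to constants such as literals; a value held in an ordinary int variable still needs an explicit cast. The class name ImplicitCharCast and the helper fromInt are hypothetical, and the lines that would not compile are left as comments.

```java
class ImplicitCharCast {
    // Hypothetical helper: an int variable needs an explicit cast to become a char.
    static char fromInt(int code) {
        return (char) code;
    }

    public static void main(String[] args) {
        char a = 65;        // literal in range: implicit casting is allowed
        char max = 0xFFFF;  // the largest value that fits
        // char tooBig = 65536;  // would not compile: outside 0 to FFFF
        int code = 66;
        // char b = code;        // would not compile: code is a variable, not a constant
        char b = fromInt(code);

        System.out.println(a);         // A
        System.out.println((int) max); // 65535
        System.out.println(b);         // B
    }
}
```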

All numeric operators can be applied to char operands. A char operand is automatically cast into a number if the other operand is a number or a character. If the other operand is a string, the character is concatenated with the string. For example, the following statements



int i = '2' + '3'; // (int)'2' is 50 and (int)'3' is 51
System.out.println("i is " + i); // i is 101
int j = 2 + 'a'; // (int)'a' is 97
System.out.println("j is " + j); // j is 99
System.out.println(j + " is the Unicode for character "
 + (char)j); // 99 is the Unicode for character c
System.out.println("Chapter " + '2');



display



i is 101
j is 99
99 is the Unicode for character c
Chapter 2
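Two common uses of char arithmetic, plus one pitfall with string concatenation, are sketched below. The class name CharArithmetic and the helpers digitValue and shift are hypothetical names for this illustration.

```java
class CharArithmetic {
    // Hypothetical helper: numeric value of a digit character, e.g. '7' -> 7.
    static int digitValue(char digit) {
        return digit - '0';
    }

    // Hypothetical helper: moves a letter forward by offset positions.
    static char shift(char letter, int offset) {
        return (char) (letter + offset); // the sum is an int, so a cast is needed
    }

    public static void main(String[] args) {
        System.out.println(digitValue('7')); // 7
        System.out.println(shift('a', 2));   // c
        System.out.println('a' + 'b' + "c"); // 195c: the two chars are added first
        System.out.println("c" + 'a' + 'b'); // cab: concatenation from the left
    }
}
```

Because the + operator is evaluated from left to right, two char operands that come before a string are added as numbers, while chars that come after a string are concatenated.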



Comparing and Testing Characters

Two characters can be compared using the relational operators just like comparing two numbers. This is done by comparing the Unicodes of the two characters. For example,
'a' < 'b' is true because the Unicode for 'a' (97) is less than the Unicode for 'b' (98).
'a' < 'A' is false because the Unicode for 'a' (97) is greater than the Unicode for 'A' (65).
'1' < '8' is true because the Unicode for '1' (49) is less than the Unicode for '8' (56).
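One visible consequence of comparing by code is the order produced when characters are sorted: digits come first, then uppercase letters, then lowercase letters. The sketch below uses the standard Arrays.sort method; the class name CharOrdering and the helper sorted are hypothetical.

```java
import java.util.Arrays;

class CharOrdering {
    // Hypothetical helper: returns the characters of s in ascending order of their codes.
    static String sorted(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println('a' < 'b');       // true
        System.out.println('a' < 'A');       // false
        System.out.println('1' < '8');       // true
        System.out.println(sorted("bAa1Z")); // 1AZab
    }
}
```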

Often in a program, you need to test whether a character is a digit, a letter, an uppercase letter, or a lowercase letter. In the ASCII character set, the Unicodes for lowercase letters are consecutive integers starting from the Unicode for 'a', then for 'b', 'c', . . ., and 'z'. The same is true for the uppercase letters and for numeric characters. This property can be used to write the code to test characters. For example, the following code tests whether a character ch is an uppercase letter, a lowercase letter, or a numeric character.



if (ch >= 'A' && ch <= 'Z')
 System.out.println(ch + " is an uppercase letter");
else if (ch >= 'a' && ch <= 'z')
 System.out.println(ch + " is a lowercase letter");
else if (ch >= '0' && ch <= '9')
 System.out.println(ch + " is a numeric character");
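The fragment above can be wrapped in a method so that it runs on its own. This is a minimal sketch: the class name CharClassifier and the method classify are hypothetical, and a final branch for all other characters has been added as an assumption about what a complete version might do.

```java
class CharClassifier {
    // Hypothetical method: classifies ch using the range tests described above.
    static String classify(char ch) {
        if (ch >= 'A' && ch <= 'Z') {
            return "an uppercase letter";
        } else if (ch >= 'a' && ch <= 'z') {
            return "a lowercase letter";
        } else if (ch >= '0' && ch <= '9') {
            return "a numeric character";
        } else {
            return "some other character"; // added branch, not in the original fragment
        }
    }

    public static void main(String[] args) {
        char[] samples = {'G', 'g', '7', '#'};
        for (char ch : samples) {
            System.out.println(ch + " is " + classify(ch));
        }
    }
}
```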



For convenience, Java provides methods in the Character class for testing characters, as shown below:

[Table: methods in the Character class]

For example,



System.out.println("isDigit('a') is " + Character.isDigit('a'));
System.out.println("isLetter('a') is " + Character.isLetter('a'));
System.out.println("isLowerCase('a') is "
 + Character.isLowerCase('a'));
System.out.println("isUpperCase('a') is "
 + Character.isUpperCase('a'));
System.out.println("toLowerCase('T') is "
 + Character.toLowerCase('T'));
System.out.println("toUpperCase('q') is "
 + Character.toUpperCase('q'));



displays



isDigit('a') is false
isLetter('a') is true
isLowerCase('a') is true
isUpperCase('a') is false
toLowerCase('T') is t
toUpperCase('q') is Q
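One reason to prefer the Character methods over hand-written range tests is that they also recognize letters outside ASCII. The sketch below uses the accented letter with code 00E9 as an example; the class name CharacterMethodsDemo and the helper asciiLower are hypothetical.

```java
class CharacterMethodsDemo {
    // Hypothetical helper: a range test that only knows the ASCII letters a to z.
    static boolean asciiLower(char ch) {
        return ch >= 'a' && ch <= 'z';
    }

    public static void main(String[] args) {
        char eAcute = '\u00E9'; // lowercase e with an acute accent

        System.out.println(asciiLower(eAcute));               // false: outside a to z
        System.out.println(Character.isLowerCase(eAcute));    // true
        System.out.println(Character.isLetter(eAcute));       // true
        System.out.println(Character.isDigit(eAcute));        // false
        System.out.println(Character.toUpperCase(eAcute) == '\u00C9'); // true
    }
}
```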


