Python Program to Determine the Unicode Code Point at a Given Index


A Unicode code point is a unique number that identifies a character in the Unicode character set. Unicode is a character encoding standard that assigns a unique code to every character in the world's writing systems. It defines more than 140,000 characters, including letters, symbols, and emojis. We can determine the Unicode code point at a specific index using the ord() function, the codecs module, the unicodedata module, and the array module in Python. In this article, we will discuss how to determine the Unicode code point at a given index using all these methods.

Unicode Code Point

In Unicode, every character is represented by a unique number called its code point. A code point is conventionally written in hexadecimal notation with a “U+” prefix followed by four to six hexadecimal digits, for example U+0065 for the letter 'e'. Valid code points range from U+0000 to U+10FFFF.
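As a small illustration of this notation, the sketch below formats an integer code point in the "U+" form. The helper name format_code_point is hypothetical and chosen only for this example.

```python
def format_code_point(code_point: int) -> str:
    """Return the conventional U+XXXX notation for an integer code point."""
    # Valid code points run from 0 to 0x10FFFF inclusive.
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"{code_point!r} is outside the Unicode range")
    # At least four hexadecimal digits, more only when the value needs them.
    return f"U+{code_point:04X}"


print(format_code_point(0x65))      # U+0065
print(format_code_point(0x1F600))   # U+1F600
print(format_code_point(0x10FFFF))  # U+10FFFF
```

The :04X format pads short values to four digits and lets larger values grow to five or six digits as needed.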

Python Program to Determine Unicode Code Point

Method 1: Using the ord() function

We can use the ord() function in Python to get the Unicode code point of the character at a given index. This is the most direct method, because a Python 3 string is a sequence of Unicode code points.

Syntax

code_point = ord(string[index])

Here, string[index] selects the character at the given index, and the ord() function takes that single-character string as its argument and returns its Unicode code point as an integer.

Example

In the below example, we first get the character at a specific index in the string and then pass that character to the ord() function in Python to get the Unicode code point of that character.

# Get the Unicode code point at a given index
def get_unicode_code_point(string, index):
    char = string[index]
    code_point = ord(char)
    return code_point

# Test the function
string = "Hello, World!"
index = 1
code_point = get_unicode_code_point(string, index)
print(f"The Unicode code point of the character '{string[index]}' at index {index} is U+{code_point:04X}.")

Output

The Unicode code point of the character 'e' at index 1 is U+0065.
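The same approach works beyond ASCII, because indexing a Python 3 string selects a code point rather than a byte. The following sketch uses a hypothetical helper name, code_point_at, and an arbitrary sample string; it also shows that an out-of-range index raises IndexError.

```python
def code_point_at(text: str, index: int) -> int:
    """Return the code point of the character at the given index."""
    # Indexing a str yields a one-character str; ord() turns it into an int.
    return ord(text[index])


sample = "na\u00efve \U0001F600"  # "naïve" followed by a space and an emoji
for i in (2, 6):
    print(f"index {i}: U+{code_point_at(sample, i):04X}")

try:
    code_point_at(sample, 50)
except IndexError:
    print("index 50 is out of range")
```

Here index 2 yields U+00EF and index 6 yields U+1F600, even though the emoji occupies four bytes in UTF-8.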

Method 2: Using the codecs module

The codecs module provides the functions codecs.encode() and codecs.decode(), which convert data to and from a specified encoding format. Because the encoded bytes of a character are determined by its code point, we can encode a single character and then recover the code point from the resulting bytes. This is a roundabout route compared to ord(), but it shows how code points relate to encoded bytes.

Syntax

import codecs
byte_string = string[index].encode('utf-32-be')
code_point = int(codecs.encode(byte_string, 'hex'), 16)

Here, we encode the character at the given index as UTF-32 (big-endian), which stores every code point as exactly four bytes. The codecs.encode() function with the 'hex' codec returns those bytes as a bytes object of hexadecimal digits, such as b'00000065'. We convert this value to an integer using the int() function with a base of 16 to get the Unicode code point of the character. Encoding with UTF-8 instead would give the right answer only for ASCII characters, because every other character occupies several bytes whose combined value is not the code point.

Example

In the below example, we first take the character at index 1 of the string "Hello, World!" and encode it with codecs.encode() using the UTF-32 big-endian format ('utf-32-be'), which stores every code point as exactly four bytes without a byte order mark. We store the resulting byte string in the byte_string variable. Finally, we use the int.from_bytes() method to convert the byte string to an integer and print the Unicode code point in hexadecimal notation with a "U+" prefix using a formatted string literal.

import codecs

string = "Hello, World!"
index = 1
char = string[index]
# UTF-32BE stores each code point as four big-endian bytes
byte_string = codecs.encode(char, 'utf-32-be')
code_point = int.from_bytes(byte_string, byteorder='big')
print(f"The Unicode code point of the character '{string[index]}' at index {index} is U+{code_point:04X}.")

Output

The Unicode code point of the character 'e' at index 1 is U+0065.
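The choice of encoding matters whenever a code point is recovered from bytes. The sketch below compares UTF-8 and UTF-32 (big-endian) for a few arbitrary sample characters; the helper name bytes_as_int is hypothetical. Only the fixed-width UTF-32 bytes equal the code point for every character, while the UTF-8 bytes match only in the ASCII range.

```python
import codecs


def bytes_as_int(char: str, encoding: str) -> int:
    """Encode one character and read the bytes as a big-endian integer."""
    return int.from_bytes(codecs.encode(char, encoding), byteorder="big")


for char in ("e", "\u00e9", "\u20ac", "\U0001F600"):
    utf8_value = bytes_as_int(char, "utf-8")
    utf32_value = bytes_as_int(char, "utf-32-be")
    # ord() is the reference value; only the UTF-32 result always matches it.
    print(f"U+{ord(char):04X}  utf-8: {utf8_value:#x}  utf-32-be: {utf32_value:#x}")
```

For U+00E9, for instance, the UTF-8 bytes read as 0xc3a9, which is not the code point, whereas the UTF-32 bytes read as 0xe9.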

Method 3: Using the unicodedata module

The unicodedata module provides a function called unicodedata.name() that returns the official name of a Unicode character. The companion function unicodedata.lookup() maps a name back to its character. Neither function returns a code point directly, so the character returned by unicodedata.lookup() still has to be passed to ord().

Syntax

import unicodedata
char = string[index]
name = unicodedata.name(char)
code_point = ord(unicodedata.lookup(name))

Here, we first get the character at the specified index of the string and store it in the char variable. The unicodedata.name() function returns the character's Unicode name, unicodedata.lookup() converts that name back into the character, and the built-in ord() function returns its code point. A combining character (i.e., a character that modifies the appearance of the preceding character, such as an accent mark) needs no extra logic: it has its own code point, and because a Python 3 string is indexed by code point, ord() returns it directly. No surrogate-pair arithmetic is needed either, since a Python 3 string stores a character above U+FFFF as a single code point.
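Combining marks are ordinary code points in their own right. The sketch below, which uses an arbitrary sample string, shows that indexing a decomposed "é" returns the base letter and the accent separately, and that unicodedata.normalize() with the NFC form merges them into one code point where a precomposed character exists.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
for index, char in enumerate(decomposed):
    # combining() returns 0 for base characters and a non-zero class for marks.
    print(index, f"U+{ord(char):04X}", unicodedata.combining(char))

# NFC composes the pair into the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed), f"U+{ord(composed):04X}")  # 1 U+00E9
```

Whether index 1 refers to the accent or to the following character therefore depends on the normalization form of the string.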

Example

In the below example, we use the unicodedata module to get the name of the character 'e' at index 1 of the string "Hello, World!" with the unicodedata.name() function, which returns 'LATIN SMALL LETTER E'. The name itself does not contain the code point, so we pass it to unicodedata.lookup() to get the character back and call ord() on the result. We then use a formatted string literal (f-string) to print the code point in hexadecimal notation with a "U+" prefix.

import unicodedata

string = "Hello, World!"
index = 1
char = string[index]
name = unicodedata.name(char)
# lookup() returns the character with this name; ord() gives its code point
code_point = ord(unicodedata.lookup(name))
print(f"The Unicode code point of the character '{string[index]}' at index {index} is U+{code_point:04X}.")

Output

The Unicode code point of the character 'e' at index 1 is U+0065.
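Note that unicodedata.name() raises ValueError for characters that have no name, such as control characters. A minimal sketch of the name-and-lookup round trip with a fallback to ord() might look like this; the helper name code_point_via_name is hypothetical.

```python
import unicodedata


def code_point_via_name(char: str) -> int:
    """Return the code point of char by looking its Unicode name back up."""
    try:
        # name() raises ValueError for unnamed characters;
        # lookup() raises KeyError for names it does not know.
        return ord(unicodedata.lookup(unicodedata.name(char)))
    except (ValueError, KeyError):
        # Fall back to the direct route when no usable name exists.
        return ord(char)


newline = "\n"  # a control character without a Unicode name
print(f"U+{code_point_via_name('e'):04X}")      # U+0065
print(f"U+{code_point_via_name(newline):04X}")  # U+000A
```

The fallback makes it plain that the name lookup adds nothing over ord() for this task; it is mainly useful when the name is also of interest.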

Method 4: Using the array module

The array module provides a class called array.array() that can be used to create arrays of a specified type. We can create an array of unsigned integers and append the Unicode code point of each character in the string to the array. We can then access the Unicode code point of the character at a given index by indexing into the array.

Syntax

import array
code_points = array.array('I', [ord(char) for char in string])
code_point = code_points[index]

Here, we build an array of unsigned integers (type code 'I') that holds the code point of every character in the string, as returned by ord(). Indexing into the array at the given position then returns the code point of the character at that index.

Example

In the below example, we use the array module to create an array of unsigned integers with array.array(). A list comprehension computes the Unicode code point of each character in the string "Hello, World!", and the resulting list initializes the array. We then index into the array to get the Unicode code point of the character at index 1, and use a formatted string literal (f-string) to print the code point in hexadecimal notation with a "U+" prefix.

import array

string = "Hello, World!"
index = 1
code_points = array.array('I', [ord(char) for char in string])
code_point = code_points[index]
print(f"The Unicode code point of the character '{string[index]}' at index {index} is U+{code_point:04X}.")

Output

The Unicode code point of the character 'e' at index 1 is U+0065.
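One caveat: the 'I' type code is only guaranteed to be at least two bytes wide, although it is four bytes on most platforms. Since code points go up to U+10FFFF, a cautious variant can use 'L', which is guaranteed to be at least four bytes. The sketch below uses an arbitrary sample string.

```python
import array

text = "Hi \U0001F600"
# 'L' (unsigned long) is at least 4 bytes, so U+10FFFF always fits.
code_points = array.array("L", (ord(char) for char in text))

print(code_points.itemsize >= 4)   # True
print(f"U+{code_points[3]:04X}")   # U+1F600
print(code_points.tolist())        # [72, 105, 32, 128512]
```

Storing the code points in an array mainly pays off when the same string is indexed many times or the values are handed to code that expects a compact numeric buffer.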

Conclusion

In this article, we have discussed how to determine the Unicode code point at a given index. A Unicode code point is the unique number assigned to each character, and the ord() function is the most direct way to obtain it in Python. The codecs, unicodedata, and array modules reach the same result through encoded bytes, character names, and numeric arrays respectively.

Updated on: 11-Jul-2023
