Python Program to Determine the Unicode Code Point at a given index
A Unicode code point is a unique number that identifies a character in the Unicode standard. Unicode defines more than 140,000 characters, including letters, symbols, and emojis. In Python 3, strings are sequences of code points, so applying the built-in ord() function to string[index] returns the code point at that index. The codecs, unicodedata, and array modules complement ord() for encoding, character metadata, and bulk storage.
What is a Unicode Code Point?
Every Unicode character has a unique numeric identifier called a code point. Code points range from U+0000 to U+10FFFF and are written as a "U+" prefix followed by four to six hexadecimal digits (e.g., U+0065 for 'e').
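The notation can be illustrated with a short sketch. The helper name format_code_point is hypothetical, chosen for this example only; it relies on nothing beyond built-in string formatting, ord(), and chr().

```python
def format_code_point(code_point):
    """Render an integer code point in U+XXXX notation (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"{code_point} is outside the Unicode range")
    # At least four hex digits, more when the value needs them
    return f"U+{code_point:04X}"

print(format_code_point(ord("e")))   # U+0065
print(format_code_point(0x1F31F))    # U+1F31F
print(chr(0x65))                     # e (chr() is the inverse of ord())
```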
Method 1: Using ord() Function
The ord() function is the simplest way to get the Unicode code point of a character at a given index:
Syntax
code_point = ord(string[index])
Example
def get_unicode_code_point(string, index):
    char = string[index]
    code_point = ord(char)
    return code_point
# Test the function
string = "Hello, World!"
index = 1
code_point = get_unicode_code_point(string, index)
print(f"Character '{string[index]}' at index {index} has Unicode code point U+{code_point:04X}")
Character 'e' at index 1 has Unicode code point U+0065
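The function above raises IndexError when the index is out of range. One possible way to handle that, sketched here under the assumption that returning None is acceptable to the caller, is a small wrapper; code_point_at is a hypothetical name, not a standard function.

```python
def code_point_at(text, index):
    """Return the code point at index, or None if index is out of range.

    Hypothetical helper: returning None instead of raising is a design
    choice made for this sketch.
    """
    try:
        return ord(text[index])
    except IndexError:
        return None

sample = "h\u00e9llo"  # 'e' with acute accent, U+00E9, at index 1
print(code_point_at(sample, 1))    # 233, i.e. U+00E9
print(code_point_at(sample, -1))   # 111, negative indexes count from the end
print(code_point_at(sample, 10))   # None
```

Because Python 3 indexes strings by code point, the accented character occupies a single index even though it takes two bytes in UTF-8.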
Method 2: Using codecs Module
The codecs module provides encoding and decoding functions. Encoding a character as UTF-32 yields its code point as a single 4-byte integer:
Example
import codecs
string = "Hello, World!"
index = 1
char = string[index]
# UTF-32 (big-endian) stores each code point as one 4-byte integer
utf32_bytes = codecs.encode(char, 'utf-32-be')
code_point = int.from_bytes(utf32_bytes, 'big')
print(f"Character '{char}' at index {index} has Unicode code point U+{code_point:04X}")
# For comparison: the UTF-8 bytes of the same character
byte_string = codecs.encode(char, 'utf-8')
print(f"UTF-8 bytes: {byte_string}")
Character 'e' at index 1 has Unicode code point U+0065
UTF-8 bytes: b'e'
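For an ASCII character such as 'e', the code point and the UTF-8 byte happen to coincide, which can hide the difference between a code point and its encoded form. The sketch below uses the euro sign (U+20AC), a character chosen only for illustration, to show how one code point maps to different byte sequences.

```python
char = "\u20ac"  # EURO SIGN
code_point = ord(char)

# The same code point produces different bytes in each encoding
utf8 = char.encode("utf-8")
utf16 = char.encode("utf-16-be")
utf32 = char.encode("utf-32-be")

print(f"U+{code_point:04X}")  # U+20AC
print(utf8.hex())             # e282ac
print(utf16.hex())            # 20ac
print(utf32.hex())            # 000020ac

# Only the UTF-32 bytes read back directly as the code point
print(int.from_bytes(utf32, "big") == code_point)  # True
```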
Method 3: Using unicodedata Module
The unicodedata module provides additional Unicode character information:
Example
import unicodedata
string = "Hello, World! 🌟"
index = 14  # Glowing star emoji
char = string[index]
code_point = ord(char)
try:
    name = unicodedata.name(char)
    print(f"Character '{char}' at index {index}")
    print(f"Unicode code point: U+{code_point:04X}")
    print(f"Character name: {name}")
except ValueError:
    print(f"Character '{char}' has no Unicode name")
Character '🌟' at index 14
Unicode code point: U+1F31F
Character name: GLOWING STAR
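As an alternative to try/except, unicodedata.name() accepts a default value that is returned for characters without a name, such as control characters. The sketch below combines it with unicodedata.category(); the describe helper and the "<unnamed>" placeholder are assumptions made for this example.

```python
import unicodedata

def describe(char):
    """Summarize a character's code point, name, and general category."""
    code_point = ord(char)
    # The second argument is returned instead of raising ValueError
    name = unicodedata.name(char, "<unnamed>")
    category = unicodedata.category(char)
    return f"U+{code_point:04X} {name} [{category}]"

print(describe("e"))           # U+0065 LATIN SMALL LETTER E [Ll]
print(describe("\n"))          # U+000A <unnamed> [Cc]
print(describe("\U0001F31F"))  # U+1F31F GLOWING STAR [So]
```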
Method 4: Using array Module
The array module can store the Unicode code points of multiple characters compactly:
Example
import array
string = "Hello, World!"
index = 1
# Create array of Unicode code points for all characters
code_points = array.array('I', [ord(char) for char in string])
code_point = code_points[index]
print(f"Character '{string[index]}' at index {index} has Unicode code point U+{code_point:04X}")
print(f"All code points: {[f'U+{cp:04X}' for cp in code_points[:5]]}...")
Character 'e' at index 1 has Unicode code point U+0065
All code points: ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']...
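One caveat: the Python documentation guarantees only a 2-byte minimum for the 'I' type code, which would be too small for code points above U+FFFF on a platform that used that minimum. A more conservative sketch uses 'L', which is guaranteed at least 4 bytes; the sample text here is illustrative only.

```python
from array import array

text = "Hi \U0001F31F"

# 'L' holds at least 4 bytes per item, enough for any value up to U+10FFFF
code_points = array("L", map(ord, text))

print(code_points[3] == 0x1F31F)   # True

# chr() reverses ord(), so the original string can be rebuilt
restored = "".join(map(chr, code_points))
print(restored == text)            # True
```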
Comparison
| Method | Complexity | Best For |
|---|---|---|
| ord() | Simple | Single character lookup |
| codecs module | Complex | Encoding/decoding operations |
| unicodedata module | Medium | Character names and properties |
| array module | Medium | Bulk operations on multiple characters |
Conclusion
The ord() function is the most direct way to get the Unicode code point at a given index. Use unicodedata for character names and properties, and the array module to process many characters efficiently.
