spaCy - Container Token Class



This chapter explains the Token class in spaCy.

Token Class

As discussed previously, the Token class represents an individual token, such as a word, punctuation mark, whitespace, or symbol.
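As a minimal sketch of what this means in practice, the snippet below tokenizes a short string and lists the resulting tokens. It assumes spaCy is installed and uses a blank English pipeline (tokenizer only); the sample sentence is only an illustration.

```python
import spacy

# A blank English pipeline contains only the tokenizer,
# so no trained model has to be downloaded.
nlp = spacy.blank("en")
doc = nlp("Hello, world!")

# Iterating over a Doc yields Token objects; words and
# punctuation marks each become a separate token.
tokens = [token.text for token in doc]
print(tokens)  # ['Hello', ',', 'world', '!']
```

Tokens are normally obtained this way, by iterating over or indexing into a Doc, rather than by constructing them directly.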

Attributes

The table below explains its attributes −

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The parent document. |
| sent | Span | The sentence span that this token is a part of. Introduced in version 2.0.12. |
| text | unicode | The verbatim Unicode text content. |
| text_with_ws | unicode | The text content, with the trailing space character (if present). |
| whitespace_ | unicode | The trailing space character (if present). |
| orth | int | The ID of the verbatim Unicode text content. |
| orth_ | unicode | The verbatim Unicode text content, identical to Token.text. It exists mostly for consistency with the other attributes. |
| vocab | Vocab | The vocab object of the parent Doc. |
| tensor | ndarray | The token's slice of the parent Doc's tensor. Introduced in version 2.1.7. |
| head | Token | The syntactic parent of this token. |
| left_edge | Token | The leftmost token of this token's syntactic descendants. |
| right_edge | Token | The rightmost token of this token's syntactic descendants. |
| i | int | The index of the token within the parent document. |
| ent_type | int | The named entity type, as an integer ID. |
| ent_type_ | unicode | The named entity type, as a string. |
| ent_iob | int | The IOB code of the named entity tag. 3 = the token begins an entity, 2 = it is outside an entity, 1 = it is inside an entity, and 0 = no entity tag is set. |
| ent_iob_ | unicode | The IOB code of the named entity tag. "B" = the token begins an entity, "I" = it is inside an entity, "O" = it is outside an entity, and "" = no entity tag is set. |
| ent_kb_id | int | The knowledge base ID that refers to the named entity this token is a part of, as an integer ID. Introduced in version 2.2. |
| ent_kb_id_ | unicode | The knowledge base ID that refers to the named entity this token is a part of, as a string. Introduced in version 2.2. |
| ent_id | int | The ID of the entity the token is an instance of (if any), as an integer ID. This attribute is currently not used, but may be used for coreference resolution. |
| ent_id_ | unicode | The ID of the entity the token is an instance of (if any), as a string. This attribute is currently not used, but may be used for coreference resolution. |
| lemma | int | The base form of the token, with no inflectional suffixes, as an integer ID. |
| lemma_ | unicode | The base form of the token, with no inflectional suffixes, as a string. |
| norm | int | The token's norm, i.e. a normalized form of the token text, as an integer ID. |
| norm_ | unicode | The token's norm, i.e. a normalized form of the token text, as a string. |
| lower | int | The lowercase form of the token, as an integer ID. |
| lower_ | unicode | The lowercase form of the token text. Equivalent to Token.text.lower(). |
| shape | int | A transform of the token's string that shows orthographic features, as an integer ID. |
| shape_ | unicode | A transform of the token's string that shows orthographic features, for example "Xxxx" or "dd". |
| prefix | int | The hash value of a length-N substring from the start of the token. The default value is N=1. |
| prefix_ | unicode | A length-N substring from the start of the token. The default value is N=1. |
| suffix | int | The hash value of a length-N substring from the end of the token. The default value is N=3. |
| suffix_ | unicode | A length-N substring from the end of the token. The default value is N=3. |
| is_alpha | bool | Whether the token consists of alphabetic characters. Equivalent to token.text.isalpha(). |
| is_ascii | bool | Whether the token consists of ASCII characters. Equivalent to all(ord(c) < 128 for c in token.text). |
| is_digit | bool | Whether the token consists of digits. Equivalent to token.text.isdigit(). |
| is_lower | bool | Whether the token is in lowercase. Equivalent to token.text.islower(). |
| is_upper | bool | Whether the token is in uppercase. Equivalent to token.text.isupper(). |
| is_title | bool | Whether the token is in titlecase. Equivalent to token.text.istitle(). |
| is_punct | bool | Whether the token is punctuation. |
| is_left_punct | bool | Whether the token is a left punctuation mark, e.g. '('. |
| is_right_punct | bool | Whether the token is a right punctuation mark, e.g. ')'. |
| is_space | bool | Whether the token consists of whitespace characters. Equivalent to token.text.isspace(). |
| is_bracket | bool | Whether the token is a bracket. |
| is_quote | bool | Whether the token is a quotation mark. |
| is_currency | bool | Whether the token is a currency symbol. Introduced in version 2.0.8. |
| like_url | bool | Whether the token resembles a URL. |
| like_num | bool | Whether the token represents a number. |
| like_email | bool | Whether the token resembles an email address. |
| is_oov | bool | Whether the token is out-of-vocabulary, i.e. whether it lacks a word vector. |
| is_stop | bool | Whether the token is part of a "stop list". |
| pos | int | The coarse-grained part-of-speech from the Universal POS tag set, as an integer ID. |
| pos_ | unicode | The coarse-grained part-of-speech from the Universal POS tag set, as a string. |
| tag | int | The fine-grained part-of-speech, as an integer ID. |
| tag_ | unicode | The fine-grained part-of-speech, as a string. |
| dep | int | The syntactic dependency relation, as an integer ID. |
| dep_ | unicode | The syntactic dependency relation, as a string. |
| lang | int | The language of the parent document's vocabulary, as an integer ID. |
| lang_ | unicode | The language of the parent document's vocabulary, as a string. |
| prob | float | The smoothed log probability estimate of the token's word type. |
| idx | int | The character offset of the token within the parent document. |
| sentiment | float | A scalar value that indicates the positivity or negativity of the token. |
| lex_id | int | The sequential ID of the token's lexical type, used to index into tables. |
| rank | int | The sequential ID of the token's lexical type, used to index into tables. |
| cluster | int | The Brown cluster ID. |
| _ | Underscore | The user space for adding custom attribute extensions. |
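The following sketch reads a handful of these attributes. It rests on several assumptions: spaCy v3 is installed, a blank English pipeline is enough for the lexical attributes, and the syntactic annotations (pos, heads, deps) are supplied by hand through the v3 Doc constructor rather than predicted by a trained model. The sentences and annotations are illustrative only.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # tokenizer only; no trained components

# Lexical attributes are available without any trained model.
doc = nlp("Apple paid $100 today.")
apple, paid, dollar, amount, today, period = doc

print(apple.text, apple.i, apple.idx)   # text, token index, character offset
print(repr(apple.text_with_ws))         # text plus the trailing space
print(apple.lower_, apple.shape_, apple.prefix_, apple.suffix_)
print(dollar.is_currency, amount.like_num, period.is_punct)

# Attributes without a trailing underscore hold the integer ID
# of the string stored in the matching underscore attribute.
same_id = apple.orth == nlp.vocab.strings[apple.orth_]

# Syntactic attributes normally come from a trained tagger and parser.
# Here they are set by hand (assumption: spaCy v3 Doc keyword arguments,
# where each head is the absolute index of the token's parent).
parsed = Doc(
    nlp.vocab,
    words=["She", "reads", "books"],
    spaces=[True, True, False],
    pos=["PRON", "VERB", "NOUN"],
    heads=[1, 1, 1],
    deps=["nsubj", "ROOT", "dobj"],
)
she, reads, books = parsed
print(she.pos_, she.dep_, she.head.text)
print(reads.left_edge.text, reads.right_edge.text)
```

With a trained pipeline such as one loaded through spacy.load, the same attributes would be filled in by the model instead of by hand.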

Methods

The Token class provides the following methods −

| Sr.No. | Method | Description |
| --- | --- | --- |
| 1 | Token.__init__ | It is used to construct a Token object. |
| 2 | Token.similarity | It is used to compute a semantic similarity estimate. |
| 3 | Token.check_flag | It is used to check the value of a Boolean flag. |
| 4 | Token.__len__ | It is used to calculate the number of Unicode characters in the token. |
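A short sketch of two of these methods follows, assuming spaCy is installed and a blank English pipeline is used. The flag IDs IS_ALPHA and IS_PUNCT are imported from spacy.attrs. Token.similarity is left out because it depends on word vectors, which a blank pipeline does not provide.

```python
import spacy
from spacy.attrs import IS_ALPHA, IS_PUNCT

nlp = spacy.blank("en")
doc = nlp("Give it back!")
give, bang = doc[0], doc[3]

# Token.__len__ is called through the built-in len() and
# returns the number of Unicode characters in the token.
give_length = len(give)

# Token.check_flag takes the integer ID of a Boolean flag.
give_is_alpha = give.check_flag(IS_ALPHA)
bang_is_punct = bang.check_flag(IS_PUNCT)
bang_is_alpha = bang.check_flag(IS_ALPHA)

print(give_length, give_is_alpha, bang_is_punct, bang_is_alpha)
```

The flag checks return the same values as the corresponding Boolean attributes, such as is_alpha and is_punct.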

ClassMethods

The Token class provides the following classmethods −

| Sr.No. | Classmethod | Description |
| --- | --- | --- |
| 1 | Token.set_extension | It defines a custom attribute on the Token. |
| 2 | Token.get_extension | It looks up a previously registered extension by name. |
| 3 | Token.has_extension | It checks whether an extension has been registered on the Token class. |
| 4 | Token.remove_extension | It removes a previously registered extension from the Token class. |
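The four classmethods can be sketched together as shown below. This is an illustration, not a recommended design: the extension name is_fruit and its getter are hypothetical, and a blank English pipeline is assumed.

```python
import spacy
from spacy.tokens import Token

# "is_fruit" is a hypothetical attribute name chosen for this sketch.
FRUITS = {"apple", "pear"}

# Register the extension only once; registering an existing
# name again without force=True raises an error.
if not Token.has_extension("is_fruit"):
    Token.set_extension("is_fruit", getter=lambda token: token.lower_ in FRUITS)

nlp = spacy.blank("en")
doc = nlp("I ate an apple")

# Custom attributes live under the token's "_" user space.
flags = [token._.is_fruit for token in doc]
print(flags)  # [False, False, False, True]

# get_extension returns a (default, method, getter, setter) tuple.
default, method, getter, setter = Token.get_extension("is_fruit")

# remove_extension unregisters the attribute from the Token class.
Token.remove_extension("is_fruit")
still_registered = Token.has_extension("is_fruit")
```

Extensions are registered on the class, so they apply to every Token in every Doc until they are removed.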
