spaCy - Pretrain Command



The pretrain command is used to pre-train the “token to vector” (tok2vec) layer of pipeline components. For this purpose, it uses an approximate language-modeling objective.

Its working can be understood with the help of the following points −

  • First, it loads the pretrained vectors and then trains a component such as a CNN to predict vectors that match the pretrained ones.

  • It saves the weights to a directory after each epoch.

  • Once saved, we can pass the path to one of these pretrained weight files to the train command.

  • While loading the weights back in during spacy train, it is recommended to ensure that all settings are the same between pretraining and training.
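The steps above can be sketched as a two-step command sequence. This is only a sketch; the file and directory names (texts.jsonl, pretrain_output, model_output, train.json, dev.json) and the chosen weight file are hypothetical examples, and spaCy v2.x must be installed −

```shell
# Step 1: pretrain the tok2vec layer on raw text.
# Weights are saved to ./pretrain_output after each epoch.
python -m spacy pretrain texts.jsonl en_core_web_md ./pretrain_output --n-iter 100

# Step 2: pass one of the saved weight files to the train command via
# --init-tok2vec, keeping the settings consistent with pretraining.
python -m spacy train en ./model_output train.json dev.json --init-tok2vec ./pretrain_output/model99.bin
```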

The Pretrain command is as follows −

python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] [--conv-depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth] [--embed-rows] [--loss-func] [--dropout] [--batch-size] [--max-length] [--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save-every] [--init-tok2vec] [--epoch-start]

Arguments

The table below explains its arguments −

  • texts_loc (positional) − The path to a JSONL file with raw texts to learn from. The text is provided under the key "text", or tokens under the key "tokens".
  • vectors_model (positional) − The name of or path to a spaCy model with vectors to learn from.
  • output_dir (positional) − The directory to which models are written on each epoch.
  • --width, -cw (option) − The width of the CNN layers.
  • --conv-depth, -cd (option) − The depth of the CNN layers.
  • --cnn-window, -cW (option) − Introduced in version 2.2.2, the window size for CNN layers.
  • --cnn-pieces, -cP (option) − Introduced in version 2.2.2, the maxout size for CNN layers. For example, use 1 for Mish.
  • --use-chars, -chr (flag) − Introduced in version 2.2.2, defines whether to use character-based embedding.
  • --sa-depth, -sa (option) − Introduced in version 2.2.2, the depth of the self-attention layers.
  • --embed-rows, -er (option) − The number of embedding rows.
  • --loss-func, -L (option) − The loss function to use for the objective. It can be "cosine", "L2" or "characters".
  • --dropout, -d (option) − The dropout rate.
  • --batch-size, -bs (option) − The number of words per training batch.
  • --max-length, -xw (option) − The maximum number of words per example. Examples longer than this are discarded.
  • --min-length, -nw (option) − The minimum number of words per example. Examples shorter than this are discarded.
  • --seed, -s (option) − As the name implies, the seed for the random number generators.
  • --n-iter, -i (option) − The number of iterations to pretrain.
  • --use-vectors, -uv (flag) − Defines whether to use the static vectors as input features.
  • --n-save-every, -se (option) − Saves the model every X batches.
  • --init-tok2vec, -t2v (option) − Introduced in version 2.1, the path to pretrained weights for the token-to-vector parts of the models.
  • --epoch-start, -es (option) − Introduced in version 2.1.5, the epoch to start counting at. It is only relevant when using --init-tok2vec and the given weight file has been renamed; it prevents unintended overwriting of existing weight files.

Following is the JSONL format for raw text −

  • text − Its type is unicode. It represents the raw input text and is not required if tokens are available.

  • tokens − Its type is list, with one string per token. It allows providing optional pre-tokenized text.
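As a sketch, such a JSONL input file can be produced with plain Python. Each line is a standalone JSON object carrying either a "text" or a "tokens" key; the file name raw_texts.jsonl is a hypothetical example −

```python
import json

# Each record carries either raw text under "text" or a pre-tokenized
# list of strings under "tokens" (one string per token).
records = [
    {"text": "spaCy pre-trains the tok2vec layer on raw text."},
    {"tokens": ["This", "example", "is", "pre-tokenized", "."]},
]

# JSONL: one JSON object per line (hypothetical file name).
with open("raw_texts.jsonl", "w", encoding="utf8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading it back, line by line.
with open("raw_texts.jsonl", encoding="utf8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["text"])
print(loaded[1]["tokens"])
```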
