spaCy - Pretrain Command



The pretrain command is used to pre-train the “token to vector” (tok2vec) layer of pipeline components. For this purpose, it uses an approximate language-modeling objective.

Its working can be understood with the help of the following points −

  • First, it loads the pretrained vectors and then trains a component such as a CNN to predict vectors that match the pretrained ones.

  • It saves the weights to a directory after each epoch.

  • Once saved, we can pass the path to one of these pretrained weight files to the train command.

  • While loading the weights back in during spacy train, it is recommended to ensure that all settings are the same between pretraining and training.
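The steps above can be sketched as a two-step command sequence. This is only a sketch; the file and directory names (texts.jsonl, pretrain_output, model_output, train.json, dev.json) and the chosen weight file are hypothetical examples, and spaCy v2.x must be installed −

```shell
# Step 1: pretrain the tok2vec layer on raw text.
# Weights are saved to ./pretrain_output after each epoch.
python -m spacy pretrain texts.jsonl en_core_web_md ./pretrain_output --n-iter 100

# Step 2: pass one of the saved weight files to the train command via
# --init-tok2vec, keeping the settings consistent with pretraining.
python -m spacy train en ./model_output train.json dev.json --init-tok2vec ./pretrain_output/model99.bin
```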

The Pretrain command is as follows −

python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] [--conv-depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth] [--embed-rows] [--loss-func] [--dropout] [--batch-size] [--max-length] [--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save-every] [--init-tok2vec] [--epoch-start]

Arguments

The table below explains its arguments −

  • texts_loc (positional) − The path to a JSONL file with raw texts to learn from. The text is provided under the key "text", or tokens under the key "tokens".
  • vectors_model (positional) − The name of or path to a spaCy model with vectors to learn from.
  • output_dir (positional) − The directory to which models are written on each epoch.
  • --width, -cw (option) − The width of the CNN layers.
  • --conv-depth, -cd (option) − The depth of the CNN layers.
  • --cnn-window, -cW (option) − Introduced in version 2.2.2, the window size for CNN layers.
  • --cnn-pieces, -cP (option) − Introduced in version 2.2.2, the maxout size for CNN layers. For example, use 1 for Mish.
  • --use-chars, -chr (flag) − Introduced in version 2.2.2, defines whether to use character-based embedding.
  • --sa-depth, -sa (option) − Introduced in version 2.2.2, the depth of the self-attention layers.
  • --embed-rows, -er (option) − The number of embedding rows.
  • --loss-func, -L (option) − The loss function to use for the objective. It can be "cosine", "L2" or "characters".
  • --dropout, -d (option) − The dropout rate.
  • --batch-size, -bs (option) − The number of words per training batch.
  • --max-length, -xw (option) − The maximum number of words per example. Examples longer than this are discarded.
  • --min-length, -nw (option) − The minimum number of words per example. Examples shorter than this are discarded.
  • --seed, -s (option) − As the name implies, the seed for the random number generators.
  • --n-iter, -i (option) − The number of iterations to pretrain.
  • --use-vectors, -uv (flag) − Defines whether to use the static vectors as input features.
  • --n-save-every, -se (option) − Saves the model every X batches.
  • --init-tok2vec, -t2v (option) − Introduced in version 2.1, the path to pretrained weights for the token-to-vector parts of the models.
  • --epoch-start, -es (option) − Introduced in version 2.1.5, the epoch to start counting at. It is only relevant when using --init-tok2vec and the given weight file has been renamed; it prevents unintended overwriting of existing weight files.

Following is the JSONL format for raw text −

  • text − Its type is unicode. It represents the raw input text and is not required if tokens are available.

  • tokens − Its type is list, with one string per token. It allows providing optional pre-tokenized text.
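As a sketch, such a JSONL input file can be produced with plain Python. Each line is a standalone JSON object carrying either a "text" or a "tokens" key; the file name raw_texts.jsonl is a hypothetical example −

```python
import json

# Each record carries either raw text under "text" or a pre-tokenized
# list of strings under "tokens" (one string per token).
records = [
    {"text": "spaCy pre-trains the tok2vec layer on raw text."},
    {"tokens": ["This", "example", "is", "pre-tokenized", "."]},
]

# JSONL: one JSON object per line (hypothetical file name).
with open("raw_texts.jsonl", "w", encoding="utf8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading it back, line by line.
with open("raw_texts.jsonl", encoding="utf8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["text"])
print(loaded[1]["tokens"])
```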
