spaCy - Pretrain Command
It is used to pre-train the “token to vector” (tok2vec) layer of pipeline components. For this purpose, it uses an approximate language-modelling objective.
Its working can be understood with the help of the following points −
- First, it loads the pretrained vectors and then trains a component such as a CNN to predict vectors that match the pretrained ones.
- After each epoch, it saves the weights to the output directory.
- Once saved, we can pass the path to one of these pretrained weights files to the train command.
- When loading the weights back in during spacy train, it is recommended to make sure that all settings are the same between pretraining and training.
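The workflow above can be sketched by assembling the two commands in Python. All file paths and model names below are hypothetical placeholders, and the commands are only built, not executed (running them would require spaCy v2.x to be installed):

```python
# Sketch of the pretrain -> train workflow described above.
# Paths and model names are hypothetical placeholders.
pretrain_cmd = [
    "python", "-m", "spacy", "pretrain",
    "texts.jsonl",       # texts_loc: JSONL file with raw texts
    "en_core_web_md",    # vectors_model: model whose vectors are predicted
    "pretrain_out",      # output_dir: weights are saved here after each epoch
    "--n-iter", "100",   # number of pretraining iterations
]

train_cmd = [
    "python", "-m", "spacy", "train", "en", "model_out",
    "train.json", "dev.json",
    # Load the pretrained tok2vec weights saved by the pretrain step;
    # settings should match those used during pretraining.
    "--init-tok2vec", "pretrain_out/model99.bin",
]

# To actually run them (requires spaCy):
# import subprocess; subprocess.run(pretrain_cmd, check=True)
```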
The Pretrain command is as follows −
python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] [--width] [--conv-depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth] [--embed-rows] [--loss-func] [--dropout] [--batch-size] [--max-length] [--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save-every] [--init-tok2vec] [--epoch-start]
Arguments
The table below explains its arguments −
ARGUMENT | TYPE | DESCRIPTION |
---|---|---|
texts_loc | positional | This argument takes the path to a JSONL file with raw texts to learn from. Each record provides either raw text under the key "text" or tokens under the key "tokens". |
vectors_model | positional | It is the path to, or name of, the spaCy model with vectors to learn from. |
output_dir | positional | This argument represents the directory to which models are written after each epoch. |
--width, -cw | option | It represents the width of CNN layers. |
--conv-depth, -cd | option | It represents the depth of CNN layers. |
--cnn-window, -cW | option | Introduced in version 2.2.2, represents the window size for CNN layers. |
--cnn-pieces, -cP | option | Introduced in version 2.2.2, it represents the Maxout size for CNN layers; a value of 1 is equivalent to using Mish. |
--use-chars, -chr | flag | Introduced in version 2.2.2, it defines whether to use character-based embeddings. |
--sa-depth, -sa | option | Introduced in version 2.2.2, it represents the depth of self-attention layers. |
--embed-rows, -er | option | This argument takes the number of embedding rows. |
--loss-func, -L | option | It represents the loss function to use for the objective. It can be "cosine", "L2", or "characters". |
--dropout, -d | option | It represents the dropout rate. |
--batch-size, -bs | option | It is the number of words per training batch. |
--max-length, -xw | option | With this argument, you can specify the maximum number of words per example. Examples longer than the specified length are discarded. |
--min-length, -nw | option | With this argument, you can specify the minimum number of words per example. Examples shorter than the specified length are discarded. |
--seed, -s | option | As the name implies, it is the seed for the random number generators. |
--n-iter, -i | option | This argument is used to specify the number of iterations to pretrain. |
--use-vectors, -uv | flag | It defines whether to use the static vectors as input features or not. |
--n-save-every, -se | option | This argument will save the model every X batches. |
--init-tok2vec, -t2v | option | Introduced in version 2.1, defines the path to pretrained weights for the token-to-vector parts of the models. |
--epoch-start, -es | option | Introduced in version 2.1.5, represents the epoch to start counting at. It would only be relevant when using --init-tok2vec and the given weight file has been renamed. It also prevents unintended overwriting of existing weight files. |
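As an illustration of the --min-length / --max-length filters described in the table, here is a minimal sketch (not spaCy's actual implementation) of how examples outside the length bounds are discarded; the lengths and bounds are hypothetical:

```python
# Hypothetical token lists of length 1, 5 and 600.
examples = [["tok"] * n for n in (1, 5, 600)]

# Hypothetical --min-length / --max-length values.
min_length, max_length = 2, 500

# Keep only examples whose word count falls within the bounds;
# the others are discarded, as the table above describes.
kept = [ex for ex in examples if min_length <= len(ex) <= max_length]
# Only the 5-token example survives.
```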
The JSONL format for raw text is as follows −
- text − Its type is unicode, and it represents the raw input text. It is not required if tokens are available.
- tokens − Its type is list, and it provides an optional tokenization, with one string per token.
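A small sketch of writing such a JSONL corpus with Python's standard json module, one JSON object per line, using both of the keys described above (the sentence contents are hypothetical):

```python
import json

# Each record carries either raw text ("text") or a ready-made
# tokenization ("tokens"); both accepted shapes are shown here.
records = [
    {"text": "This is a raw input sentence."},
    {"tokens": ["This", "one", "is", "pre", "-", "tokenized", "."]},
]

# JSONL: one standalone JSON object per line.
jsonl = "\n".join(json.dumps(rec) for rec in records)
with open("texts.jsonl", "w", encoding="utf8") as f:
    f.write(jsonl + "\n")
```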