spaCy - Train Command



As the name implies, this command trains a model. It expects training data in spaCy’s JSON format, and on every epoch the model is saved to the output directory.

Model details and accuracy scores are added to a meta.json file so that the model can later be packaged with the spaCy package command.
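As a rough illustration of how the saved meta.json could be inspected after training, here is a minimal sketch. The helper name summarize_meta and the directory passed to it are hypothetical, and the "accuracy" key is an assumption about where the scores are stored; the other keys are the properties named in the arguments table.

```python
import json
from pathlib import Path


def summarize_meta(model_dir):
    """Read meta.json from a saved model directory and return selected fields.

    Missing keys come back as None rather than raising an error.
    """
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    return {
        "lang": meta.get("lang"),
        "pipeline": meta.get("pipeline"),
        "spacy_version": meta.get("spacy_version"),
        "accuracy": meta.get("accuracy"),  # assumed key name for the scores
    }


# Example call (hypothetical path to one of the saved model directories):
# print(summarize_meta("models/out/some-saved-model"))
```

Using .get() keeps the sketch usable even if a given meta.json does not contain every field.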

The syntax of the train command is as follows:

python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping]
[--n-examples] [--use-gpu] [--version] [--meta-path] [--init-tok2vec]
[--parser-multitasks] [--entity-multitasks] [--gold-preproc] [--noise-level]
[--orth-variant-level] [--learn-tokens] [--textcat-arch] [--textcat-multilabel]
[--textcat-positive-label] [--verbose]
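To make the argument order concrete, the following sketch assembles a train command as an argument list that could be passed to subprocess.run. The helper build_train_command and all paths are hypothetical; only the positional order and the flag names come from the syntax above.

```python
import shlex


def build_train_command(lang, output_path, train_path, dev_path, **options):
    """Return the argument list for `python -m spacy train`.

    Keyword options map to long flags: n_iter=10 becomes `--n-iter 10`,
    and a value of True becomes a bare flag such as `--verbose`.
    """
    cmd = ["python", "-m", "spacy", "train",
           lang, output_path, train_path, dev_path]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            cmd.append(flag)  # boolean flag, no value
        elif value is not False and value is not None:
            cmd.extend([flag, str(value)])  # option with a value
    return cmd


cmd = build_train_command(
    "en", "models/out", "data/train.json", "data/dev.json",  # hypothetical paths
    pipeline="tagger,parser",
    n_iter=10,
    verbose=True,
)
print(" ".join(shlex.quote(part) for part in cmd))
# To actually start training: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems when a path contains spaces.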

Arguments

The table below explains its arguments −

ARGUMENT TYPE DESCRIPTION
lang positional The language of the model.
output_path positional The directory to store the model in. It is created if it does not already exist.
train_path positional It is the location of JSON-formatted training data which can be a file or a directory of files.
dev_path positional It is the location of JSON-formatted development data for evaluation which can be a file or a directory of files.
--base-model, -b option Introduced in version 2.1, represents the name of the base model to update. It is optional and can be any loadable spaCy model.
--pipeline, -p option Introduced in version 2.1, represents the comma-separated names of the pipeline components to train. The default value is 'tagger,parser,ner'.
--replace-components, -R flag This argument will replace components from the base model.
--vectors, -v option It is the model from which the vectors should be loaded.
--n-iter, -n option The number of training iterations. The default value is 30.
--n-early-stopping, -ne option It represents the maximum number of training epochs without dev accuracy improvement.
--n-examples, -ns option The number of examples to use. The default value of 0 uses all examples.
--use-gpu, -g option The ID of the GPU to use. The default value of -1 means CPU only.
--version, -V option The version of the model.
--meta-path, -m option Introduced in version 2.0, represents an optional path to model meta.json. It will overwrite all the relevant properties like lang, pipeline and spacy_version.
--init-tok2vec, -t2v option Introduced in version 2.1, represents the path to pretrained weights for the token-to-vector parts of the models.
--parser-multitasks, -pt option The side objectives for the parser CNN, for example 'dep' or 'dep,tag'.
--entity-multitasks, -et option The side objectives for the NER CNN, for example 'dep' or 'dep,tag'.
--width, -cw option Introduced in version 2.2.4, represents the width of CNN layers of Tok2Vec component.
--conv-depth, -cd option Introduced in version 2.2.4, represents the depth of CNN layers of Tok2Vec component.
--cnn-window, -cW option Introduced in version 2.2.4, represents the window size for CNN layers of Tok2Vec component.
--cnn-pieces, -cP option Introduced in version 2.2.4, represents the maxout size for CNN layers of Tok2Vec component.
--bilstm-depth, -lstm option Introduced in version 2.2.4, represents the depth of BiLSTM layers of Tok2Vec component.
--embed-rows, -er option This argument represents the number of embedding rows of Tok2Vec component.
--noise-level, -nl option This argument indicates the amount of corruption for data augmentation. The value is a float.
--orth-variant-level, -ovl option This argument indicates the orthography variation for data augmentation.
--gold-preproc, -G flag This flag will use gold preprocessing.
--learn-tokens, -T flag This flag makes the parser learn gold-standard tokenization by merging the sub-tokens. It is typically used for languages like Chinese.
--textcat-multilabel, -TML flag Introduced in version 2.2, indicates that the text classification classes are not mutually exclusive (multilabel).
--textcat-arch, -ta option Introduced in version 2.2, represents the text classification model architecture. Default value is "bow".
--textcat-positive-label, -tpl option Introduced in version 2.2, represents the text classification positive label for binary classes with two labels.
--tag-map-path, -tm option Introduced in version 2.2.4, represents the location of JSON-formatted tag map.
--verbose, -VV flag Introduced in version 2.0.13, shows more detailed messages during training.
--help, -h flag This argument is used to show help message and available arguments.
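The --n-early-stopping argument is easiest to understand with a small simulation. The sketch below is a simplified illustration of the idea, not spaCy's actual implementation: it counts how many epochs would run before training stops, given a list of made-up dev accuracy scores. The function name epochs_until_stop is hypothetical, and treating a tied score as "no improvement" is an assumption.

```python
def epochs_until_stop(dev_scores, patience):
    """Return how many epochs run before early stopping triggers.

    Training stops once `patience` consecutive epochs pass without the
    dev score beating the best score seen so far.
    """
    best = float("-inf")
    since_best = 0
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best:
            best = score
            since_best = 0  # improvement resets the counter
        else:
            since_best += 1  # no improvement this epoch
        if since_best >= patience:
            return epoch
    return len(dev_scores)  # never triggered; all epochs ran


scores = [70.1, 72.4, 72.0, 72.3, 71.9, 73.0]  # made-up dev accuracies
print(epochs_until_stop(scores, patience=3))  # prints 5
```

With a patience of 3, training stops after epoch 5 and never reaches the better score in epoch 6, which shows the trade-off in choosing a small value.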