spaCy - Convert Command

As the name implies, this command converts files into spaCy’s JSON (JavaScript Object Notation) format for use with the train command and other experiment management functions.

The Convert command is as follows −

python -m spacy convert [input_file] [output_dir] [--file-type] [--converter] [--n-sents] [--morphology] [--lang]

Arguments

The table below explains its arguments −

ARGUMENT	TYPE	DESCRIPTION
input_file	positional	The input file to convert.
output_dir	positional	The output directory for the converted file. Defaults to "-", meaning data will be written to stdout.
--file-type, -t	option	The type of file to create.
--converter, -c	option	The name of the converter to use.
--n-sents, -n	option	The number of sentences per document.
--seg-sents, -s	flag	Segments sentences (for -c ner).
--model, -b	option	The model to use for parser-based sentence segmentation (for -s).
--morphology, -m	flag	Enables appending morphology to tags.
--lang, -l	option	The language code, used if a tokenizer is required.
--help, -h	flag	Shows the help message and the available arguments.
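As a rough illustration of how the arguments in the table combine, the sketch below assembles the command line as a Python argument list. The helper name build_convert_command and the file names train.conllu and corpus are hypothetical, and only flags listed in the table above are used; it is a sketch under those assumptions, not part of spaCy itself.

```python
import sys


def build_convert_command(input_file, output_dir="-", file_type=None,
                          converter=None, n_sents=None, lang=None,
                          seg_sents=False, morphology=False):
    """Build an argv list for `python -m spacy convert`.

    Hypothetical helper: it only arranges the flags from the arguments
    table and does not run or validate anything.
    """
    cmd = [sys.executable, "-m", "spacy", "convert", input_file, output_dir]
    # Options take a value, so each one adds two items to the list.
    if file_type is not None:
        cmd += ["--file-type", file_type]
    if converter is not None:
        cmd += ["--converter", converter]
    if n_sents is not None:
        cmd += ["--n-sents", str(n_sents)]
    if lang is not None:
        cmd += ["--lang", lang]
    # Flags take no value; they are either present or absent.
    if seg_sents:
        cmd.append("--seg-sents")
    if morphology:
        cmd.append("--morphology")
    return cmd


# Hypothetical input file and output directory.
cmd = build_convert_command("train.conllu", "corpus",
                            converter="conllu", n_sents=10, lang="en")
print(" ".join(cmd[1:]))
```

If spaCy is installed, the resulting list could be passed to subprocess.run(cmd, check=True). Building the list first keeps each option next to its value and avoids shell quoting problems with file names.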

The following output file types can be generated with this command −

json − Regular JSON. This is the default output file type.
jsonl − Newline-delimited JSON.
msg − Binary MessagePack format.
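To make the difference between regular JSON and newline-delimited JSON concrete, here is a small sketch using only Python's standard json module. The records are made-up placeholders and do not follow spaCy's training data schema.

```python
import json

# Placeholder records, not spaCy's actual training format.
records = [
    {"id": 0, "text": "First document."},
    {"id": 1, "text": "Second document."},
]

# Regular JSON: the whole collection is a single JSON value.
as_json = json.dumps(records, indent=2)

# Newline-delimited JSON: one complete JSON object per line.
as_jsonl = "\n".join(json.dumps(record) for record in records)

# A JSONL file can be read back one line at a time.
restored = [json.loads(line) for line in as_jsonl.splitlines()]
print(as_jsonl)
```

Because each line is an independent object, the newline-delimited form can be processed as a stream without loading the entire file into memory.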

Converter Options

The following table shows the converter options −

Sr.No.	ID & Description
1	auto − Automatically picks a converter based on the file extension and file content.
2	conll, conllu, conllubio − The Universal Dependencies .conllu or .conll format.
3	ner − NER with IOB/IOB2 tags, one token per line, with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line -DOCSTART- -X- O O. Supports the CoNLL 2003 NER format.
4	iob − NER with IOB/IOB2 tags, one sentence per line, with tokens separated by whitespace and annotations separated by |, either word|B-ENT or word|POS|B-ENT.
5	jsonl − NER data formatted as JSONL, with one dict per line and a "text" and "spans" key.
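The input layouts expected by the ner and iob converters can be illustrated with a short parsing sketch. This is not spaCy's converter code. The function names parse_ner_format and parse_iob_line and the sample sentences are hypothetical, and the code only mirrors the format descriptions in the table.

```python
def parse_ner_format(text):
    """Parse the `ner` layout: one token per line, whitespace-separated
    columns, blank lines between sentences, -DOCSTART- between documents.
    Returns documents -> sentences -> (token, iob_tag) pairs."""
    docs, sents, sent = [], [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("-DOCSTART-"):
            # Document boundary: close the open sentence and document.
            if sent:
                sents.append(sent)
                sent = []
            if sents:
                docs.append(sents)
                sents = []
        elif not line:
            # Blank line: sentence boundary.
            if sent:
                sents.append(sent)
                sent = []
        else:
            cols = line.split()
            # First column is the token, final column is the IOB tag.
            sent.append((cols[0], cols[-1]))
    if sent:
        sents.append(sent)
    if sents:
        docs.append(sents)
    return docs


def parse_iob_line(line):
    """Parse the `iob` layout: one sentence per line, items written as
    word|B-ENT or word|POS|B-ENT. Returns (word, pos, tag) triples,
    with pos set to None when it is not given."""
    parsed = []
    for item in line.split():
        parts = item.split("|")
        if len(parts) == 2:
            parsed.append((parts[0], None, parts[1]))
        elif len(parts) == 3:
            parsed.append((parts[0], parts[1], parts[2]))
        else:
            raise ValueError("unexpected item: " + item)
    return parsed


# Made-up sample data in the two layouts.
ner_sample = """-DOCSTART- -X- O O

Alice NNP B-NP B-PER
visited VBD B-VP O
Paris NNP B-NP B-LOC

She PRP B-NP O
left VBD B-VP O
"""
docs = parse_ner_format(ner_sample)
iob_tokens = parse_iob_line("Alice|NNP|B-PER visited|VBD|O Paris|NNP|B-LOC")
print(docs[0][0])
print(iob_tokens)
```

The sketch assumes that tokens in the iob layout never contain the | character themselves; such a token would need special handling.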