spaCy - Convert Command



As the name implies, this command converts files into spaCy's JSON (JavaScript Object Notation) format, mainly for use with the train command and other experiment management functions.

The Convert command is as follows −

python -m spacy convert [input_file] [output_dir] [--file-type] [--converter] [--n-sents] [--morphology] [--lang]

Arguments

The table below explains its arguments −

ARGUMENT TYPE DESCRIPTION
input_file positional The input file.
output_dir positional The output directory for the converted file. Defaults to "-", meaning data is written to stdout.
--file-type, -t option The type of file to create.
--converter, -c option The name of the converter to use.
--n-sents, -n option The number of sentences per document.
--seg-sents, -s flag Segments sentences (for -c ner).
--model, -b option The model for parser-based sentence segmentation (for -s).
--morphology, -m option Enables appending morphology to tags.
--lang, -l option The language code, used if a tokenizer is required.
--help, -h flag Shows the help message and the available arguments.
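
As a rough illustration of how the arguments above fit together, the following Python sketch assembles a convert command line as a list of strings. The helper name build_convert_command, the file name train.conllu and the directory ./converted are hypothetical and used only for this example. Options left as None are omitted so that spaCy falls back to its own defaults.

```python
import sys


def build_convert_command(input_file, output_dir="-", file_type=None,
                          converter=None, n_sents=None, lang=None,
                          seg_sents=False):
    """Assemble the argument list for `python -m spacy convert`.

    Hypothetical helper for illustration only; it uses just the
    arguments listed in the table above.
    """
    cmd = [sys.executable, "-m", "spacy", "convert", input_file, output_dir]
    if file_type is not None:
        cmd += ["--file-type", file_type]
    if converter is not None:
        cmd += ["--converter", converter]
    if n_sents is not None:
        cmd += ["--n-sents", str(n_sents)]
    if seg_sents:
        cmd.append("--seg-sents")  # a flag, so it takes no value
    if lang is not None:
        cmd += ["--lang", lang]
    return cmd


# Placeholder paths: replace them with your own file and directory.
command = build_convert_command("train.conllu", "./converted",
                                file_type="json", converter="conllu",
                                n_sents=10)
print(" ".join(command[1:]))

# To actually run the conversion (requires spaCy to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list, instead of a single string, avoids shell quoting problems when a path contains spaces.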

This command can generate the following output file types −

  • json − Regular JSON. This is the default output file type.

  • jsonl − Newline-delimited JSON.

  • msg − Binary MessagePack format.
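
The difference between the json and jsonl containers can be sketched with Python's standard json module. The records below are placeholders chosen for this example and do not follow spaCy's actual training data schema. The msg type is left out because it needs a third-party MessagePack library.

```python
import json

# Placeholder records, not spaCy's real training format.
records = [
    {"id": 0, "text": "Alex visited Berlin."},
    {"id": 1, "text": "She left on Monday."},
]

# Regular JSON: the whole collection is a single JSON value.
as_json = json.dumps(records, indent=2)

# Newline-delimited JSON: one complete JSON object per line.
as_jsonl = "\n".join(json.dumps(record) for record in records)

# JSONL can be read back line by line without loading everything at once.
parsed_back = [json.loads(line) for line in as_jsonl.splitlines()]

print(as_jsonl)
```

Because every line of a jsonl file is a self-contained object, large files can be streamed record by record, while a regular json file has to be parsed as a whole.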

Converter Options

The following table shows the converter options −

Sr.No. ID & Description
1

auto

It automatically picks a converter based on the file extension and the file content.

2

conll, conllu, conllubio

These converters handle the Universal Dependencies .conllu or .conll format.

3

ner

It is NER with IOB/IOB2 tags, with one token per line and columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines, and documents are separated by the line -DOCSTART- -X- O O. It supports the CoNLL 2003 NER format.

4

iob

It is NER with IOB/IOB2 tags, with one sentence per line, tokens separated by whitespace and annotations separated by |, as either word|B-ENT or word|POS|B-ENT.

5

jsonl

It is NER data formatted as JSONL, with one dict per line, each containing a "text" and a "spans" key.
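
To make the layouts expected by the ner and iob converters more concrete, the following sketch reads a tiny sample of each with plain Python. It is only an illustration of the two input formats and not spaCy's own converter code. The sample sentences and the helper names read_ner and read_iob are made up for this example.

```python
# Sample in the layout described for the "ner" converter:
# one token per line, the token first and the IOB tag last.
NER_SAMPLE = """-DOCSTART- -X- O O

Alex NNP B-PER
visited VBD O
Berlin NNP B-LOC

She PRP O
left VBD O
"""

# Sample in the layout described for the "iob" converter:
# one sentence per line, annotations separated by "|".
IOB_SAMPLE = "Alex|NNP|B-PER visited|VBD|O Berlin|NNP|B-LOC"


def read_ner(text):
    """Return documents -> sentences -> (token, tag) pairs."""
    docs, current_doc, sentence = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("-DOCSTART-") or not line:
            # A blank line or a document marker ends the current sentence.
            if sentence:
                current_doc.append(sentence)
                sentence = []
            # A document marker also ends the current document.
            if line and current_doc:
                docs.append(current_doc)
                current_doc = []
        else:
            columns = line.split()
            sentence.append((columns[0], columns[-1]))
    if sentence:
        current_doc.append(sentence)
    if current_doc:
        docs.append(current_doc)
    return docs


def read_iob(line):
    """Return (token, tag) pairs for one sentence line."""
    pairs = []
    for item in line.split():
        parts = item.split("|")
        # Works for both word|B-ENT and word|POS|B-ENT.
        pairs.append((parts[0], parts[-1]))
    return pairs


ner_docs = read_ner(NER_SAMPLE)
iob_pairs = read_iob(IOB_SAMPLE)
print(len(ner_docs), "document,", len(ner_docs[0]), "sentences")
print(iob_pairs)
```

Both samples carry the same annotation for the first sentence. The only difference is whether the tokens are spread over lines (ner) or kept on one line with | as the separator (iob).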
