spaCy - Convert Command
As the name implies, this command converts files into spaCy's JSON (JavaScript Object Notation) format, mainly for use with the train command and other experiment management functions.
The Convert command is as follows −
python -m spacy convert [input_file] [output_dir] [--file-type] [--converter] [--n-sents] [--morphology] [--lang]
Arguments
The table below explains its arguments −
ARGUMENT | TYPE | DESCRIPTION |
---|---|---|
input_file | positional | The input file to convert. |
output_dir | positional | The output directory for the converted file. Defaults to "-", meaning data is written to stdout. |
--file-type, -t | option | The type of file to create. |
--converter, -c | option | The name of the converter to use. |
--n-sents, -n | option | The number of sentences per document. |
--seg-sents, -s | flag | Segments sentences (for -c ner). |
--model, -b | option | The model for parser-based sentence segmentation (for -s). |
--morphology, -m | option | Enables appending morphology to tags. |
--lang, -l | option | The language code, used if a tokenizer is required. |
--help, -h | flag | Shows the help message and the available arguments. |
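The arguments above can also be assembled from a script. The following is a minimal sketch: the helper build_convert_command, the input file train.conllu and the output directory ./converted are all hypothetical names chosen for illustration and are not part of spaCy. The code only builds the argument list and does not run the command.

```python
import sys


def build_convert_command(input_file, output_dir="-", file_type=None,
                          converter=None, n_sents=None, lang=None):
    """Return the argv list for `python -m spacy convert` (hypothetical helper)."""
    # Positional arguments come first: input file, then output directory.
    cmd = [sys.executable, "-m", "spacy", "convert", input_file, output_dir]
    # Options are appended only when a value is supplied.
    if file_type is not None:
        cmd += ["--file-type", file_type]
    if converter is not None:
        cmd += ["--converter", converter]
    if n_sents is not None:
        cmd += ["--n-sents", str(n_sents)]
    if lang is not None:
        cmd += ["--lang", lang]
    return cmd


# Assumed example values; adjust to your own files.
cmd = build_convert_command("train.conllu", "./converted",
                            file_type="json", converter="conllu", n_sents=10)
print(" ".join(cmd[1:]))
```

With spaCy installed, the resulting list could be passed to subprocess.run(cmd, check=True) to perform the conversion.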
The following output file types can be generated with this command −
json − Regular JSON. This is the default output file type.
jsonl − Newline-delimited JSON.
msg − Binary MessagePack format.
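To make the difference between the json and jsonl file types concrete, here is a small sketch using only Python's standard json module. The records are made-up placeholders and do not follow spaCy's actual training schema; the point is only the layout of the two formats.

```python
import json

# Placeholder records (an assumption for illustration, not spaCy's schema).
records = [
    {"id": 0, "text": "First document."},
    {"id": 1, "text": "Second document."},
]

# json: the whole collection is serialized as one JSON value.
as_json = json.dumps(records)

# jsonl: each record is serialized as its own JSON object on its own line.
as_jsonl = "\n".join(json.dumps(record) for record in records)

# Reading JSONL back means parsing it line by line.
parsed = [json.loads(line) for line in as_jsonl.splitlines()]

print(as_json)
print(as_jsonl)
```

Because every line of a jsonl file is a complete JSON object, such files can be read and written one record at a time, which is convenient for large corpora.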
Converter Options
The following table shows the converter options −
Sr.No. | ID & Description |
---|---|
1 | auto − Automatically picks a converter based on the file extension and file content. |
2 | conll, conllu, conllubio − Universal Dependencies .conllu or .conll format. |
3 | ner − NER with IOB/IOB2 tags, with one token per line and columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line -DOCSTART- -X- O O. Supports the CoNLL 2003 NER format. |
4 | iob − NER with IOB/IOB2 tags, with one sentence per line, tokens separated by whitespace and annotations separated by |, as either word|B-ENT or word|POS|B-ENT. |
5 | jsonl − NER data formatted as JSONL, with one dict per line containing a "text" and a "spans" key. |
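As an illustration of the iob input layout described above, the following sketch splits one sentence line into its tokens and annotations. The function parse_iob_line and the sample sentence are hypothetical and written only for this example; this is not spaCy's own converter code.

```python
def parse_iob_line(line):
    """Split one sentence line into (word, pos, iob_tag) tuples.

    Accepts both word|B-ENT and word|POS|B-ENT; pos is None when absent.
    """
    tokens = []
    for item in line.split():          # tokens are separated by whitespace
        parts = item.split("|")        # annotations are separated by |
        if len(parts) == 2:
            word, tag = parts
            pos = None
        elif len(parts) == 3:
            word, pos, tag = parts
        else:
            raise ValueError(f"Unexpected token annotation: {item!r}")
        tokens.append((word, pos, tag))
    return tokens


# Made-up sample sentence in the word|POS|B-ENT layout.
sentence = "Alice|NNP|B-PERSON visited|VBD|O Paris|NNP|B-GPE"
tokens = parse_iob_line(sentence)
for word, pos, tag in tokens:
    print(word, pos, tag)
```

A file prepared in this layout, with one sentence per line, is the kind of input the iob converter expects.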