Subclasses#

dacy.subclasses.wrappers#

Convenience functions for wrapping DaNLP and Hugging Face models in a spaCy text-processing pipeline.

dacy.subclasses.wrappers.add_huggingface_model(nlp: spacy.language.Language, download_name: str, doc_extension: str, model_name: str, category: str, labels: list, force_extension: bool = False) → spacy.language.Language[source]#

Adds a Hugging Face sequence classification model to the spaCy pipeline.

Parameters
  • nlp (Language) – A spaCy text-processing pipeline.

  • download_name (str) – The name of the model you wish to download.

  • doc_extension (str) – The Doc extension under which you wish to save the transformer data. This includes the output tensor, wordpieces and more.

  • model_name (str) – The name the model should have in the nlp pipeline.

  • category (str) – The category of the output. This is the name of the Doc extension used to read the model's prediction. E.g. “sentiment” would allow you to extract the sentiment from doc._.sentiment.

  • labels (list) – The labels of the model

  • force_extension (bool, optional) – Set the extension to the doc regardless of whether it already exists. Defaults to False.

Returns

Your text-processing pipeline with the transformer model included.

Return type

Language

Example

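>>> import spacy
>>> from dacy.subclasses.wrappers import add_huggingface_model
>>> nlp = spacy.blank("da")  # any existing spaCy pipeline works here
>>> nlp = add_huggingface_model(nlp, download_name="pin/senda", doc_extension="senda_trf_data", model_name="senda", category="polarity", labels=["negative", "neutral", "positive"])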
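
How the category and labels arguments relate to the model output can be illustrated with a small standalone sketch. It is not DaCy's implementation: the helper names softmax and classify are hypothetical, and it assumes the model emits one logit per label, in the same order as the labels list.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Hypothetical helper: pair each label with its probability and pick the most likely one.

    Assumes one logit per label, in the same order as `labels`.
    """
    if len(logits) != len(labels):
        raise ValueError("expected exactly one logit per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "prob": dict(zip(labels, probs))}


labels = ["negative", "neutral", "positive"]
result = classify([-1.2, 0.3, 2.5], labels)
print(result["label"])  # prints "positive"
```

In the pipeline, a value of this kind is what you would read from the Doc extension named by category, for instance doc._.polarity in the example above.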

dacy.subclasses.classification_transformer#

Functions for wrapping a sequence classification transformer in a spaCy pipeline.

class dacy.subclasses.classification_transformer.ClassificationTransformer(vocab: spacy.vocab.Vocab, model: thinc.model.Model[typing.List[spacy.tokens.doc.Doc], spacy_transformers.data_classes.FullTransformerBatch], set_extra_annotations: typing.Callable = <function null_annotation_setter>, *, name: str = 'classification_transformer', max_batch_items: int = 4096, doc_extension_attribute)[source]#

Bases: spacy_transformers.pipeline_component.Transformer

from_disk(path: Union[str, pathlib.Path], *, num_labels: int, exclude: Iterable[str] = ()) → spacy_transformers.pipeline_component.Transformer[source]#

Load the pipe from disk. For more see: https://spacy.io/api/transformer#from_disk

Parameters
  • path (str) – Path to a directory.

  • exclude (Iterable[str]) – String names of serialization fields to exclude.

  • num_labels (int) – Number of labels of the model. Required for reading the model into memory.

Returns

The loaded object.

Return type

Transformer

set_annotations(docs: Iterable[spacy.tokens.doc.Doc], predictions: spacy_transformers.data_classes.FullTransformerBatch) → None[source]#

Assign the extracted features to the Doc objects. By default, the TransformerData object is written to the doc._.trf_data attribute. Your set_extra_annotations callback is then called, if provided. For more see https://spacy.io/api/pipe#set_annotations

Parameters
  • docs (Iterable[Doc]) – The documents to modify.

  • predictions (FullTransformerBatch) – A batch of activations.

dacy.subclasses.classification_transformer.ClassificationTransformerModel(name: str, get_spans: Callable, tokenizer_config: dict, num_labels: int) → thinc.model.Model[List[spacy.tokens.doc.Doc], spacy_transformers.data_classes.FullTransformerBatch][source]#

Parameters
  • get_spans (Callable[[List[Doc]], List[Span]]) – A function to extract spans from the batch of Doc objects. This is used to manage long documents, by cutting them into smaller sequences before running the transformer. The spans are allowed to overlap, and you can also omit sections of the Doc if they are not relevant.

  • tokenizer_config (dict) – Settings to pass to the transformers tokenizer.
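
The idea behind get_spans, cutting a long document into shorter and possibly overlapping sequences, can be sketched on a plain list of tokens. The function name strided_windows and its window and stride parameters are hypothetical and only illustrate the technique. A real get_spans callable receives a list of Doc objects and returns spans, as described above.

```python
def strided_windows(tokens, window, stride):
    """Cut a token sequence into windows of at most `window` tokens.

    Consecutive windows start `stride` tokens apart, so they overlap when
    stride < window and skip tokens when stride > window.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows


tokens = list(range(10))
overlapping = strided_windows(tokens, window=4, stride=2)
print(overlapping)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With stride equal to window the sequences tile the document without overlap. A smaller stride gives each token more surrounding context at the cost of running the transformer on more sequences.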

dacy.subclasses.classification_transformer.huggingface_classification_from_pretrained(source: Union[pathlib.Path, str], config: Dict, num_labels: int)[source]#

Create a Huggingface transformer model from pretrained weights. Will download the model if it is not already downloaded.

Parameters
  • source (Union[str, Path]) – The name of the model or a path to it, such as ‘bert-base-cased’.

  • config (dict) – Settings to pass to the tokenizer.

dacy.subclasses.classification_transformer.init(model: thinc.model.Model, X=None, Y=None)[source]#
dacy.subclasses.classification_transformer.install_classification_extensions(category: str, labels: list, doc_extension: str, force: bool)[source]#
dacy.subclasses.classification_transformer.install_extensions(doc_extension_attribute) → None[source]#
dacy.subclasses.classification_transformer.make_classification_getter(category, labels, doc_extension)[source]#
dacy.subclasses.classification_transformer.make_classification_transformer(nlp: spacy.language.Language, name: str, model: thinc.model.Model[List[spacy.tokens.doc.Doc], spacy_transformers.data_classes.FullTransformerBatch], set_extra_annotations: Callable[[List[spacy.tokens.doc.Doc], spacy_transformers.data_classes.FullTransformerBatch], None], max_batch_items: int, doc_extension_attribute: str)[source]#

Construct a Transformer component, which lets you plug a model from the Hugging Face transformers library into spaCy so you can use it in your pipeline. One or more subsequent spaCy components can use the transformer outputs as features in their models, with gradients backpropagated to a single set of shared weights.

Parameters
  • nlp (Language) – A spaCy text-processing pipeline.

  • name (str) – The desired name of the component

  • model (Model[List[Doc], FullTransformerBatch]) – A thinc Model object wrapping the transformer. Usually you will want to use the TransformerModel layer for this.

  • set_extra_annotations (Callable[[List[Doc], FullTransformerBatch], None]) – A callback to set additional information onto the batch of Doc objects. The doc._.clf_trf_data attribute is set prior to calling the callback. By default, no additional annotations are set.

  • max_batch_items (int) – Maximum size of a padded batch.

  • doc_extension_attribute (str) – The name of the Doc extension attribute you wish to use.

Returns

Your ClassificationTransformer component