Augmenters

dacy.augmenters.character

spaCy augmenters for character-level augmentation.

dacy.augmenters.character.create_char_random_augmenter(doc_level: float, char_level: float, keyboard: Union[str, dacy.augmenters.keyboard.Keyboard] = 'QWERTY_EN') -> Callable[[spacy.language.Language, spacy.training.example.Example], Iterator[spacy.training.example.Example]]

Creates an augmenter that replaces a character with a random character from the keyboard.

Parameters
  • doc_level (float) – probability to augment document.

  • char_level (float) – probability to augment character, if document is augmented.

  • keyboard (str, Keyboard, optional) – A Keyboard instance, or a string naming a default keyboard, from which replacement characters are sampled. String options: “QWERTY_EN” (English QWERTY keyboard) and “QWERTY_DA” (Danish QWERTY keyboard). Defaults to “QWERTY_EN”.

Returns

The augmenter function.

Return type

Callable[[Language, Example], Iterator[Example]]
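
The two-stage sampling described by doc_level and char_level (first decide per document, then per character) can be sketched in plain Python. This is only an illustration on raw strings, not dacy's implementation: random_char_augment, KEYS and the rng argument are hypothetical names, and KEYS stands in for a real keyboard layout.

```python
import random

# Hypothetical stand-in for a keyboard layout: a pool of keys to sample from.
KEYS = "qwertyuiopasdfghjklzxcvbnm"


def random_char_augment(text, doc_level, char_level, keys=KEYS, rng=None):
    """Replace characters with random keys, gated per document and per character."""
    rng = rng or random.Random()
    # Document-level gate: leave the text untouched with probability 1 - doc_level.
    if rng.random() >= doc_level:
        return text
    # Character-level gate: replace each character with probability char_level.
    return "".join(
        rng.choice(keys) if rng.random() < char_level else ch for ch in text
    )
```

With doc_level=0.0 or char_level=0.0 the text comes back unchanged; with both set to 1.0 every character is replaced.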

dacy.augmenters.character.create_char_replace_augmenter(doc_level: float, char_level: float, replacement: dict) -> Callable[[spacy.language.Language, spacy.training.example.Example], Iterator[spacy.training.example.Example]]

Creates an augmenter that replaces a character with one of its potential replacements from the replacement dictionary.

Parameters
  • doc_level (float) – probability to augment document.

  • char_level (float) – probability to augment character, if document is augmented.

  • replacement (dict) – A dictionary mapping each character to its potential replacements, e.g. {“æ”: “ae”}.

Returns

The augmenter function.

Return type

Callable[[Language, Example], Iterator[Example]]
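
As a rough illustration of dictionary-based replacement, the sketch below works on plain strings. It assumes each dictionary value is a single replacement string, as in the {“æ”: “ae”} example; char_replace_augment and rng are hypothetical names, and the code is not taken from dacy.

```python
import random


def char_replace_augment(text, doc_level, char_level, replacement, rng=None):
    """Replace characters found in `replacement`, each with probability char_level."""
    rng = rng or random.Random()
    # Document-level gate.
    if rng.random() >= doc_level:
        return text
    out = []
    for ch in text:
        # Assumption: the dictionary value is the replacement string itself.
        if ch in replacement and rng.random() < char_level:
            out.append(replacement[ch])
        else:
            out.append(ch)
    return "".join(out)
```

Characters without an entry in the dictionary are always left as they are.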

dacy.augmenters.character.create_char_swap_augmenter(doc_level: float, char_level: float) -> Callable[[spacy.language.Language, spacy.training.example.Example], Iterator[spacy.training.example.Example]]

Creates an augmenter that swaps two characters in a token.

Parameters
  • doc_level (float) – probability to augment document.

  • char_level (float) – probability to augment character, if document is augmented.

Returns

The augmenter function.

Return type

Callable[[Language, Example], Iterator[Example]]
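
A character swap can be sketched as follows. The documentation only states that two characters in a token are swapped; swapping adjacent characters is an assumption made here for illustration, and char_swap_augment is a hypothetical name.

```python
import random


def char_swap_augment(token, char_level, rng=None):
    """Swap neighbouring character pairs, each with probability char_level."""
    rng = rng or random.Random()
    chars = list(token)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < char_level:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)
```

The output always contains the same characters as the input, only in a different order.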

dacy.augmenters.character.create_keyboard_augmenter(doc_level: float, char_level: float, distance=1, keyboard: Union[str, dacy.augmenters.keyboard.Keyboard] = 'QWERTY_EN') -> Callable[[spacy.language.Language, spacy.training.example.Example], Iterator[spacy.training.example.Example]]

Creates a document level augmenter using plausible typos based on keyboard distance.

Parameters
  • doc_level (float) – probability to augment document.

  • char_level (float) – probability to augment character, if document is augmented.

  • distance (int, optional) – keyboard distance. Defaults to 1.

  • keyboard (str, Keyboard, optional) – A Keyboard instance or a string naming a default keyboard. String options: “QWERTY_EN” (English QWERTY keyboard) and “QWERTY_DA” (Danish QWERTY keyboard). Defaults to “QWERTY_EN”.

Returns

The augmenter function.

Return type

Callable[[Language, Example], Iterator[Example]]
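
The idea of a plausible typo, where a character is replaced by a key that sits nearby, can be sketched with a precomputed neighbour table. NEIGHBOURS is a small hypothetical table written by hand for this example; in practice it would be derived from a keyboard layout and a distance. keyboard_typo is likewise a hypothetical name, not dacy's function.

```python
import random

# Hypothetical neighbour table covering three keys only.
NEIGHBOURS = {
    "a": ["q", "s", "z"],
    "s": ["a", "d", "w", "x"],
    "d": ["s", "f", "e", "c"],
}


def keyboard_typo(text, doc_level, char_level, neighbours=NEIGHBOURS, rng=None):
    """Replace characters with a neighbouring key, gated per document and character."""
    rng = rng or random.Random()
    if rng.random() >= doc_level:
        return text
    return "".join(
        rng.choice(neighbours[ch])
        if ch in neighbours and rng.random() < char_level
        else ch
        for ch in text
    )
```

Characters that have no entry in the table are never changed.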

dacy.augmenters.character.create_spacing_augmenter(doc_level: float, spacing_level: float) -> Callable[[spacy.language.Language, spacy.training.example.Example], Iterator[spacy.training.example.Example]]

Creates an augmenter that removes spacing.

Parameters
  • doc_level (float) – probability to augment document.

  • spacing_level (float) – probability to remove spacing, if document is augmented.

Returns

The augmenter function.

Return type

Callable[[Language, Example], Iterator[Example]]
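
Spacing removal follows the same two-stage pattern. The sketch below drops space characters from a plain string and is only an illustration; remove_spacing and rng are hypothetical names.

```python
import random


def remove_spacing(text, doc_level, spacing_level, rng=None):
    """Drop each space with probability spacing_level, if the document is selected."""
    rng = rng or random.Random()
    if rng.random() >= doc_level:
        return text
    return "".join(
        "" if ch == " " and rng.random() < spacing_level else ch for ch in text
    )
```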

dacy.augmenters.danish

Danish-specific spaCy augmenters.

dacy.augmenters.danish.create_æøå_augmenter(doc_level: float, char_level: float) -> Callable[[spacy.language.Language, spacy.training.example.Example], Iterator[spacy.training.example.Example]]

Augments æøå into their spelling variants ae, oe, aa.

Parameters
  • doc_level (float) – probability to augment document.

  • char_level (float) – probability to augment character, if document is augmented.

Returns

The desired augmenter.

Return type

Callable[[Language, Example], Iterator[Example]]
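
For a concrete picture of the æ/ø/å spelling variants, the sketch below applies the mapping to a Danish phrase. It handles lowercase letters only and works on plain strings; SPELLING_VARIANTS and augment_danish_letters are hypothetical names introduced for this example.

```python
import random

# Spelling variants named in the description above (lowercase only in this sketch).
SPELLING_VARIANTS = {"æ": "ae", "ø": "oe", "å": "aa"}


def augment_danish_letters(text, doc_level, char_level, rng=None):
    """Rewrite æ, ø and å to ae, oe and aa, gated per document and per character."""
    rng = rng or random.Random()
    if rng.random() >= doc_level:
        return text
    return "".join(
        SPELLING_VARIANTS[ch]
        if ch in SPELLING_VARIANTS and rng.random() < char_level
        else ch
        for ch in text
    )
```

With both probabilities set to 1.0, "Rødgrød med fløde på" becomes "Roedgroed med floede paa".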

dacy.augmenters.person

Augmentation functions for spaCy that augment person (PER) entities.

dacy.augmenters.person.augment_entity(entities: List[List[str]], ent_dict: Dict[str, List[str]], patterns: List[str], patterns_prob: Optional[List[float]], force_pattern_size: bool, keep_name: bool, prob: float) -> List[List[str]]

Augment entities. For each entity to augment, randomly sample a pattern and apply transformation to the entity. See create_pers_augmenter.

Examples

>>> entities = [["Lasse", "Hansen"], ["Kenneth", "Christian", "Enevoldsen"]]
>>> ent_dict = {"first_name" : ["John", "Ole"], "last_name" : ["Eriksen"]}
>>> patterns = ["fn,ln", "abbpunct,ln"]
>>> augment_entity(entities, ent_dict, patterns, None, force_pattern_size=False, keep_name=True, prob=1)
[['L.', 'Hansen'], ['K.', 'Christian', 'Enevoldsen']]
>>> augment_entity(entities, ent_dict, patterns, None, force_pattern_size=True, keep_name=True, prob=1)
[['Lasse', 'Hansen'], ['K.', 'Christian']]
>>> augment_entity(entities, ent_dict, patterns, None, force_pattern_size=True, keep_name=False, prob=1)
[['Ole', 'Eriksen'], ['J.', 'Eriksen']]
>>> augment_entity(entities, ent_dict, patterns, None, force_pattern_size=False, keep_name=False, prob=1)
[['O.', 'Eriksen'], ['John', 'Eriksen', 'Enevoldsen']]

Returns

Augmented names

Return type

List[List[str]]
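
The keep_name=True outputs in the examples above suggest how a single pattern is applied to a tokenised name. The sketch below reproduces that behaviour for one entity and one pattern. It is a simplified reading of the examples, not dacy's code: apply_pattern is a hypothetical helper, it ignores sampling from ent_dict, and it makes no claim about how names shorter than the pattern are handled.

```python
def apply_pattern(entity, pattern, force_pattern_size=False):
    """Apply a comma-separated pattern to a tokenised name, keeping the names."""
    parts = pattern.split(",")
    out = []
    for token, part in zip(entity, parts):
        if part == "abb":
            out.append(token[0])        # Lasse -> L
        elif part == "abbpunct":
            out.append(token[0] + ".")  # Lasse -> L.
        else:
            out.append(token)           # "fn" / "ln": keep the current name
    if not force_pattern_size:
        # Tokens beyond the pattern length are carried over unchanged.
        out.extend(entity[len(parts):])
    return out
```

With force_pattern_size=True the result is cut to the pattern length, which is why ['Kenneth', 'Christian', 'Enevoldsen'] becomes ['K.', 'Christian'] under the pattern "abbpunct,ln".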

dacy.augmenters.person.create_pers_augmenter(ent_dict: Dict[str, List[str]], patterns: List[str], force_pattern_size: bool, keep_name: bool, patterns_prob: Optional[List[float]] = None, prob: float = 1) -> Callable[[spacy.language.Language, spacy.training.example.Example], Iterator[spacy.training.example.Example]]

Creates a person augmenter.

Parameters
  • ent_dict (Dict[str, List[str]]) – A dictionary with keys “first_name” and “last_name”. Values should be a list of names to sample from.

  • patterns (List[str]) – The patterns to replace names with. Each pattern is a string of comma-separated options. If more than one pattern is given, one is chosen at random, optionally weighted by patterns_prob. Options: “fn” = first name; “ln” = last name; “abb” = abbreviate to first character (e.g. Lasse -> L); “abbpunct” = abbreviate to first character including punctuation (e.g. Lasse -> L.). Patterns can be arbitrarily combined, e.g. [“fn,ln”, “abbpunct,ln,ln,ln”].

  • force_pattern_size (bool) – Whether to force entities to have the same format/length as the pattern.

  • keep_name (bool) – Whether to keep the current name rather than sample from ent_dict. If True, the name is only changed where the pattern is “abb” or “abbpunct”; if False, new names are sampled from ent_dict.

  • patterns_prob (List[float], optional) – Sampling weights for the patterns. Defaults to None (equal weights).

  • prob (float, optional) – The proportion of entities to augment. Defaults to 1.

Returns

The augmenter

Return type

Callable[[Language, Example], Iterator[Example]]

>>> from dacy.augmenters.person import create_pers_augmenter
>>> from dacy.dataset import danish_names
>>> name_dict = danish_names()
>>> pers_aug = create_pers_augmenter(name_dict, patterns=["fn,ln","abbpunct,ln"], force_pattern_size=True, keep_name=False)

dacy.augmenters.keyboard

Functions for character augmentation based on keyboard layout.

class dacy.augmenters.keyboard.Keyboard(*, keyboard_array: Dict[str, List[List[str]]], shift_distance: int = 3)

Bases: pydantic.main.BaseModel

A Pydantic model for constructing a keyboard setup.

Parameters
  • keyboard_array (Dict[str, List[List[str]]]) – A dictionary describing the keyboard layout. It should contain two keys, “default” and “shifted”, holding the arrays of non-shifted and shifted keys respectively.

  • shift_distance (int) – The distance given by the shift operator. Defaults to 3.

Returns

a Keyboard object

Return type

Keyboard
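
To make the expected shape of keyboard_array concrete, the sketch below builds a small layout dictionary with “default” and “shifted” layers and defines two plain helper functions over it. The three-row layout is a hypothetical, incomplete example, and all_keys and is_shifted here are standalone functions written for illustration; they only mirror the method names documented below.

```python
# Hypothetical three-row layout; a real keyboard would list every key.
keyboard_array = {
    "default": [list("qwertyuiop"), list("asdfghjkl"), list("zxcvbnm")],
    "shifted": [list("QWERTYUIOP"), list("ASDFGHJKL"), list("ZXCVBNM")],
}


def all_keys(array):
    """Yield every key, default layer first, then the shifted layer."""
    for layer in ("default", "shifted"):
        for row in array[layer]:
            yield from row


def is_shifted(array, key):
    """Return True if the key appears on the shifted layer."""
    return any(key in row for row in array["shifted"])
```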

all_keys()

Yields all keys in the keyboard.

Yields

All keys in the keyboard.

coordinate(key: str) -> Tuple[int, int]

Gets the coordinate of a key.

Parameters

key (str) – keyboard key

Returns

key coordinate on keyboard

Return type

Tuple[int, int]

create_distance_dict(distance: int = 1) -> dict

euclidian_distance(key_a: str, key_b: str) -> int

Returns the Euclidean distance between two keys.

Parameters
  • key_a (str) – keyboard key

  • key_b (str) – keyboard key

Returns

The Euclidean distance between two keyboard keys.

Return type

int
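
A key coordinate and the Euclidean distance between two keys can be sketched on a simple grid. The sketch assumes a coordinate is a (row, column) pair in the layout array and ignores the shifted layer and shift_distance. It returns a float rather than an int, and ROWS, coordinate and euclidean_distance are hypothetical standalone names.

```python
import math

# Hypothetical default layer of a layout, one list per keyboard row.
ROWS = [list("qwertyuiop"), list("asdfghjkl"), list("zxcvbnm")]


def coordinate(key, rows=ROWS):
    """Return the (row, column) position of a key in the grid."""
    for r, row in enumerate(rows):
        if key in row:
            return r, row.index(key)
    raise ValueError(f"key {key!r} is not on the keyboard")


def euclidean_distance(key_a, key_b, rows=ROWS):
    """Return the straight-line distance between two keys on the grid."""
    (r1, c1), (r2, c2) = coordinate(key_a, rows), coordinate(key_b, rows)
    return math.hypot(r1 - r2, c1 - c2)
```

On this grid, horizontally or vertically adjacent keys such as "q" and "w" are at distance 1, while diagonal neighbours such as "q" and "s" are at distance √2.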

get_neighboors(key: str, distance: int = 1) -> Set[int]

Gets the neighbours of a key with a specified distance.

Parameters
  • key (str) – A keyboard key

  • distance (int, optional) – The Euclidean distance of neighbours. Defaults to 1.

Returns

The neighbours of a key with a specified distance.

Return type

Set[int]
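
Neighbour lookup can be sketched by collecting every other key whose grid distance falls inside the given radius. Treating distance as an upper bound (at most, rather than exactly) is an assumption made here, as are the (row, column) grid and the names ROWS and get_neighbours; the shifted layer is ignored.

```python
import math

# Hypothetical default layer of a layout, one list per keyboard row.
ROWS = [list("qwertyuiop"), list("asdfghjkl"), list("zxcvbnm")]


def get_neighbours(key, distance=1, rows=ROWS):
    """Return all other keys whose grid distance to `key` is at most `distance`."""
    pos = {k: (r, c) for r, row in enumerate(rows) for c, k in enumerate(row)}
    r0, c0 = pos[key]
    return {
        k
        for k, (r, c) in pos.items()
        if k != key and math.hypot(r - r0, c - c0) <= distance
    }
```

A larger distance widens the set: with distance=1 the key "s" has four neighbours on this grid, and with distance=1.5 the diagonal keys are included as well.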

is_shifted(key: str) -> bool

Checks whether the key is shifted.

Parameters

key (str) – keyboard key

Returns

a boolean indicating whether key is shifted.

Return type

bool

keyboard_array: Dict[str, List[List[str]]]
shift_distance: int