Presidio can be extended to support detection of new types of PII entities, and to support additional languages. These PII recognizers could be added via code or ad-hoc as part of the request.
Entity recognizers are Python objects capable of detecting one or more entities in a specific language.
In order to extend Presidio's detection capabilities to new types of PII entities,
these EntityRecognizer
objects should be added to the existing list of recognizers.
The following class diagram shows the different types of recognizer families Presidio contains.
EntityRecognizer
is an abstract class for all recognizers.RemoteRecognizer
is an abstract class for calling external PII detectors.
See more info here.LocalRecognizer
is implemented by all recognizers running within the Presidio-analyzer process.PatternRecognizer
is an class for supporting regex and deny-list based recognition logic,
including validation (e.g., with checksum) and context support. See an example here.EntityRecognizer
.AnalyzerEngine
can use the new recognizer during analysis.For simple recognizers based on regular expressions or deny-lists,
we can leverage the provided PatternRecognizer
:
from presidio_analyzer import PatternRecognizer
titles_recognizer = PatternRecognizer(supported_entity="TITLE",
deny_list=["Mr.","Mrs.","Miss"])
Calling the recognizer itself:
titles_recognizer.analyze(text="Mr. Schmidt", entities="TITLE")
Adding it to the list of recognizers:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
# Add the recognizer to the existing list of recognizers
registry.add_recognizer(titles_recognizer)
# Set up analyzer with our updated recognizer registry
analyzer = AnalyzerEngine(registry=registry)
# Run with input text
text="His name is Mr. Jones"
results = analyzer.analyze(text=text, language="en")
print(results)
Alternatively, we can add the recognizer directly to the existing registry:
from presidio_analyzer import AnalyzerEngine
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(titles_recognizer)
results = analyzer.analyze(text=text, language="en")
print(results)
For pattern based recognizers, it is possible to change the regex flags, either for
one recognizer or for all.
For one recognizer, use the global_regex_flags
parameter
in the PatternRecognizer
constructor.
For all recognizers, use the global_regex_flags
parameter in the RecognizerRegistry
constructor:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
import regex as re
registry = RecognizerRegistry(global_regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE)
engine = AnalyzerEngine(registry=registry)
engine.analyze(...)
EntityRecognizer
in codeTo create a new recognizer via code:
Create a new Python class which implements LocalRecognizer.
(LocalRecognizer
implements the base EntityRecognizer class)
This class has the following functions:
i. load: load a model / resource to be used during recognition
def load(self)
ii. analyze: The main function to be called for getting entities out of the new recognizer:
def analyze(self, text, entities, nlp_artifacts)
Notes:
Each recognizer has access to different NLP assets such as tokens, lemmas, and more.
These are given through the nlp_artifacts
parameter.
Refer to the source code for more information.
The analyze
method should return a list of RecognizerResult.
Add it to the recognizer registry using registry.add_recognizer(my_recognizer)
.
For more examples, see the Customizing Presidio Analyzer jupyter notebook.
A remote recognizer is an EntityRecognizer
object interacting with an external service. The external service could be a 3rd party PII detection service or a custom service deployed in parallel to Presidio.
Sample implementation of a RemoteRecognizer
.
In this example, an external PII detection service exposes two APIs: detect
and supported_entities
. The class implemented here, ExampleRemoteRecognizer
, uses the requests
package to call the external service via HTTP.
In this code snippet, we simulate the external PII detector by using the Presidio analyzer. In reality, we would adapt this code to fit the external PII detector we have in hand.
For an example of integrating a RemoteRecognizer
with Presidio-Analyzer, see this example.
Once a recognizer is created, it can either be added to the RecognizerRegistry
via the add_recognizer
method, or it could be added into the list of predefined recognizers.
To add a recognizer to the list of pre-defined recognizers:
recognizers
in the default_recognizers
config. Details of recognizer parameters are given Here.On how to integrate Presidio with Azure AI Language PII detection service, and a sample for a Text Analytics Remote Recognizer, refer to the Azure Text Analytics Integration document.
In addition to recognizers in code, it is possible to create ad-hoc recognizers via the Presidio Analyzer API for regex and deny-list based logic.
These recognizers, in JSON form, are added to the /analyze
request and are only used in the context of this request.
The json structure for a regex ad-hoc recognizer is the following:
{
"text": "John Smith drivers license is AC432223. Zip code: 10023",
"language": "en",
"ad_hoc_recognizers":[
{
"name": "Zip code Recognizer",
"supported_language": "en",
"patterns": [
{
"name": "zip code (weak)",
"regex": "(\\b\\d{5}(?:\\-\\d{4})?\\b)",
"score": 0.01
}
],
"context": ["zip", "code"],
"supported_entity":"ZIP"
}
]
}
The json structure for a deny-list based recognizers is the following:
{
"text": "Mr. John Smith's drivers license is AC432223",
"language": "en",
"ad_hoc_recognizers":[
{
"name": "Mr. Recognizer",
"supported_language": "en",
"deny_list": ["Mr", "Mr.", "Mister"],
"supported_entity":"MR_TITLE"
},
{
"name": "Ms. Recognizer",
"supported_language": "en",
"deny_list": ["Ms", "Ms.", "Miss", "Mrs", "Mrs."],
"supported_entity":"MS_TITLE"
}
]
}
In both examples, the /analyze
request is extended with a list of ad_hoc_recognizers
, which could be either patterns
, deny_list
or both.
Additional examples can be found in the OpenAPI spec.
Recognizers can be loaded from a YAML file, which allows users to add recognition logic without writing code. An example YAML file can be found here.
Once the YAML file is created, it can be loaded into the RecognizerRegistry
instance.
This example creates a RecognizerRegistry
holding only the recognizers in the YAML file:
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry import RecognizerRegistryProvider
recognizer_registry_conf_file = "./analyzer/recognizers-config.yml"
provider = RecognizerRegistryProvider(
conf_file=recognizer_registry_conf_file
)
registry = provider.create_recognizer_registry()
analyzer = AnalyzerEngine(registry=registry)
results = analyzer.analyze(text="My name is Morris", language="en")
print(results)
This example adds the new recognizers to the predefined recognizers in Presidio:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
yaml_file = "recognizers.yaml"
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizers_from_yaml(yaml_file)
analyzer = AnalyzerEngine()
analyzer.analyze(text="Mr. and Mrs. Smith", language="en")
Further reading:
Вы можете оставить комментарий после Вход в систему
Неприемлемый контент может быть отображен здесь и не будет показан на странице. Вы можете проверить и изменить его с помощью соответствующей функции редактирования.
Если вы подтверждаете, что содержание не содержит непристойной лексики/перенаправления на рекламу/насилия/вульгарной порнографии/нарушений/пиратства/ложного/незначительного или незаконного контента, связанного с национальными законами и предписаниями, вы можете нажать «Отправить» для подачи апелляции, и мы обработаем ее как можно скорее.
Опубликовать ( 0 )