Presidio (Origin from Latin praesidium ‘protection, garrison’) helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text and images. It is fully customizable and pluggable, can be adapted to your needs and be deployed into various environments.
!!! note "Note" Presidio is a library or SDK rather than a service. It is meant to be customized to the user's or organization's specific needs.
!!! warning "Warning" Presidio can help identify sensitive/PII data in un/structured text. However, because it is using automated detection mechanisms, there is no guarantee that Presidio will find all sensitive information. Consequently, additional systems and protections should be employed.
By developing Presidio, our goals are:
The authors and maintainers of Presidio come from the Industry Solutions Engineering team. We work with customers on various engineering problems, and have found the proper handling of private and sensitive data a recurring challenge across many customers and industries.
!!! note "Note" Microsoft Presidio is not an official Microsoft product. Usage terms are defined in the repository's license.
In a nutshell, Presidio is a library which is meant to be customized, whereas different SaaS tools for PII detection have less customization capabilities. Most of these SaaS offerings use dedicated ML models and other logic for PII detection and often have better entity coverage or accuracy than Presidio.
Based on our internal research, leveraging Presidio in parallel to 3rd party PII detection services like Azure AI Language can bring optimal results mainly when the data in hand has entity types or values not supported by the 3rd party service. (see example here).
Presidio is a suite built of several packages and building blocks:
Users can customize Presidio in multiple ways:
And more.
Presidio supports spaCy version 3+ for Named Entity Recognition, tokenization, lemmatization and more. We also support Stanza using the spacy-stanza package, and it is further possible to create PII recognizers leveraging other frameworks like transformers or Flair.
For more information, see the docs.
Pseudonymization is a de-identification technique in which the real data is replaced with fake data in a reversible way. Since there are various ways and approaches for this, we provide a simple sample which can be extended for more sophisticated usage. If you have a question or a request on this topic, please open an issue on the repo.
Presidio-structured is a new capability in Presidio for detecting PII entities in structured/semi-structured data, and is still in alpha. If you have a question, suggestion, or a contribution in this area, please reach out by opening an issue, starting a discussion or reaching us directly at presidio@microsoft.com
Presidio comes loaded with several PII recognizers (see list here), however its main strength lies in its customization capabilities to new entities, specific datasets, file types, languages or use cases.
Some PII recognizers are less specific than others. A driver's license number, for example, could be any 9-digit number. While Presidio leverages context words and other logic to improve the detection quality, it could still falsely detect non-entity values as PII entities.
In order to avoid false positives, one could try to:
Every PII identification logic would have its errors, and there is a trade-off between false positives (falsely detected text) and false negatives (PII entities which are not detected).
In addition to Presidio, we maintain a repo focused on evaluation of models and PII recognizers here. It also features a simple PII data generator.
The main Presidio modules (analyzer, anonymizer, image-redactor) can be used both as a Python package and as a dockerized REST API. See the different deployment samples for example deployments.
First, review the contribution guidelines, and feel free to reach out by opening an issue, posting a discussion or emailing us at presidio@microsoft.com
Please see the security information.
Вы можете оставить комментарий после Вход в систему
Неприемлемый контент может быть отображен здесь и не будет показан на странице. Вы можете проверить и изменить его с помощью соответствующей функции редактирования.
Если вы подтверждаете, что содержание не содержит непристойной лексики/перенаправления на рекламу/насилия/вульгарной порнографии/нарушений/пиратства/ложного/незначительного или незаконного контента, связанного с национальными законами и предписаниями, вы можете нажать «Отправить» для подачи апелляции, и мы обработаем ее как можно скорее.
Опубликовать ( 0 )