Data compliance for Data Scientists
Last updated
Key words: Personally Identifiable Information (PII), General Data Protection Regulation (GDPR), confidential data, anonymization.
As a Data Scientist, I have a task to develop an ML-model for behavioral segmentation of book shop clients. The model introduces customer tiers so that communication can be customized for each segment. During model development I need confidential and personal information about my customers to assign the tiers correctly. I do not know whether I have to anonymize the data or can use it as-is, since my model will be used internally only.
The ODD Platform provides a PII-sensitive search mechanism that helps identify confidential data using tags, labels and metadata, thereby preventing potential monetary, legal or reputational losses.
I start developing a new ML-model. Its data source must provide the following attributes:
- customer age, gender, LTV (Lifetime Value), delivery address
- customer payment details (card issuer, account's currency, card type)
- transaction timestamp, payment type (card or cash)
- preferred genres and authors
I find the following objects in the sources:
- Dim_Customers: customer full name, date of birth, delivery address
- Dim_Books: ISBN, author and genre
- Dim_Cards: customer, card, card issuer name and currency
- Fct_transactions: transaction date, book, payment type, transaction amount, quantity, currency, customer and card
- Dim_currency: currency ISO3 code, currency name
- Dim_payment_types: payment type, payment type description
I have designed ways of joining the above tables but do not know if I should anonymize any data.
I go to the ODD Platform and start searching for the tables I need.
I check objects’ tags, labels and metadata.
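The same check can be expressed in code once the tags are written down. The snippet below is only a minimal sketch and does not use any ODD Platform API: it assumes the tags seen in the platform have been copied by hand into a plain dictionary, and the tag names (`PII`, `PCI`), column names and the `sensitive_columns` helper are all hypothetical.

```python
# Hypothetical snapshot of column tags, transcribed manually from the catalog.
# Neither the structure nor the tag names come from the ODD Platform itself.
COLUMN_TAGS = {
    "Dim_Customers": {
        "full_name": {"PII"},
        "date_of_birth": {"PII"},
        "delivery_address": {"PII"},
    },
    "Dim_Cards": {
        "card": {"PII", "PCI"},
        "card_issuer_name": {"PCI"},
    },
    "Dim_Books": {"isbn": set(), "author": set(), "genre": set()},
}

# Tags that we (by assumption) treat as requiring anonymization.
SENSITIVE_TAGS = {"PII", "PCI"}


def sensitive_columns(column_tags, sensitive_tags=SENSITIVE_TAGS):
    """Return {table: sorted columns that carry at least one sensitive tag}."""
    result = {}
    for table, columns in column_tags.items():
        flagged = sorted(
            name for name, tags in columns.items() if tags & sensitive_tags
        )
        if flagged:
            result[table] = flagged
    return result


if __name__ == "__main__":
    for table, columns in sensitive_columns(COLUMN_TAGS).items():
        print(table, columns)
```

Tables with no flagged columns, such as `Dim_Books` here, are left out of the result, so the output doubles as a list of objects that need attention before modeling.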
I find out that the Dim_Customers and Dim_Cards objects cannot be stored. Customer full name, age, address and payment details should be anonymized, as these are PII protected by GDPR and PCI DSS.
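One possible way to prepare customer rows for the model is sketched below. It is an illustration under stated assumptions, not a prescribed method: the field names (`customer_id`, `delivery_city`), the secret key and the helper functions are hypothetical. It replaces the customer identifier with a keyed hash, turns the date of birth into a coarse age band, and drops the full name and street address. Note that keyed hashing is pseudonymization, which GDPR still treats as personal data, so whether this is sufficient remains a decision for the compliance team.

```python
import hashlib
import hmac
from datetime import date

# Hypothetical secret; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-me"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed HMAC-SHA256 hex digest."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def age_band(date_of_birth: date, today: date, width: int = 10) -> str:
    """Turn a date of birth into a coarse age band such as '30-39'."""
    # Subtract one year if this year's birthday has not happened yet.
    had_no_birthday_yet = (today.month, today.day) < (
        date_of_birth.month,
        date_of_birth.day,
    )
    age = today.year - date_of_birth.year - int(had_no_birthday_yet)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def anonymize_customer(row: dict, today: date) -> dict:
    """Keep only model features; full name and street address are dropped."""
    return {
        "customer_key": pseudonymize(str(row["customer_id"])),
        "age_band": age_band(row["date_of_birth"], today),
        "city": row["delivery_city"],  # hypothetical field: address reduced to city
    }


if __name__ == "__main__":
    sample = {
        "customer_id": 42,
        "full_name": "Jane Doe",
        "date_of_birth": date(1990, 5, 17),
        "delivery_city": "Berlin",
    }
    print(anonymize_customer(sample, today=date(2024, 1, 1)))
```

Because the hash is keyed and deterministic, the same customer maps to the same `customer_key` across tables, so joins with transactions still work without exposing the original identifier.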
Result: the ML-model meets GDPR, PCI DSS and the company’s compliance standards.