Learning | Perched on the Shoulders of Giants…

Everything you need to know about k-anonymity, l-diversity and t-closeness, with real step-by-step examples and tips for getting started with the ARX anonymization tool.

Research & Methodology · Reading time: ~15 min · Level: Intermediate · Tool: ARX v3.9+

If you work with patient data, surveys or any dataset that contains personal information, anonymizing it correctly is a legal and ethical obligation. ARX is free, open source and widely used in clinical and biomedical research. This guide explains how it works and how to use it from scratch.

What is ARX and who is it for?

ARX (ARX Data Anonymization Tool) is an open-source tool developed by Fabian Prasser and collaborators, designed specifically for anonymizing tabular data. It offers a graphical interface (GUI) and a Java API, which makes it accessible both to researchers with no programming background and to data science teams that want to automate the process.

It is especially popular in:

Clinical and biomedical research: patient datasets, medical records, epidemiological studies
Social sciences: surveys, socioeconomic data, censuses
Doctoral theses that work with personal data and must comply with the GDPR
Publication of open datasets in institutional repositories or Zenodo

Where to start

Download ARX for free from arx.deidentifier.org/downloads. It requires Java 11 or later. The same page offers an example project that you can open directly to explore the interface.

Step 1: Classify the dataset's attributes

Before touching any parameter, keep in mind that not every column in your dataset carries the same risk. ARX distinguishes five types of attributes:

Type in ARX	Risk	Automatic action	Typical example
Identifying	Very high	Removes the column from the output dataset	National ID (DNI), social security number, full name
Quasi-identifying (QI)	Medium (dangerous in combination)	Generalizes or suppresses	Age, postal code, sex
Sensitive	High through inference	Protected by the chosen privacy model	Diagnosis, salary, addiction
Insensitive	Negligible	Kept unchanged	Generic medication, blood type
Response variable	Context-dependent	Treated as insensitive by default	Clinical outcome variable
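
As a rough illustration of what this classification means in practice, the sketch below tags the columns of a toy record and drops the identifying ones. It is plain Python, not the ARX API; the column names, the sample record and both helper functions are hypothetical.

```python
# Hypothetical column names; the type labels mirror the table above.
ATTRIBUTE_TYPES = {
    "national_id": "identifying",
    "age": "quasi-identifying",
    "postal_code": "quasi-identifying",
    "sex": "quasi-identifying",
    "diagnosis": "sensitive",
    "blood_type": "insensitive",
}

def columns_by_type(attribute_types, wanted):
    """Return the column names tagged with the given attribute type."""
    return [col for col, kind in attribute_types.items() if kind == wanted]

def drop_identifying(record, attribute_types):
    """Remove identifying columns from one record (a dict)."""
    return {col: val for col, val in record.items()
            if attribute_types.get(col) != "identifying"}

# Invented sample record, for illustration only.
record = {"national_id": "X1234567", "age": 29, "postal_code": "08001",
          "sex": "F", "diagnosis": "Heart failure", "blood_type": "A+"}

quasi_identifiers = columns_by_type(ATTRIBUTE_TYPES, "quasi-identifying")
published = drop_identifying(record, ATTRIBUTE_TYPES)
print(quasi_identifiers)
print(published)
```

Writing the classification down as data like this also gives you something concrete to review with your supervisor before opening the GUI.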

Why are quasi-identifiers so dangerous?

Latanya Sweeney showed in 2000 that the combination of date of birth, sex and ZIP code uniquely identifies an estimated 87% of the US population. In Spain the situation is similar: a 5-digit postal code combined with age and sex can be enough to identify individuals in sparsely populated areas.

Real example: the combination (linkage) attack

Imagine you publish a list of hospital discharges with age, postal code and sex (no name or ID number). An attacker can cross-reference this list with the electoral roll of the same district and identify most of the patients. Removing direct identifiers is not enough.
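
The attack can be sketched in a few lines of plain Python. All records and names below are invented, and the `link` helper is a hypothetical illustration: it reports a re-identification only when a quasi-identifier combination is unique in both the released list and the attacker's auxiliary list.

```python
from collections import defaultdict

# Released data: no names, only quasi-identifiers plus the diagnosis.
discharges = [
    {"age": 29, "postal_code": "08001", "sex": "F", "diagnosis": "Heart failure"},
    {"age": 45, "postal_code": "08010", "sex": "M", "diagnosis": "Atrial fibrillation"},
    {"age": 47, "postal_code": "08010", "sex": "M", "diagnosis": "Atrial fibrillation"},
]
# Auxiliary data the attacker already holds (names are invented).
voter_roll = [
    {"name": "Anna", "age": 29, "postal_code": "08001", "sex": "F"},
    {"name": "Marc", "age": 45, "postal_code": "08010", "sex": "M"},
    {"name": "Pere", "age": 45, "postal_code": "08010", "sex": "M"},
]
QI = ("age", "postal_code", "sex")

def link(released, auxiliary, qi=QI):
    """Map a name to a diagnosis when the QI match is unique on both sides."""
    names_by_key = defaultdict(list)
    for person in auxiliary:
        names_by_key[tuple(person[c] for c in qi)].append(person["name"])
    diagnoses_by_key = defaultdict(list)
    for row in released:
        diagnoses_by_key[tuple(row[c] for c in qi)].append(row["diagnosis"])
    return {names_by_key[key][0]: diagnoses[0]
            for key, diagnoses in diagnoses_by_key.items()
            if len(diagnoses) == 1 and len(names_by_key.get(key, [])) == 1}

reidentified = link(discharges, voter_roll)
print(reidentified)  # {'Anna': 'Heart failure'}
```

Anna is exposed because her combination is unique. The two 45-year-old men on the roll shield each other, which is exactly the effect k-anonymity tries to guarantee for everyone.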

Step 2: Understand the attacker models

ARX does not apply generic protection: it asks you to decide which attacker profile you want to protect against. There are three clearly distinct models:

Model	What the attacker knows and wants	When to use it
Prosecutor	Knows that a specific individual is in the dataset and wants to find that person's record	Very sensitive data, legal obligation to protect specific individuals
Journalist	Does not know in advance whether the target is in the dataset and wants to find someone who matches a profile	Publication as open data or in academic repositories
Marketer	Knows the dataset exists and wants to re-identify as many individuals as possible	Mass public release, datasets for ML

For doctoral research with clinical data, the journalist model is the recommended minimum. If the dataset contains especially sensitive data (mental health, HIV, addictions), consider the prosecutor model.

Step 3: k-Anonymity, the foundation

k-anonymity is the baseline privacy model you will almost always apply. The idea is simple: no individual should be distinguishable from at least k-1 other people in the dataset.

To achieve this, ARX generalizes the quasi-identifiers (replacing exact values with ranges or broader categories) until every combination of QI values appears at least k times. These groups are called equivalence classes.

Worked example: from exact values to equivalence classes

We have 6 patients. QI columns: Age, Postal Code, Sex. Sensitive attribute: Diagnosis.

ID	Original age	Age (k=2)	Original postal code	Postal code (k=2)	Sex	Diagnosis
P001	29	20–39	08001	080**	F	Heart failure
P002	31	20–39	08001	080**	F	Hypertension
P003	45	40–59	08010	080**	M	Atrial fibrillation
P004	47	40–59	08010	080**	M	Atrial fibrillation
P005	52	40–59	08015	080**	F	Ischemic heart disease
P006	54	40–59	08015	080**	F	Hypertension

Ten-year age bands would leave P001 (29) and P002 (31) in different groups, so the age is generalized to 20-year bands. P001 and P002 now form one class, P003 and P004 another, and P005 and P006 a third. With k=2, no individual can be told apart from the other member of their class.
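
A minimal sketch of the check behind this example, in plain Python rather than the ARX API. The helper names and the `age_width` parameter are illustrative assumptions. The sketch uses 20-year age bands so that the patients aged 29 and 31 land in the same class, and it also shows that 10-year bands would not be enough.

```python
from collections import Counter

# The six patients from the example: (id, age, postal code, sex).
patients = [
    ("P001", 29, "08001", "F"), ("P002", 31, "08001", "F"),
    ("P003", 45, "08010", "M"), ("P004", 47, "08010", "M"),
    ("P005", 52, "08015", "F"), ("P006", 54, "08015", "F"),
]

def generalize_age(age, width):
    """Replace an exact age with a band such as '20-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_postal_code(code, keep=3):
    """Keep the first `keep` digits and mask the rest with '*'."""
    return code[:keep] + "*" * (len(code) - keep)

def equivalence_classes(rows, age_width):
    """Count the records that share each generalized QI combination."""
    return Counter(
        (generalize_age(age, age_width), generalize_postal_code(cp), sex)
        for _, age, cp, sex in rows
    )

def is_k_anonymous(rows, k, age_width):
    """True if every equivalence class holds at least k records."""
    return min(equivalence_classes(rows, age_width).values()) >= k

print(is_k_anonymous(patients, k=2, age_width=20))  # True
print(is_k_anonymous(patients, k=2, age_width=10))  # False: 29 and 31 are split
```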

Which value of k should you choose?

k	Protection	Information loss	Recommended for…
k = 2	Minimal	Low	Internal use in closed consortia
k = 3–5	Good	Moderate	Standard academic publication
k ≥ 10	High	High	Highly sensitive data, strict regulatory settings such as HIPAA

Warning: the k-anonymity trap

If every member of an equivalence class has the same diagnosis, an attacker can infer it without identifying anyone. Example: if all the men in their forties from Barcelona in the dataset were hospitalized for atrial fibrillation, knowing that someone belongs to that group already reveals the diagnosis. This is where l-diversity comes in.

Step 4: l-Diversity, protecting the diagnosis

l-diversity adds a requirement on the sensitive attribute: within each equivalence class there must be at least l well-represented values of the sensitive attribute. This prevents an attacker from inferring a diagnosis, an addiction or any other sensitive value even without knowing which record belongs to the individual.

The three main variants

Distinct l-diversity: the simplest. At least l distinct values per class. Sufficient when all values of the sensitive attribute are equally sensitive.
Entropy l-diversity: the most robust. The Shannon entropy of the sensitive attribute's distribution in each class must be ≥ log(l). It catches cases where one value dominates even though l distinct values are present.
Recursive (c, l)-diversity: intermediate. The count of the most frequent value must be less than c times the combined count of the l-th most frequent value and all rarer ones.
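
The distinct and recursive variants can be sketched as follows. This is plain Python using the textbook definitions, not ARX code. The function names are illustrative, and the abbreviations AF, HTN and IHD stand for atrial fibrillation, hypertension and ischemic heart disease.

```python
from collections import Counter

def distinct_l_diverse(values, l):
    """At least l distinct sensitive values in the class."""
    return len(set(values)) >= l

def recursive_cl_diverse(values, c, l):
    """r1 < c * (r_l + r_(l+1) + ... + r_m), with counts sorted descending."""
    counts = sorted(Counter(values).values(), reverse=True)
    if len(counts) < l:
        return False
    return counts[0] < c * sum(counts[l - 1:])

homogeneous = ["AF", "AF"]                   # the k-anonymity trap
dominated = ["AF"] * 8 + ["HTN", "IHD"]      # 3 values, but one dominates
balanced = ["AF", "AF", "IHD", "HTN"]

print(distinct_l_diverse(homogeneous, 2))        # False
print(distinct_l_diverse(dominated, 3))          # True
print(recursive_cl_diverse(dominated, c=3, l=2)) # False: 8 is not < 3 * (1 + 1)
print(recursive_cl_diverse(balanced, c=3, l=2))  # True: 2 < 3 * (1 + 1)
```

The dominated class passes the distinct check yet fails the recursive one, which is why the distinct variant alone can give a false sense of safety.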

Computing entropy l-diversity: a step-by-step example

Take a class with 4 patients and these diagnoses: atrial fibrillation (×2), ischemic heart disease (×1), hypertension (×1). Let us check whether it satisfies entropy 2-diversity:

Calculation

# Distribution of the sensitive attribute in the class
Atrial fibrillation:       2/4 = 0.50
Ischemic heart disease:    1/4 = 0.25
Hypertension:              1/4 = 0.25

# Shannon entropy
H = -(0.50 × log₂(0.50)) - (0.25 × log₂(0.25)) - (0.25 × log₂(0.25))
H = 0.50 + 0.50 + 0.50 = 1.50 bits

# Requirement for entropy 2-diversity: H ≥ log₂(2) = 1.0
1.50 ≥ 1.0  →  SATISFIED ✓
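
The same calculation as a short Python sketch (plain Python, not the ARX API; the function names are illustrative). It reproduces the 1.50 bits above and shows that the same class would not pass entropy 3-diversity, since log₂(3) ≈ 1.585.

```python
import math
from collections import Counter

def shannon_entropy_bits(values):
    """Shannon entropy (base 2) of the values observed in one class."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def entropy_l_diverse(values, l):
    """Entropy l-diversity: H >= log2(l)."""
    return shannon_entropy_bits(values) >= math.log2(l)

diagnoses = ["Atrial fibrillation", "Atrial fibrillation",
             "Ischemic heart disease", "Hypertension"]

print(shannon_entropy_bits(diagnoses))   # 1.5
print(entropy_l_diverse(diagnoses, 2))   # True
print(entropy_l_diverse(diagnoses, 3))   # False, although 3 values are present
```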

Recommendation for clinical data

Use entropy l-diversity with l = 3 for clinical datasets with highly prevalent diagnoses (such as hypertension or diabetes). The distinct variant can be insufficient if a single diagnosis accounts for 70–80% of the records in a class.

Step 5: t-Closeness, the last shield

Even with l-diversity, an equivalence class can have a distribution of diagnoses that differs sharply from the overall distribution in the dataset. If a diagnosis that is very rare in the general population is very frequent in one class, an attacker who knows that someone belongs to that class can infer the diagnosis with high probability.

t-closeness requires that the distribution of the sensitive attribute within each class differ from the global distribution by no more than t. The distance is measured with the Earth Mover's Distance (EMD).

Visual intuition

Think of it in terms of sand: the global distribution is the shape of the beach (30% HF, 20% AF, 20% HTN, 30% IHD, standing for heart failure, atrial fibrillation, hypertension and ischemic heart disease). Each class is a glass filled with sand. t-closeness requires that the shape of the sand in the glass resemble that of the beach closely enough.

Example: computing t for a class

# Class CE-6: {P008, P009}, men aged 60–70
# Diagnoses: ischemic heart disease (50%), heart failure (50%)

Global reference distribution:
  HF: 30%   AF: 20%   HTN: 20%   IHD: 30%

Local distribution in CE-6:
  HF: 50%   AF: 0%    HTN: 0%    IHD: 50%

# EMD with equal ground distance between categories = sum of absolute differences / 2
|0.50-0.30| + |0.00-0.20| + |0.00-0.20| + |0.50-0.30| = 0.80
EMD = 0.80 / 2 = 0.40

Satisfies t=0.20? NO (0.40 > 0.20)
Satisfies t=0.50? YES (0.40 ≤ 0.50)
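
A small Python sketch of this calculation. It assumes the equal-distance variant of EMD used in the example, where every pair of categories is equally far apart, so the distance reduces to half the sum of absolute differences. The function names and the HF, AF, HTN, IHD keys are illustrative.

```python
def emd_equal_distance(local, reference):
    """EMD between two categorical distributions with equal ground distance."""
    categories = set(local) | set(reference)
    return sum(abs(local.get(c, 0.0) - reference.get(c, 0.0))
               for c in categories) / 2

def satisfies_t_closeness(local, reference, t):
    """True if the class distribution is within distance t of the reference."""
    return emd_equal_distance(local, reference) <= t

global_dist = {"HF": 0.30, "AF": 0.20, "HTN": 0.20, "IHD": 0.30}
class_ce6 = {"HF": 0.50, "IHD": 0.50}

distance = emd_equal_distance(class_ce6, global_dist)
print(round(distance, 2))                                    # 0.4
print(satisfies_t_closeness(class_ce6, global_dist, 0.20))   # False
print(satisfies_t_closeness(class_ce6, global_dist, 0.50))   # True
```

For ordered or hierarchical sensitive attributes other ground distances apply, and the result would differ from this simple formula.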

Which value of t should you choose?

Value of t	Protection	Information loss	Use case
t = 0.05–0.10	Very high	Very high	Extremely sensitive data (HIV, mental health)
t = 0.15–0.20	High	Moderate–High	Clinical data for public release
t = 0.25–0.35	Moderate	Low–Moderate	Research data for internal use

Step 6: Using the ARX GUI step by step

ARX organizes the workflow into four visual perspectives. You go through them in order: Configuration → Exploration → Utility → Risk.

Installation in 2 minutes

Download: Go to arx.deidentifier.org/downloads and get the ZIP of the latest stable version.
Check Java: Open a terminal and type java -version. You need Java 11+. If you do not have it, download it from adoptium.net (free).
Run: Double-click arxanonymizer.jar. If it does not open, run java -jar arxanonymizer.jar from the terminal.

Perspective 1: Configuration

Here you define the dataset and the generalization hierarchies.

Import the CSV: File > Import Data > CSV File. Set the separator (comma) and the encoding (UTF-8), and enable “First row contains header”.
Assign types: Right-click each column > Attribute type. Mark the QIs as Quasi-identifying, the diagnosis as Sensitive and the direct identifiers as Identifying.
Create hierarchies: For each QI, right-click > Edit Hierarchy. For age, use “Interval-based” with 10-year intervals. For the postal code, use “Masking-based” (080** → 08*** → *).
Configure the model: Right-hand panel > Add criterion. Add k-Anonymity (k=5), Entropy l-Diversity (l=3, attribute=Diagnosis) and, optionally, t-Closeness (t=0.20).

jerarquia_cp.csv: example

# Format: original_value, level1, level2, full_suppression
08001,080**,08***,*
08002,080**,08***,*
08010,080**,08***,*
08015,080**,08***,*
08020,080**,08***,*
08030,080**,08***,*
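
Rows like these are tedious to type by hand for a long list of postal codes. The sketch below generates them in plain Python. The `masking_hierarchy` helper and its `levels` parameter are hypothetical, and the comma-separated output simply mirrors the example above.

```python
def masking_hierarchy(value, levels=(3, 2, 0)):
    """Return [value, level1, level2, ...], keeping fewer leading digits each time.

    A level of 0 means full suppression and is written as a single '*'.
    """
    row = [value]
    for keep in levels:
        if keep == 0:
            row.append("*")
        else:
            row.append(value[:keep] + "*" * (len(value) - keep))
    return row

postal_codes = ["08001", "08010", "08015"]
lines = [",".join(masking_hierarchy(code)) for code in postal_codes]
print("\n".join(lines))
# 08001,080**,08***,*
# 08010,080**,08***,*
# 08015,080**,08***,*
```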

Perspective 2: Exploration (the lattice)

ARX builds a lattice of transformations: a graph in which each node is one possible combination of generalization levels. Green nodes satisfy the criteria; red ones do not. The recommended node (the best balance of utility and privacy) is highlighted.
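
To make the idea of a lattice concrete, the sketch below enumerates every combination of generalization levels for two quasi-identifiers and marks which nodes are k-anonymous. It is a brute-force illustration in plain Python with made-up level definitions; it does not reproduce how ARX actually searches the lattice.

```python
from collections import Counter
from itertools import product

# Generalization levels per attribute, from exact value (0) to suppressed (3).
AGE_LEVELS = [
    lambda a: str(a),
    lambda a: f"{a // 10 * 10}-{a // 10 * 10 + 9}",
    lambda a: f"{a // 20 * 20}-{a // 20 * 20 + 19}",
    lambda a: "*",
]
CP_LEVELS = [
    lambda c: c,
    lambda c: c[:3] + "**",
    lambda c: c[:2] + "***",
    lambda c: "*",
]

# (age, postal code) of six illustrative records.
records = [(29, "08001"), (31, "08001"), (45, "08010"),
           (47, "08010"), (52, "08015"), (54, "08015")]

def min_class_size(age_level, cp_level):
    """Size of the smallest equivalence class at one lattice node."""
    classes = Counter((AGE_LEVELS[age_level](a), CP_LEVELS[cp_level](c))
                      for a, c in records)
    return min(classes.values())

def lattice(k):
    """Map each node (age_level, cp_level) to True (green) or False (red)."""
    nodes = product(range(len(AGE_LEVELS)), range(len(CP_LEVELS)))
    return {node: min_class_size(*node) >= k for node in nodes}

nodes = lattice(k=2)
green = sorted(node for node, ok in nodes.items() if ok)
print(green[0])  # (2, 0): 20-year age bands with exact postal codes
```

In this toy data every node with exact ages or 10-year bands is red, because 29 and 31 never share a class, which matches the worked k=2 example.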

Practical tip

If the lattice has many red nodes, your dataset is too small or the QIs are too specific for the chosen k. Try lowering k one step or making the hierarchies coarser (20-year age intervals instead of 10).

Perspective 3: Utility

Check that the anonymized dataset is still valid for your analyses. ARX shows comparative histograms (original vs. anonymized) and the key metrics:

Metric	Ideal value	Warning if…
Non-Uniform Entropy Loss	< 30%	> 50%: statistical analyses may be invalid
Suppressed records	< 10%	> 20%: review the hierarchies or lower k
Average class size	Close to k	Much larger than k: possible overprotection
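
Two of these metrics are easy to recompute yourself as a sanity check on an exported file. The sketch below is plain Python and assumes that a suppressed record shows up with every quasi-identifier replaced by "*"; the function name and the sample rows are hypothetical. Non-uniform entropy loss is left out because it depends on the original data and the hierarchies.

```python
from collections import Counter

def utility_summary(rows):
    """rows: generalized QI tuples, one per record of the anonymized output."""
    kept = [row for row in rows if any(value != "*" for value in row)]
    classes = Counter(kept)
    return {
        "suppressed_fraction": (len(rows) - len(kept)) / len(rows),
        "average_class_size": len(kept) / len(classes) if classes else 0.0,
    }

# Illustrative output: two classes plus one fully suppressed record.
anonymized = ([("20-39", "080**", "F")] * 9
              + [("40-59", "080**", "M")] * 10
              + [("*", "*", "*")])

summary = utility_summary(anonymized)
print(summary)  # {'suppressed_fraction': 0.05, 'average_class_size': 9.5}
```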

Perspective 4: Risk

This is where you show the research ethics committee (CEI) and the reviewers that the protection is real. ARX computes the three re-identification risks:

Indicator	Model	Acceptable for publication
Highest risk (individual)	Prosecutor	< 0.33
Success rate (re-id)	Journalist	< 0.20
Expected risk	Marketer	< 0.10
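
As a rough cross-check, two of these figures can be estimated from class sizes alone. The sketch below is plain Python with hypothetical names and uses the common sample-based estimates: the highest risk is 1 divided by the smallest class size, and the average risk is the number of classes divided by the number of records. The journalist figure is omitted because it needs information about the underlying population.

```python
from collections import Counter

def reidentification_risks(rows):
    """rows: generalized QI tuples, one per record of the anonymized output."""
    sizes = list(Counter(rows).values())
    return {
        "highest_risk": 1 / min(sizes),            # worst-off record
        "average_risk": len(sizes) / sum(sizes),   # mean of 1/size over records
    }

# Illustrative output with three classes of sizes 5, 5 and 10.
anonymized = ([("20-39", "080**", "F")] * 5
              + [("40-59", "080**", "M")] * 5
              + [("40-59", "080**", "F")] * 10)

risks = reidentification_risks(anonymized)
print(risks)  # {'highest_risk': 0.2, 'average_risk': 0.15}
```

With these numbers the highest risk is under the 0.33 threshold in the table, while the average risk of 0.15 is above the 0.10 suggested for the marketer model.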

Summary and final checklist

Before you consider the anonymization process finished, check every item on this list:

#	Action	Documented?
1	Attributes classified and validated with your supervisor and the ethics committee (CEI)	☐
2	Generalization hierarchies created and exported as CSV	☐
3	Privacy model configured: k ≥ 5 + l ≥ 3 + t ≤ 0.20	☐
4	Information loss < 30% and suppressed records < 10%	☐
5	Re-identification risks within the acceptable thresholds	☐
6	.arx file saved for reproducibility	☐
7	Methods section on anonymization written up	☐

Recommended combination for clinical data

k = 5 (k-anonymity) for standard academic publication
Entropy 3-diversity for diagnoses with a non-uniform distribution
t = 0.20 (t-closeness) for data with rare or highly prevalent diagnoses
Attacker model: journalist at a minimum for open publication

Keep learning about data privacy in research

The full document, with guided examples, exercises and templates, is free to download.


References: Prasser F. et al. (2020). Flexible Data Anonymization Using ARX. Software: Practice and Experience. · Sweeney L. (2002). k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. · Li N. et al. (2007). t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. IEEE ICDE. · Machanavajjhala A. et al. (2007). l-Diversity: Privacy Beyond k-Anonymity. ACM TKDD.

by Jean-Christophe Baillie, President at Novaquark

What is AI and what is not AI is, to some extent, a matter of definition. There is no denying that AlphaGo and similar deep learning approaches have managed to solve quite hard computational problems in recent years. But is it going to get us to AI, in the sense of a fully general intelligent machine, or “AGI”? Not quite, and here is why.

One of the key issues when building an artificial general intelligence is that it will have to make sense of the world for itself, to develop its own, internal meaning for everything it will encounter, hear, say and do. Failing to do this, you end up with today’s AI programs, where all the meaning is actually provided by the designer of the application: the AI basically doesn’t understand what is going on and has a narrow domain of expertise.

The problem of meaning is perhaps the most fundamental problem of AI and has still not been solved today. One of the first to express it was Harnad, in his 1990 paper about “The Symbol Grounding Problem”. Even if you don’t believe we are explicitly manipulating symbols, which is indeed questionable, the problem remains: the grounding of whatever representation exists inside the system into the real world outside.

To be more specific, the problem of meaning leads us to four sub-problems:

How do you structure the information the agent (human or AI) is receiving from the world?
How do you link this structured information to the world, or, taking the above definition, how do you build “meaning” for the agent?
How do you synchronize this meaning with other agents? (otherwise, there is no communication possible and you get an incomprehensible isolated form of intelligence)
Why does the agent do something at all rather than nothing? How to set all this into motion?

The first problem, about structuring information, is very well addressed by deep learning and similar unsupervised learning algorithms, used for example in the AlphaGo program. We have made tremendous progress in this area, in part because of the recent gain in computing power and the use of GPUs (Graphics Processing Units), which are especially good at parallelizing information processing. What these algorithms do is take a signal that is extremely redundant and expressed in a high-dimensional space, and reduce it to a low-dimensional signal, minimizing the loss of information in the process. In other words, they “capture” what is important in the signal, from an information processing point of view.

The second problem, about linking information to the real world, or creating “meaning”, is fundamentally tied to robotics, because you need a body to interact with the world, and you need to interact with the world to build this link. That’s why I often say that there is no AI without robotics (while there can be pretty good robotics without AI, but that’s another story). This realization is often called the “embodiment problem” and most researchers in AI now agree that intelligence and embodiment are tightly coupled issues. Every different body has a different form of intelligence, and you see that pretty clearly in the animal kingdom. It starts with simple things like making sense of your own body parts and how you can control them to produce desired effects in the observed world around you, and how you build your own notion of space, distance, color, etc. This has been studied extensively by researchers like Kevin O’Regan with his “sensorimotor theory”. It is just a first step, however, because you then have to build up more and more abstract concepts on top of those grounded sensorimotor structures. We are not quite there yet, but that’s the current status of research on that matter.

The third problem is fundamentally the question of the origin of culture. Some animals show some simple form of culture, even transgenerational acquired competencies, but it is very limited, and only humans have reached the threshold of exponentially growing acquisition of knowledge that we call culture. Culture is the essential catalyst of intelligence, and an AI without the capability to interact culturally would be nothing more than an academic curiosity. However, culture cannot be hand-coded into a machine; it must be the result of a learning process. The best place to start trying to understand this process is developmental psychology, with the work of Piaget or Tomasello, who studied how children acquire cultural competency. It gave birth to a new discipline in robotics called “developmental robotics”, which takes the child as a model (as illustrated by the iCub robot).

It is also closely linked to the study of language learning, which is one of the topics I mostly worked on as a researcher myself. The work of people like Luc Steels and many others has shown that we can see language acquisition as an evolutionary process: the agent creates new meanings by interacting with the world, uses them to communicate with other agents, and selects the most successful structures that help to communicate (that is, to achieve joint intentions, mostly). After hundreds of rounds of trial and error, just like with biological evolution, the system evolves the best meanings and their syntactic/grammatical translation. This process has been tested experimentally and shows striking resemblances with how natural languages evolve and grow. Interestingly, it accounts for instantaneous learning, when a concept is acquired in one shot, something that heavily statistical models like deep learning are not able to explain. Several research labs are now trying to go further into acquiring grammar, gestures and more complex cultural conventions by this means, in particular the AI Lab that I founded at Aldebaran.

Finally, the fourth problem deals with what is called “intrinsic motivation”. Why does the agent do anything at all, rather than nothing? Survival requirements are not enough to explain human behavior. Even perfectly fed and secure, humans don’t just sit idle until hunger comes back. There is more: they explore, they try, and this seems to come from some kind of intrinsic curiosity. Researchers like Pierre-Yves Oudeyer have shown that simple mathematical formulations of curiosity, as an expression of the tendency of the agent to maximize its rate of learning, are enough to account for incredibly complex and surprising behaviors (see the Playground experiment done at Sony CSL). It seems that something of the sort is needed inside the system to drive its desire to go through the previous three steps: structure the information of the world, connect it to its body and create meaning, and then select the most communicationally efficient one to create a joint culture that enables cooperation. This is, in my view, the program of AGI.

Again, the advances of deep learning and the recent success of this kind of AI at the game of Go are very good news, because lots of very useful applications can be imagined from there to help medical research, industry, progress in environmental preservation and many other issues. But this is only one part of the problem, as I have tried to show here. I don’t believe deep learning is the silver bullet that will get us to true AI, in the sense of a machine that is capable of learning to live in the world, interacting naturally with us, deeply understanding the complexity of our emotions and cultural biases, and ultimately helping us to make a better world.

This article originally appeared on LinkedIn


ARX Anonymization Tool: a practical guide to anonymizing research data