I ran across an interesting blog post that describes an algorithm that clusters data in a way that maintains differential privacy. The post is titled "Practical Differentially Private Clustering". See https://ai.googleblog.com/2021/10/practical-differentially-private.html.

Differential privacy is complicated and cannot be explained quickly. But loosely, one interpretation is that an algorithm is differentially private if, when you look at its output, you can't tell whether a particular person's information was used in the computation. For example, suppose you have an artificially tiny dataset of just 4 items:

  ID    Name    Age
  -----------------
  001   Adams   40
  002   Baker   30
  003   Chang   50
  004   Dunne   20

And suppose you can query for the average age of any two or more people. If the average age of people 001-003 is 40.0 (so the sum of their ages is 120.0) and the average age of people 001-004 is 35.0 (so the sum of their ages is 140.0), then the age of person 004 is the difference of the sums = 140.0 - 120.0 = 20.0 years.
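The differencing attack is easy to verify with a few lines of code. Here's a minimal sketch; the data dictionary and the avg_age() query function are just illustrations of the scenario, not a real query interface.

import numpy as np  # not strictly needed here, but used in later sketches

# made-up dataset from the table above
ages = {"001": 40, "002": 30, "003": 50, "004": 20}

def avg_age(ids):
    # hypothetical query interface: only averages of 2+ people are allowed
    assert len(ids) >= 2
    return sum(ages[i] for i in ids) / len(ids)

avg_123 = avg_age(["001", "002", "003"])          # 40.0 -> sum of ages = 120.0
avg_1234 = avg_age(["001", "002", "003", "004"])  # 35.0 -> sum of ages = 140.0

age_004 = 4 * avg_1234 - 3 * avg_123  # 140.0 - 120.0 = 20.0
print(age_004)  # 20.0 -- person 004's age leaks even though only averages were queried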

The standard way to make an algorithm differentially private is to add random noise. For example, the returned average age might have a random value, which can be positive or negative, added to it. So, instead of the average age of people 001-003 being exactly 40.0, the returned result might be 38.3 or 41.7 or something similar.
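One common way to generate that noise is the Laplace mechanism, where the noise is drawn from a Laplace distribution scaled by the query's sensitivity and a privacy parameter epsilon. Here's a minimal sketch; the sensitivity and epsilon values are just placeholders, not recommendations.

import numpy as np

rng = np.random.default_rng(1)

true_avg = 40.0     # exact average age of people 001-003
sensitivity = 1.0   # assumed sensitivity of the query (illustrative)
epsilon = 0.5       # privacy budget: smaller epsilon = more noise = more privacy

# add Laplace noise scaled by sensitivity / epsilon to the true answer
noisy_avg = true_avg + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
print(noisy_avg)    # something like 38.3 or 41.7 rather than exactly 40.0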


I grabbed some screenshots of an animation on the blog site that describes the clustering algorithm.

For data clustering, you group data items into k bins. Each bin has a representative value of some sort that represents all the data items in the cluster. Standard clustering algorithms, such as k-means or DBSCAN, are not differentially private. As a degenerate case, suppose an extreme value becomes its own k-means cluster of just one item -- you then know everything about that item from the cluster mean.
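Here's a tiny sketch of that degenerate case using scikit-learn's standard (non-private) k-means. The data values are made up; the point is that the outlier typically ends up as a singleton cluster whose center is the item's exact value.

import numpy as np
from sklearn.cluster import KMeans

# made-up 1D data with one extreme value
ages = np.array([[40.0], [30.0], [50.0], [20.0], [95.0]])  # 95.0 is an outlier

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)
print(km.cluster_centers_)
# one center typically lands exactly on [95.] -- the cluster mean exposes the outlier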

The differentially private clustering algorithm presented in the blog post first generates a coreset -- a small set of weighted items (with noise added) that represent the source data items well. The coreset is created using a technique called locality-sensitive hashing (LSH). The process is differentially private because calibrated random noise is added when the coreset is built. After the coreset is created, any standard clustering algorithm can be applied to it, and the overall result is differentially private. Simple and elegant.
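To make the coreset-then-cluster idea concrete, here is a highly simplified, hypothetical sketch. This is NOT the algorithm from the blog post: the random-hyperplane bucketing is only a crude stand-in for LSH, and the noise level is a fixed placeholder rather than noise calibrated to give a real privacy guarantee.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# made-up 2D data with three loose groups
X = rng.normal(size=(1000, 2)) + rng.choice([-4.0, 0.0, 4.0], size=(1000, 1))

# 1. LSH-like step: bucket each point by the signs of random hyperplane projections
n_planes = 4
planes = rng.normal(size=(2, n_planes))
buckets = (X @ planes > 0).astype(int) @ (2 ** np.arange(n_planes))  # bucket id per point

# 2. Build a noisy weighted coreset: one noisy representative and weight per bucket
noise_scale = 0.1  # placeholder noise level, not a calibrated DP guarantee
reps, weights = [], []
for b in np.unique(buckets):
    pts = X[buckets == b]
    reps.append(pts.mean(axis=0) + rng.laplace(scale=noise_scale, size=2))
    weights.append(len(pts) + rng.laplace(scale=noise_scale))
reps = np.array(reps)
weights = np.clip(np.array(weights), 1e-6, None)  # keep weights positive

# 3. Apply any standard clustering algorithm to the small weighted coreset
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(reps, sample_weight=weights)
print(km.cluster_centers_)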

Note that, like all differentially private algorithms, the clustering results are only approximations because of the random noise that is added. But that's the price you pay for differential privacy.



Some of my friends are very knowledgeable about jewelry. One of them pointed out that advertising for necklaces almost always uses "differentially private" models, meaning that the models' faces are deliberately omitted. This is probably so that viewers focus on the necklace product rather than the model, not for privacy reasons.