The online advertising industry is based on identifying users with cookies, and showing relevant ads to interested users. But there are many data providers, many places to target ads and many people browsing online.

How are users identified across data providers?

The first step in solving this is by cookie mapping: a chain of server calls that pass identifiers across providers. Sadly, chains break, servers break, providers can be flaky or use caching and you may never see the whole of the chain. The solution to this problem is constructing an identity graph with the data we see: in our case, cookie ids are nodes, edges are relations and connected components of the graph are users.

In this talk learn how Hybrid Theory leverages Spark and GraphFrames to construct and maintain a 2000 million node identity graph with minimal computational cost.