Estimating Overlap in Scaled Audiences: A Principled Approach Using I-Projection #

Introduction #

In Audience Intelligence, we often rely on observed data from a panel (collected via social media platforms, user devices, surveys, or third-party measurement providers) to understand behaviors and affinities across different audience segments.

Suppose we observe:

A: An audience segment of interest (e.g. people who follow a brand or visited a website)
B: A second audience (e.g. people who engaged with a product, followed an influencer, or purchased a category)
I: The observed overlap between A and B in the panel

From other data sources (e.g. CRM systems, census data, digital reach estimates), we might know:

A′: The estimated true size of audience A in the full population
B′: The estimated true size of audience B in the full population

We then wish to estimate:

I′: The expected intersection between A′ and B′ in the full population, consistent with the original observed relationship between A and B.

This extrapolation is not straightforward: naive methods (like linear scaling or assuming independence) can produce invalid results, such as ( I′ > \min(A′, B′) ), or ignore the statistical dependencies in the observed data.
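To make the failure modes concrete, here is a small sketch with hypothetical numbers (they are not taken from any real panel). It assumes that "linear scaling" means multiplying the observed overlap by both growth factors, which is only one possible reading of that term.

```python
# Hypothetical panel and target sizes, chosen only to illustrate the problem.
A, B, I = 10_000, 20_000, 8_000      # observed sizes and observed overlap
A_new, B_new = 20_000, 100_000       # target sizes A' and B'
P = 1_000_000                        # total population size

# Assumed naive rule 1: scale the overlap by both growth factors.
linear = I * (A_new / A) * (B_new / B)

# Naive rule 2: treat A' and B' as independent in the population.
independent = A_new * B_new / P

# No overlap can exceed the smaller of the two audiences.
upper_bound = min(A_new, B_new)

print(f"linear scaling: {linear:,.0f} (logical upper bound: {upper_bound:,})")
print(f"independence:   {independent:,.0f}")
```

With these inputs the linear rule returns an overlap larger than the smaller audience, while the independence rule returns a valid number that discards the strong association seen in the panel (80% of A was inside B).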

We propose a principled, smooth, and closed-form solution based on information theory, specifically the I-projection (the minimum-KL-divergence projection).

Problem Definition #

Let:

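( P ): Total population size (the universe in which both the observed and the target sizes are counted, since the formulas below use the same ( P ) for both)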
( A ), ( B ), ( I ): Observed sizes in a panel
( A′ ), ( B′ ): Target (known or estimated) population sizes
( I′ ): Unknown intersection to estimate

We assume that the joint distribution of the binary variables “in A” and “in B” is preserved in a certain statistical sense when we move from the panel to the full population.

Method: KL-Minimizing I-Projection #

The I-projection seeks the distribution closest (in KL divergence) to the original panel joint distribution that matches the new marginals ( A′ ), ( B′ ).

For two binary variables, this projection has a simple and interpretable property:

It preserves the original odds ratio, while adjusting the marginals.

Step 1: Compute the observed odds ratio #

[ \theta = \frac{I \cdot (P - A - B + I)}{(A - I)(B - I)} ]
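A minimal sketch of Step 1 in Python. The function name and the validation rules are illustrative assumptions, not part of the method itself; in particular, the sketch simply refuses tables where the odds ratio would be infinite.

```python
def odds_ratio(a: float, b: float, i: float, p: float) -> float:
    """Odds ratio of the 2x2 table implied by sizes a, b, overlap i, total p."""
    both = i                  # in A and in B
    only_a = a - i            # in A, not in B
    only_b = b - i            # in B, not in A
    neither = p - a - b + i   # in neither audience
    if min(both, only_a, only_b, neither) < 0:
        raise ValueError("inputs do not describe a valid 2x2 table")
    if only_a == 0 or only_b == 0:
        # One audience is fully contained in the other: the ratio is infinite.
        raise ValueError("odds ratio is not finite when A - I or B - I is zero")
    return (both * neither) / (only_a * only_b)


# Sizes from the Example section of this article.
theta = odds_ratio(10_000, 20_000, 2_000, 1_000_000)
```

A value of theta above 1 means the two audiences overlap more than independence would predict; a value below 1 means they overlap less.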

Step 2: Solve for the new intersection ( I′ ) #

Let ( N′ = P - A′ - B′ + I′ ). We want:

[ \frac{I′ \cdot N′}{(A′ - I′)(B′ - I′)} = \theta ]

This yields a quadratic equation in ( I′ ):

[ (1 - \theta) I′^2 + [P - A′ - B′ + \theta(A′ + B′)] I′ - \theta A′ B′ = 0 ]
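For readers who want the intermediate algebra, the quadratic follows from substituting ( N′ ) into the odds-ratio condition, multiplying out, and collecting powers of ( I′ ). The block below is a sketch of that expansion, written with standard LaTeX primes:

```latex
\begin{aligned}
I'\,(P - A' - B' + I') &= \theta\,(A' - I')(B' - I') \\
I'^2 + (P - A' - B')\,I' &= \theta\,I'^2 - \theta\,(A' + B')\,I' + \theta A' B' \\
(1 - \theta)\,I'^2 + \bigl[P - A' - B' + \theta\,(A' + B')\bigr]\,I' - \theta A' B' &= 0
\end{aligned}
```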

Step 3: Choose the valid root #

Define:

[ a = 1 - \theta, \quad b = P - A′ - B′ + \theta(A′ + B′), \quad c = -\theta A′ B′ ]

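Then, for ( \theta \neq 1 ):

[ I′ = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} ]

If ( \theta = 1 ) (independence), then ( a = 0 ), the equation becomes linear, and ( I′ = A′ B′ / P ).

Pick the root that lies within the valid bounds. For ( 0 < A′, B′ < P ) and a finite ( \theta > 0 ), this is the root with the plus sign: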

[ \max(0, A′ + B′ - P) \le I′ \le \min(A′, B′) ]
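Steps 2 and 3 can be sketched in Python as follows. The function name, argument names, and input checks are assumptions made for illustration. The root is written in the algebraically equivalent form -2c / (b + sqrt(b^2 - 4ac)), a rearrangement of the plus-sign root that avoids dividing by a = 0 when theta equals 1.

```python
import math


def solve_intersection(theta: float, a_new: float, b_new: float, p: float) -> float:
    """Return I' within the logical bounds whose 2x2 table has odds ratio theta."""
    if not (0 < a_new < p and 0 < b_new < p):
        raise ValueError("expected 0 < A', B' < P")
    if theta <= 0 or math.isinf(theta):
        raise ValueError("expected a finite, positive odds ratio")

    # Coefficients of the quadratic a*I'^2 + b*I' + c = 0.
    a = 1.0 - theta
    b = p - a_new - b_new + theta * (a_new + b_new)
    c = -theta * a_new * b_new
    disc = b * b - 4.0 * a * c

    # Same value as (-b + sqrt(disc)) / (2a), but still defined when a == 0,
    # where it reduces to the independence value A'B'/P.
    i_new = -2.0 * c / (b + math.sqrt(disc))

    # Clamp only to absorb floating-point round-off at the boundaries.
    lower = max(0.0, a_new + b_new - p)
    upper = min(a_new, b_new)
    return min(max(i_new, lower), upper)


# Example call with an odds ratio of 13.5 and the targets from the Example section.
estimate = solve_intersection(13.5, 50_000, 80_000, 1_000_000)
```

Selecting the root analytically rather than computing both and filtering keeps the function branch-free, but checking the result against the bounds remains a sensible safeguard.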

Why This Works #

This method:

Guarantees a valid estimate ( I′ ) within known logical bounds
Is nonlinear but smooth, avoiding sharp jumps
Preserves statistical dependence (via the odds ratio)
Has a closed-form solution via a single quadratic

This approach is equivalent to finding the maximum entropy estimate under fixed marginals and interaction, or solving for the I-projection in a log-linear model with fixed marginals.
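One way to check this equivalence numerically is iterative proportional fitting (IPF), a standard procedure for computing the I-projection onto fixed marginals. The sketch below is an illustrative cross-check and not part of the closed-form method; the function name, tolerance, and iteration cap are arbitrary assumptions.

```python
def ipf_2x2(table, row_targets, col_targets, tol=1e-6, max_iter=10_000):
    """Rescale a 2x2 table of positive counts until its row and column sums match the targets."""
    t = [[float(x) for x in row] for row in table]
    for _ in range(max_iter):
        # Scale each row to its target sum.
        for r in range(2):
            s = t[r][0] + t[r][1]
            t[r] = [x * row_targets[r] / s for x in t[r]]
        # Scale each column to its target sum.
        for c in range(2):
            s = t[0][c] + t[1][c]
            for r in range(2):
                t[r][c] *= col_targets[c] / s
        # Columns now match exactly; stop once the rows match as well.
        if all(abs(t[r][0] + t[r][1] - row_targets[r]) < tol for r in range(2)):
            break
    return t


P = 1_000_000
A, B, I = 10_000, 20_000, 2_000
A_new, B_new = 50_000, 80_000

# Rows: in A / not in A. Columns: in B / not in B.
observed = [[I, A - I], [B - I, P - A - B + I]]
fitted = ipf_2x2(observed, [A_new, P - A_new], [B_new, P - B_new])
I_new = fitted[0][0]
print(round(I_new))
```

Row and column rescaling never changes the odds ratio of a 2x2 table, so the fitted table should agree with the root of the quadratic up to the chosen tolerance.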

Example #

Suppose:

( P = 1,000,000 )
( A = 10,000 ), ( B = 20,000 ), ( I = 2,000 )
( A′ = 50,000 ), ( B′ = 80,000 )

Then:

Compute the odds ratio:

[ \theta = \frac{2,000 \cdot (1,000,000 - 10,000 - 20,000 + 2,000)}{(10,000 - 2,000)(20,000 - 2,000)} = \cdots ]

Plug into the quadratic to find ( I′ )
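Working the numbers through with the formulas from Steps 1 to 3 gives the following values (rounded; a sketch of the arithmetic under those formulas):

```latex
\begin{aligned}
\theta &= \frac{2{,}000 \cdot 972{,}000}{8{,}000 \cdot 18{,}000} = 13.5 \\
a &= 1 - 13.5 = -12.5 \\
b &= 1{,}000{,}000 - 130{,}000 + 13.5 \cdot 130{,}000 = 2{,}625{,}000 \\
c &= -13.5 \cdot 50{,}000 \cdot 80{,}000 = -5.4 \times 10^{10} \\
b^2 - 4ac &= 6.890625 \times 10^{12} - 2.7 \times 10^{12} = 4.190625 \times 10^{12} \\
I' &= \frac{-2{,}625{,}000 + \sqrt{4.190625 \times 10^{12}}}{2 \cdot (-12.5)} \approx 23{,}116
\end{aligned}
```

The other root, approximately 186,884, exceeds min(A′, B′) = 50,000 and is discarded. The retained value satisfies 0 ≤ 23,116 ≤ 50,000. For comparison, assuming independence would give A′B′/P = 4,000.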

Applications #

Estimating affinities in large audiences (e.g. affinity between brand audiences and interest groups)
Adjusting panel-based co-occurrence data to population-scale insights
Media planning and reach overlap estimation
Lookalike modeling or campaign targeting evaluation

Conclusion #

By modeling audience intersection scaling as an I-projection, we obtain a principled and interpretable method for estimating audience overlaps. The approach preserves the dependence observed in the panel (the odds ratio) while matching the new population marginals, which is what accurate, large-scale Audience Intelligence requires.