How generalizing the notion of VC dimension to a strategic setting can help us understand whether or not a problem is learnable

Image generated by the author using DALL-E 3.

In the first article in this series, we formally defined the strategic classification problem, denoted Sᴛʀᴀᴄ⟨H, R, c⟩, as a generalization of canonical binary classification. We did so based on the paper PAC-Learning for Strategic Classification (Sundaram, Vullikanti, Xu, & Yao, 2021). Along the way, we explored why we should care about considering the various preferences of rational agents during classification and how we can do so (subject to certain assumptions). We will rely heavily on the concepts introduced in the previous article, so I encourage you to read it if you haven’t already.

Extending PAC Learning to a Strategic Classification Setting

We’ll pick up where we left off, using the definition of Sᴛʀᴀᴄ⟨H, R, c⟩ as a jumping-off point for the useful concept of strategic VC dimension (SVC). Once we make sense of SVC, what I call the Fundamental Theorem of Strategic Learning will follow rather naturally.

While helpful, prior familiarity with shattering coefficients, the canonical VC dimension, and the Fundamental Theorem of Statistical Learning will not be necessary for you to follow along. However, there’s far more depth to each of them than I could ever hope to cover as part of this series, let alone in a single article. The curious reader is referred to Andrew Rothman’s wonderful and very thorough articles on the (canonical) shattering coefficient and VC dimension.

Statistical Learning Theory Part 5: Shattering Coefficient
Statistical Learning Theory Part 6: Vapnik–Chervonenkis (VC) Dimension

As we’ll see, strategic shattering coefficients and SVC are fairly natural generalizations of their canonical (i.e., non-strategic) counterparts. We will therefore begin with a brief rundown of each of those counterparts before explaining how they can be modified to work in a strategic setting.

Counting Achievable Labelings: Canonical Shattering Coefficients

Verbally defining shattering coefficients seems straightforward at first glance:

Given a hypothesis class H, its nᵗʰ shattering coefficient, denoted Sₙ(H), represents the largest number of labelings achievable by classifiers in H on a sample of n feature vectors.

But what is a “labeling”? And what makes it “achievable”? Answering those questions will help us lay some groundwork in pursuit of a more formal definition.

In the context of binary classification, a labeling of a sample of feature vectors is simply any one of the ways we can assign values from the set { -1, 1 } to those vectors. As a very simple example, consider two one-dimensional feature vectors (i.e., points on a number line), x₁ = 1 and x₂ = 2.

A visualization of the four possible labelings of the sample x₁ = 1, x₂ = 2. Red points are negatively classified, blue ones are positively classified. Image by the author.

The possible labelings are any combination of the classification values we can assign the individual feature vectors independent of one another. We can represent each labeling as a vector, where the first and second coordinate represent the values assigned to x₁ and x₂, respectively. The set of possible labelings is thus { (-1, -1), (-1, 1), (1, -1), (1, 1) }. Note that a sample of size 2 yields 2² = 4 possible labelings — we’ll see how this generalizes to arbitrarily-sized samples soon.

We say that a labeling is achievable by a hypothesis class H if there exists a classifier h ∈ H from which that labeling can result. Continuing with our simple example, suppose we are limited to classifiers of the form x ≥ k, k ∈ ℝ, that is, one-dimensional thresholds such that anything to the right of the threshold is positively classified. The labeling (1, -1) is not achievable by this hypothesis class: x₂ being greater than x₁ implies that any threshold that classifies x₁ positively must do the same for x₂. The set of achievable labelings is therefore { (-1, -1), (-1, 1), (1, 1) }.

Examples of one-dimensional threshold classifiers that can be used to achieve all but one of the possible labelings of the sample x₁ = 1, x₂ = 2. Image by the author.
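
To make this concrete, here’s a quick Python sketch that brute-forces both the possible and the achievable labelings for our sample. The handful of candidate thresholds is an assumption of the sketch; it suffices here because only where the threshold falls relative to the sample points matters.

```python
from itertools import product

# The sample from our running example: two one-dimensional feature vectors.
sample = [1, 2]

# All 2^n possible labelings of the sample.
possible_labelings = set(product([-1, 1], repeat=len(sample)))

def threshold_classifier(k):
    """One-dimensional threshold: anything at or to the right of k is classified +1."""
    return lambda x: 1 if x >= k else -1

# A few candidate thresholds suffice: one below both points, one between them,
# and one above both.
candidate_thresholds = [0.5, 1.5, 2.5]

achievable_labelings = {
    tuple(threshold_classifier(k)(x) for x in sample)
    for k in candidate_thresholds
}

print(sorted(possible_labelings))    # [(-1, -1), (-1, 1), (1, -1), (1, 1)]
print(sorted(achievable_labelings))  # [(-1, -1), (-1, 1), (1, 1)] -- (1, -1) is missing
```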

Having understood the basic terminology, we can start to develop some notation to formally express elements of the verbal definition with which we started.

We stick to representing labelings as vectors as we did in our simple example, with each coordinate representing the classification value assigned to the corresponding feature vector. There are 2ⁿ possible labelings in total: there are two possible choices for each feature vector, and we can think of a labeling as a collection of n such choices, each made independently of the rest. If a hypothesis class H can achieve all possible labelings of a sample 𝒞ₙ of size n, i.e., if the number of achievable labelings of 𝒞ₙ equals 2ⁿ, we say that H shatters 𝒞ₙ.

Finally, using the notation from above, we converge on a more rigorous definition of Sₙ(H):
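
Sₙ(H) = max over samples 𝒞ₙ = { x₁, …, xₙ } of |{ (h(x₁), …, h(xₙ)) : h ∈ H }|

In words: for every sample of n feature vectors, count the labelings that classifiers in H can achieve on it, then take the largest such count.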

In keeping with our explanation of shattering, Sₙ(H) equalling 2ⁿ implies that there exists a sample of size n that is shattered by H.

Estimating Hypothesis Class Expressiveness: Canonical VC Dimension

The Vapnik–Chervonenkis (VC) dimension is a way to gauge the expressive power of a hypothesis class. It’s based on the idea of shattering we just defined, and it plays an important role in helping us determine which hypothesis classes are PAC learnable and which aren’t.

Let’s begin by attempting to intuitively define the canonical VC dimension:

Given a hypothesis class H, its VC dimension, denoted VCdim(H), is defined to be the greatest natural number n for which there exists a sample of size n that is shattered by H.

Using Sₙ(H) enables us to express this much more cleanly and succinctly:

VCdim(H) = max{ n ∈ ℕ : Sₙ(H) = 2ⁿ }

However, this definition isn’t precise. Note that the set of numbers for which the shattering coefficient equals 2ⁿ may be infinite. (Consequently, it is possible that VCdim(H) = ∞.) If that’s the case, the set has no well-defined maximum. We address this by taking the supremum instead:

VCdim(H) = sup{ n ∈ ℕ : Sₙ(H) = 2ⁿ }

This rigorous and concise definition is the one we’ll use moving forward.
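
As a quick worked example, consider the one-dimensional threshold classifiers x ≥ k from before. Any single point can be labeled both ways (place the threshold at or below it for a positive label, above it for a negative one), so samples of size 1 are shattered. No sample of two distinct points can be shattered, however, since the labeling that assigns 1 to the smaller point and -1 to the larger one is unachievable. The VC dimension of this hypothesis class is therefore exactly 1.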

Adding Preferences to the Mix: Strategic Shattering Coefficients

Generalizing the canonical notions we just went over so that they work in a strategic setup is fairly straightforward. Redefining shattering coefficients in terms of the data point best response we defined in the previous article is practically all we’ll have to do.

Given a hypothesis class H, a preference set R, and a cost function c, the nᵗʰ shattering coefficient of Sᴛʀᴀᴄ⟨H, R, c⟩, denoted σₙ(H, R, c), represents the largest number of labelings achievable by classifiers in H on a set of n potentially-manipulated feature vectors, i.e., n data point best responses.

As a reminder, this is how we defined the data point best response:
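
Δ(x, r; h) = argmax over z of [ 𝕀(h(z) = 1) ⋅ r − c(x, z) ]

That is, the data point responds to the classifier h by moving to whichever feature vector z maximizes its utility: the realized reward for being positively classified minus the cost of manipulating x into z.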

We can tweak the notation we used in our discussion of canonical shattering coefficients to further formalize this:
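
σₙ(H, R, c) = max over samples { (x₁, r₁), …, (xₙ, rₙ) } of |{ (h(Δ(x₁, r₁; h)), …, h(Δ(xₙ, rₙ; h))) : h ∈ H }|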

The main difference is that each x in the sample has to have a corresponding r. Other than that, putting the data point best response where we had x in the canonical case works smoothly.

As a quick sanity check, let’s consider what happens if R = { 0 }. The realized reward term 𝕀(h(z) = 1) ⋅ r will be 0 across all the data points. Maximizing utility thus becomes synonymous with minimizing cost. The best way to minimize the cost incurred by a data point is trivial: never manipulating its feature vector.

Δ(x, r; h) ends up always just being x, placing us firmly within the territory of canonical classification. It follows that σₙ(H, { 0 }, c) = Sₙ(H) for all H, c. This is consistent with our observation that the impartial preference class represented by R = { 0 } is equivalent to canonical binary classification.
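
For intuition, here’s a rough Python sketch of the best response computed by brute force over a finite grid of candidate feature vectors. The threshold classifier, the quadratic cost, and the tie-breaking rule (prefer the cheaper option) are all assumptions of the sketch rather than anything fixed by the paper.

```python
def best_response(x, r, h, candidates, cost):
    """Brute-force data point best response over a finite candidate set.

    Chooses the candidate z that maximizes realized reward minus manipulation
    cost, breaking ties in favor of the cheaper (less manipulated) option.
    """
    return max(
        candidates,
        key=lambda z: ((1 if h(z) == 1 else 0) * r - cost(x, z), -cost(x, z)),
    )

# Hypothetical setup: a threshold classifier at 3 and a quadratic manipulation cost.
h = lambda z: 1 if z >= 3 else -1
cost = lambda x, z: (z - x) ** 2
candidates = [i / 10 for i in range(0, 51)]  # a grid over [0, 5]

print(best_response(2.0, r=0, h=h, candidates=candidates, cost=cost))  # 2.0 -- no reward, so no manipulation
print(best_response(2.0, r=2, h=h, candidates=candidates, cost=cost))  # 3.0 -- crossing the threshold is worth the cost
```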

Expressiveness with Preferences: Strategic VC Dimension (SVC)

Having defined the nᵗʰ strategic shattering coefficient, we can simply swap out the Sₙ(H) in the canonical definition of the VC dimension for σₙ(H, R, c).

SVC(H, R, c) = sup{ n ∈ ℕ : σₙ(H, R, c) = 2ⁿ }

Based on the example we considered above, we find that SVC(H, { 0 }, c) = VCdim(H) for any H, c. Indeed, SVC is to VCdim as the strategic shattering coefficient is to its canonical equivalent: both are elegant generalizations of non-strategic concepts.

From SVC to Strategic PAC Learnability: The Fundamental Theorem of Strategic Learning

We can now use SVC to state the Fundamental Theorem of Strategic Learning, which relates the complexity of a strategic classification problem to its (agnostic) PAC learnability.

A strategic classification instance Sᴛʀᴀᴄ⟨H, R, c⟩ is agnostic PAC learnable if and only if SVC(H, R, c) is finite. The sample complexity for strategic agnostic PAC learning is m(δ, ε) ≤ Cε⁻² ⋅ (SVC(H, R, c) + log(1/δ)), with C being a constant.
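
As a sanity check, consider the impartial preference class once more: since SVC(H, { 0 }, c) = VCdim(H), the theorem tells us that the corresponding strategic instance is agnostic PAC learnable exactly when VCdim(H) is finite, which is precisely what the Fundamental Theorem of Statistical Learning says about canonical binary classification.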

We won’t elaborate too much on how this can be proven. Suffice it to say that it boils down to a clever reduction to the (well-documented) Fundamental Theorem of Statistical Learning, which is essentially the non-strategic version of the theorem. If you’re mathematically inclined and interested in the nuts and bolts of the proof, you can find it in Appendix B of the paper.

This theorem essentially completes our generalization of classic PAC learning to a strategic classification setting. It shows that the way we defined SVC doesn’t just make sense in our heads; it actually works as a generalization of VCdim where it matters most. Armed with the Fundamental Theorem, we are well-equipped to analyze strategic classification problems much as we would any old binary classification problem. In my opinion, having the ability to determine whether a strategic problem is theoretically learnable or not is pretty incredible!

Conclusion

We began by introducing the canonical shattering coefficient and VC dimension, which are central to PAC learning (and the Fundamental Theorem of Statistical Learning). Next, we leveraged the data point best response to generalize the aforementioned concepts so that they would work in our strategic setup. We defined the strategic VC dimension (SVC) and showed that when faced with the impartial preference class, it degenerates back into the canonical VC dimension. Finally, we demonstrated how SVC relates to strategic PAC learnability by means of the Fundamental Theorem of Strategic Learning.

In the final article in this series, we’ll build on the concepts I’ve introduced here to break down my favorite proof in the paper, which I think is a beautiful illustration of strategic classification and SVC in action.

References

[1] R. Sundaram, A. Vullikanti, H. Xu, F. Yao. PAC-Learning for Strategic Classification (2021), International Conference on Machine Learning.
