Influence is a measure of data quality, i.e., the degree to which a training example affects the model and its predictive performance. Quantifying influence is difficult due to the complexity of deep learning models as well as their growing size, feature counts, and datasets. There are typically two classes of approaches: sample-based and feature-based.
TracIn (Trace Inference) is a method that traces the training process to capture changes in prediction as individual training examples are visited. TracIn can compute the influence of training examples on individual predictions and identify outliers that exhibit high self-influence (e.g., mislabeled or rare examples).
Let $Z$ represent the space of examples. We train predictors parameterized by a weight vector $w \in \mathbb{R}^p$. We measure the performance of a predictor via a loss function $l : \mathbb{R}^p \times Z \to \mathbb{R}$, and we represent the loss of a predictor parameterized by $w$ on an example $z$ by $l(w, z)$. Given a training set $S$ of $n$ points $z_1, \ldots, z_n \in Z$, we train the predictor by finding parameters $w$ that minimize the training loss $\sum_{i=1}^n l(w, z_i)$ via some optimization procedure like stochastic gradient descent, which utilizes one training example $z_t$ in iteration $t$, updating the parameters from $w_t$ to $w_{t+1}$. We then define the idealized notion of influence of a particular training example $z \in S$ on a given test example $z'$ by
$$ \text{TracInIdeal}(z, z') = \sum_{t: z_t = z} l(w_t, z') - l(w_{t+1}, z') $$
Notice that this idealized definition assumes one training example is visited per iteration, so it does not directly apply to mini-batch training (we return to mini-batches below).
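As a concrete illustration, here is a minimal PyTorch sketch of TracInIdeal for plain SGD with batch size one. The model, loss function, and `target_idx` argument (marking the visits where $z_t = z$) are illustrative assumptions, not part of the definition above:

```python
import torch
import torch.nn as nn

def tracin_ideal(model, train_examples, z_test, target_idx, lr=0.1, epochs=3):
    """Sum of test-loss drops over the visits to train_examples[target_idx]."""
    loss_fn = nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x_test, y_test = z_test
    influence = 0.0
    for _ in range(epochs):
        for i, (x, y) in enumerate(train_examples):
            track = (i == target_idx)  # this iteration visits z_t == z
            if track:
                with torch.no_grad():
                    loss_before = loss_fn(model(x_test), y_test).item()
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()  # w_t -> w_{t+1} using the single example (x, y)
            if track:
                with torch.no_grad():
                    loss_after = loss_fn(model(x_test), y_test).item()
                # accumulate l(w_t, z') - l(w_{t+1}, z')
                influence += loss_before - loss_after
    return influence
```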
We term training examples with a positive influence score proponents, as they serve to reduce the test loss, and examples with a negative influence score opponents, as they serve to increase it.
By Taylor’s theorem, we have the approximation
$$ l(w_{t+1}, z') = l(w_t, z') + \nabla l(w_t, z')\cdot (w_{t+1} - w_t) + O(||w_{t+1} - w_t||^2) $$
If stochastic gradient descent is utilized in training the model, visiting the training point $z_t$ at iteration $t$, then the change in parameters is $w_{t+1} - w_t = -\eta_t\nabla l(w_t, z_t)$, where $\eta_t$ is the step size in iteration $t$. Combining this with the approximation above, we have
$$ l(w_t, z') - l(w_{t+1}, z') \approx \eta_t\nabla l(w_t, z') \cdot \nabla l(w_t, z_t) $$
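This first-order approximation is easy to sanity-check numerically. In the sketch below, after a single SGD step on a toy linear model (the model, data, and step size are illustrative assumptions), the measured drop in test loss should roughly match the predicted gradient dot product:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
loss_fn = nn.MSELoss()
x_t, y_t = torch.randn(1, 5), torch.randn(1, 1)  # training point z_t
x_p, y_p = torch.randn(1, 5), torch.randn(1, 1)  # test point z'
eta = 0.01

def grads(x, y):
    """Loss gradient at the current parameters, flattened to one vector."""
    gs = torch.autograd.grad(loss_fn(model(x), y), model.parameters())
    return torch.cat([g.reshape(-1) for g in gs])

# predicted drop: eta * grad l(w_t, z') . grad l(w_t, z_t)
predicted = eta * torch.dot(grads(x_p, y_p), grads(x_t, y_t)).item()

# actual drop: take one SGD step on z_t and re-measure the test loss
loss_before = loss_fn(model(x_p), y_p).item()
opt = torch.optim.SGD(model.parameters(), lr=eta)
opt.zero_grad()
loss_fn(model(x_t), y_t).backward()
opt.step()
loss_after = loss_fn(model(x_p), y_p).item()

print(predicted, loss_before - loss_after)  # should be close for small eta
```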
Using this difference, we can approximate the ideal TracIn by
$$ \text{TracIn}(z, z') = \sum_{t: z_t = z} \eta_t \nabla l(w_t, z') \cdot \nabla l(w_t, z) $$
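In practice one cannot replay every training iteration, so a common implementation strategy (an assumption here, not spelled out above) is to approximate the sum using parameters saved at checkpoints near the visits to $z$. A minimal sketch:

```python
import torch

def flat_grad(model, loss_fn, z):
    """Loss gradient at the model's current parameters, flattened."""
    x, y = z
    grads = torch.autograd.grad(loss_fn(model(x), y), model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_score(checkpoints, loss_fn, z, z_test):
    # checkpoints: list of (model, eta) pairs standing in for the parameters
    # w_t and step size eta_t at the iterations where z was visited.
    score = 0.0
    for model, eta in checkpoints:
        g_train = flat_grad(model, loss_fn, z)       # grad l(w_t, z)
        g_test = flat_grad(model, loss_fn, z_test)   # grad l(w_t, z')
        score += eta * torch.dot(g_test, g_train).item()
    return score
```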
To handle mini-batches of size $b \geq 1$, we compute the influence of a mini-batch on the test point as before: the SGD update becomes $w_{t+1} - w_t = -\frac{\eta_t}{b}\sum_{z_i \in B_t}\nabla l(w_t, z_i)$, and taking a first-order approximation as above yields

$$ l(w_t, z') - l(w_{t+1}, z') \approx \frac{\eta_t}{b}\sum_{z_i \in B_t} \nabla l(w_t, z') \cdot \nabla l(w_t, z_i) $$

so each example in the mini-batch is credited with a $\frac{1}{b}$ fraction of the usual gradient dot product.
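A sketch of the per-example contribution within one mini-batch, reusing `flat_grad` from the previous sketch; the $1/b$ factor mirrors the averaged gradient in the SGD update:

```python
def tracin_minibatch(model, loss_fn, batch, z_test, eta):
    """Influence of each example in one mini-batch B_t on the test point z'."""
    g_test = flat_grad(model, loss_fn, z_test)  # grad l(w_t, z')
    b = len(batch)
    # each example receives a 1/b share of the gradient dot product
    return [eta / b * torch.dot(g_test, flat_grad(model, loss_fn, z)).item()
            for z in batch]
```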