Here L is the expected loss, θ̂ is the minimizer of the training loss, and θ∗ is the ground-truth parameter. This result partially explains why the MLE, which has the smallest training loss, is also likely to achieve a small testing error when there are enough training examples. One limitation of the above result is that it requires well-specifiedness, i.e., that the data are distributed exactly according to some ground-truth parameter θ∗ in the parameter space. We would like to prove a more general result of the following form without assuming well-specifiedness:

L(θ̂) − L(θ∗) ≤ f(p, n),  ∀ p, n ≥ 1.
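As a concrete illustration of the well-specified case (a minimal sketch, not from the source; the Gaussian setup, ground-truth value, and trial counts are assumed for the example): for estimating the mean of a unit-variance Gaussian under negative log-likelihood loss, the MLE θ̂ is the sample mean, and the excess population loss L(θ̂) − L(θ∗) works out to (θ̂ − θ∗)²/2, which shrinks as the number of training examples n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0  # assumed ground-truth mean for this illustration


def excess_loss(n):
    # Well-specified data: x_i ~ N(theta_star, 1).
    # The MLE for the mean (minimizer of the training NLL) is the sample mean.
    x = rng.normal(theta_star, 1.0, size=n)
    theta_hat = x.mean()
    # For unit-variance Gaussian NLL, the excess population loss
    # L(theta_hat) - L(theta_star) equals (theta_hat - theta_star)^2 / 2.
    return (theta_hat - theta_star) ** 2 / 2


# Averaging over trials, the excess loss decays roughly like 1/(2n).
results = {}
for n in (10, 100, 1000):
    results[n] = np.mean([excess_loss(n) for _ in range(2000)])
    print(n, results[n])
```

The simulation shows the qualitative behavior the result above formalizes: with more training examples, the training-loss minimizer achieves a population loss closer to that of the ground-truth parameter.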