Groups
0-1 loss directly measures classification error, but it is discontinuous and non-convex, making direct optimization computationally hard; in practice training minimizes a convex surrogate (e.g. hinge or cross-entropy loss) instead.
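A minimal NumPy sketch contrasting the 0-1 loss with one convex surrogate (the hinge loss); the function names and toy data below are illustrative assumptions, not from the source.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # 1 when the predicted label differs from the true label, else 0;
    # piecewise constant, so its gradient is zero almost everywhere.
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def hinge_loss(y_true_pm1, scores):
    # convex surrogate that upper-bounds the 0-1 loss via the margin
    # y * f(x); labels are assumed to be in {-1, +1}.
    return np.mean(np.maximum(0.0, 1.0 - y_true_pm1 * scores))

# hypothetical toy example
y = np.array([1, -1, 1, 1])
f = np.array([0.8, 0.3, -0.2, 2.0])   # real-valued classifier scores
print(zero_one_loss(y, np.sign(f)))   # 0.5 (two misclassifications)
print(hinge_loss(y, f))               # smooth enough to optimize by gradient methods
```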
Knowledge distillation loss blends standard hard-label cross-entropy with a KL-divergence match to the teacher's temperature-softened output distribution, with a weighting coefficient balancing the two terms.
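A sketch of this blended objective in PyTorch, following the common Hinton-style formulation; the function name and the temperature=4.0 and alpha=0.5 defaults are illustrative assumptions, not values from the source.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    # hard-label term: standard cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    # soft term: KL divergence between temperature-softened distributions;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # alpha blends the hard-label and soft-label objectives
    return alpha * hard + (1.0 - alpha) * soft

# hypothetical toy batch of 8 examples over 10 classes
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```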