Instead of deriving the result, we start from the result and work out an intuitive understanding of why it is reasonable.
First recall that all matrices can be decomposed into (Rotation 1) * (Scale) * (Rotation 2) using a singular value decomposition; assume we've done this decomposition:
(M^-1)^T = ((R1 S1 R2)^-1)^T
Now take the inverse transpose on each matrix individually; first the inverse gives us:
(R2^-1 S^-1 R1^-1)^T (reversing order)
and the transpose gives us
(R1^T)^-1 (S^T)^-1 (R2^T)^-1 (reversing order again)
Now, note that the transpose of a rotation matrix IS the inverse (b/c it's orthogonal), and the transpose of a scale matrix does not change the scale matrix at all, because it's diagonal. So really we have:
R1 S^-1 R2
In other words, the inverse-transpose of a matrix leaves the rotation unchanged, while inverting the scale.
Intuitively, leaving the rotation unchanged makes sense: for rigid transforms, right angles (all angles) are preserved, so we can just rotate the surface normal in the same way as the surface itself.
Now we can look at how the normal changes as we stretch a sphere into an ellipse (say, stretch it along the X axis): as the ellipse approaches a cylinder, the forward transform would increasingly cause any normal to be parallel to the +X vector. We want the normal to move in the opposite way, and in the limit to lie in the perpendicular YZ plane; therefore the inverse scale makes a great deal of sense.