You might have seen the MSE formula written in the following form:

\[MSE = \frac{1}{2} \sum_{i=1}^{n} (y_i - \hat{y_i})^2\]

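where \(y_i\) is the ground truth and \(\hat{y_i}\) is the prediction. (Strictly speaking, a mean would also divide by \(n\), but that is just another constant factor and does not change the argument.) The question is, where does the \(\frac{1}{2}\) come from?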

The answer is simple: it is a constant chosen to make the derivative cleaner. Differentiating the squared term brings down a factor of 2 via the power rule, and the \(\frac{1}{2}\) cancels it, so the gradient carries no stray constant.
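
To make the cancellation concrete, here is a short worked derivative with respect to a single prediction \(\hat{y_j}\), first with the factor and then without it. The index \(j\) is introduced here only for illustration:

\[\frac{\partial}{\partial \hat{y_j}} \left[ \frac{1}{2} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \right] = \frac{1}{2} \cdot 2 (y_j - \hat{y_j}) \cdot (-1) = \hat{y_j} - y_j\]

\[\frac{\partial}{\partial \hat{y_j}} \left[ \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \right] = 2 (y_j - \hat{y_j}) \cdot (-1) = 2 (\hat{y_j} - y_j)\]

The two results differ only by the factor of 2.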

From a technical standpoint, it does not matter whether we use \(\frac{1}{2}\) or any other positive constant; we can even omit it. Multiplying the loss by a positive constant does not move its minimum. It only scales the gradient by that constant, which in gradient descent has the same effect as scaling the learning rate. Since the learning rate is a hyperparameter we control anyway, we can adjust it to compensate for the constant.
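
The equivalence can be checked numerically. The following is a minimal sketch, assuming a one-parameter linear model \(\hat{y} = w x\) and plain gradient descent. The toy data, the function names, and the `scale` parameter are all made up for illustration. It compares the loss with the \(\frac{1}{2}\) factor against the loss without it, using half the learning rate in the second case:

```python
def gradient(w, xs, ys, scale):
    """Gradient of scale * sum((y - w*x)^2) with respect to w."""
    return scale * sum(-2.0 * x * (y - w * x) for x, y in zip(xs, ys))


def gradient_descent(xs, ys, scale, learning_rate, steps=50, w0=0.0):
    """Run plain gradient descent and return every weight visited."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * gradient(w, xs, ys, scale)
        history.append(w)
    return history


# Toy data (hypothetical): y is exactly 3 * x, so the best weight is 3.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

# Loss with the 1/2 factor and learning rate 0.05 ...
with_half = gradient_descent(xs, ys, scale=0.5, learning_rate=0.05)
# ... versus loss without the factor and half the learning rate.
without_half = gradient_descent(xs, ys, scale=1.0, learning_rate=0.025)

# Largest difference between the two trajectories over all steps.
max_gap = max(abs(a - b) for a, b in zip(with_half, without_half))
print("largest gap between trajectories:", max_gap)
print("final weight:", with_half[-1])
```

Both runs should follow the same sequence of weights (up to floating-point rounding) and end close to 3, because the product of the learning rate and the constant is the same in both cases.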