DSelect-k
DSelect-k is a sparse gate for Mixture-of-Experts that allows explicit control over the number of experts to select. It is based on a novel binary encoding formulation that is continuously differentiable, making it compatible with first-order methods such as stochastic gradient descent.

The Problem with Existing Sparse Gates

Existing sparse gates, such as Top-k, are not smooth. This lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based