The speaker verification system now uses a convolutional neural network (CNN) rather than the ANN described here. The new algorithm gives the system a substantial improvement in performance, both in accuracy and in resource usage.
The Performance of the CNN
The CNN, without much hyperparameter tuning, reaches a cost below 0.1% within a few hundred epochs, whereas the old algorithm needs over a thousand epochs to reach similar performance. The CNN also runs much faster than the ANN, although part of the speedup may come from the CUDA implementations.
The Structure of the CNN
The input data, as with the previous ANN, consists of MFCC features with 39 cepstral coefficients. You can read about it here. Rather than stacking the frames into a single vector, the CNN arranges them into a 39×10 matrix (39 MFCC features by 10 frames). The matrix is fed into a multi-layer CNN with basic convolution and pooling layers. At the end of the CNN, a fully connected layer and a softmax layer output the result as either [1,0] or [0,1].
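As a rough illustration, the forward pass described above can be sketched in NumPy. The post does not specify the layer sizes, so the number of filters, kernel size, pooling size, and hidden dimensions below are assumptions chosen only to show the data flow from the 39×10 input to the two-class softmax output.

```python
# Minimal sketch of the described CNN forward pass, in NumPy.
# All layer sizes are illustrative assumptions, not the actual model.
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, kernels):
    """Valid 2-D convolution: x is (H, W), kernels is (n, kh, kw)."""
    n, kh, kw = kernels.shape
    H, W = x.shape
    out = np.empty((n, H - kh + 1, W - kw + 1))
    for f in range(n):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[f, i, j] = np.sum(x[i:i+kh, j:j+kw] * kernels[f])
    return np.maximum(out, 0.0)  # ReLU activation

def max_pool(x, size=2):
    """Non-overlapping max pooling over each feature map."""
    n, H, W = x.shape
    H2, W2 = H // size, W // size
    return x[:, :H2*size, :W2*size].reshape(n, H2, size, W2, size).max(axis=(2, 4))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Input: 39 MFCC features by 10 frames, as in the post.
x = rng.standard_normal((39, 10))

# Convolution + pooling (assumed: 8 filters of size 3x3, 2x2 pooling).
feat = max_pool(conv2d(x, rng.standard_normal((8, 3, 3))))

# Fully connected layer + softmax over the two classes ([1,0] vs [0,1]).
flat = feat.ravel()
W_fc = rng.standard_normal((2, flat.size)) * 0.01
probs = softmax(W_fc @ flat)
print(probs.shape)  # (2,)
```

The two softmax outputs sum to one, so thresholding the first entry gives the accept/reject decision.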
Waifu GUI had some really bad design, so a large portion of it was rewritten. Although there is not much difference from the user’s perspective, the rewrite cleans up a lot of code for better future use. The following screenshots were taken using the “dark” mode.