Project Waifu – CNN

The speaker verification system now uses a convolutional neural network (CNN) rather than the ANN described here. The new algorithm gives the speaker verification system a substantial improvement in performance, both in accuracy and in resource usage.

The Performance of the CNN

The CNN, without much hyperparameter tuning, reaches a cost below 0.1% within a few hundred epochs, whereas the old algorithm needs over a thousand epochs to reach similar performance. The CNN also trains considerably faster than the ANN, although part of that speedup may come from the CUDA implementation.
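For reference, the epoch/cost numbers above could be measured with a loop like the following. This is a generic PyTorch sketch (the post does not name a framework), and the placeholder model and random tensors stand in for the real network and MFCC inputs:

import torch
import torch.nn as nn

# Placeholder model and random data standing in for the real
# speaker-verification net; the actual architecture is described
# in the next section.
model = nn.Sequential(nn.Flatten(), nn.Linear(39 * 10, 2))
inputs = torch.randn(256, 1, 39, 10)
labels = torch.randint(0, 2, (256,))

optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

for epoch in range(300):  # "a few hundred epochs"
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        print(f"epoch {epoch}: cost {loss.item():.4f}")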

The Structure of the CNN

The input data, like the previous ANN's, consists of MFCC features with 39 cepstral coefficients; you can read about it here. Rather than stacking the frames into a single vector, the CNN arranges them into a 39×10 matrix (39 MFCC coefficients by 10 frames). This matrix is fed into a multi-layer CNN built from basic convolution and pooling layers. At the end of the network, a fully connected layer followed by a softmax layer outputs the result as either [1, 0] or [0, 1].
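A minimal PyTorch sketch of this layout is shown below. The post does not specify channel counts, kernel sizes, or the number of layers, so those numbers are illustrative assumptions; only the 39×10 input shape and the conv/pool/fully-connected/softmax structure come from the description above:

import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    """Sketch of the described architecture: convolution and pooling
    layers over a 39x10 MFCC matrix, then a fully connected softmax
    head with two outputs."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # input: (batch, 1, 39, 10) -- 39 MFCC coefficients x 10 frames
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),           # -> (16, 19, 5)
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),           # -> (32, 9, 2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 2, 2),  # two classes: target speaker or not
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        # softmax pushes the output toward [1, 0] or [0, 1]
        return torch.softmax(logits, dim=1)

# Example: score a batch of four 39x10 MFCC windows.
model = SpeakerCNN()
probs = model(torch.randn(4, 1, 39, 10))  # shape: (4, 2)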

Waifu GUI

Waifu GUI had some poor design choices, so a large portion of it was rewritten. Although there is little visible difference from the user's perspective, the rewrite cleans up a lot of code for future development. The following screenshots were taken in "dark" mode.

The loading page while training the CNN

Using the CNN to predict the speaker
