Project Waifu: Speaker Verification

Project Waifu

Project Waifu is a long-term machine learning/deep learning project I will be working on. I will not reveal too much about it, but here’s the first part of the pipeline: speaker verification.

Text-Independent Speaker Verification

Speaker verification is the process of checking the identity of a speaker. In this case the answer is either 1 (the person we want to identify) or 0 (not that person). A lot of algorithms online use GMMs and/or create profiles for speakers. For this project, an MLP (multi-layer perceptron, a regular feed-forward neural network) is used, and because of the way it is structured, the algorithm performs pretty well.


Project Waifu (without revealing too much) requires an algorithm to slice out audio of an anime character speaking. Although it may sound appealing to first recognize where the character is speaking and then slice there, that approach actually performs badly because it is very inconsistent. To solve this issue, we slice the audio first with a Voice Activity Detector (VAD) and then go through each slice to predict whether it is the character or not. To be specific, I used py-webrtcvad, a Python interface to Google’s WebRTC VAD.
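To make the "slice first, classify later" idea concrete, here is a minimal sketch of turning per-frame voiced/unvoiced decisions into slices. It is not the project's actual code: the function name, the 30 ms frame length and the minimum run length are assumptions, and a toy detector stands in for the real VAD so the example runs on its own. With py-webrtcvad, the `is_speech` callable could wrap `webrtcvad.Vad(...).is_speech(frame, sample_rate)`, which expects 10, 20 or 30 ms frames of 16-bit mono PCM.

```python
from typing import Callable, List, Tuple


def slice_by_voice_activity(
    frames: List[bytes],
    is_speech: Callable[[bytes], bool],
    frame_ms: int = 30,          # assumed frame length
    min_voiced_frames: int = 3,  # assumed: drop runs shorter than this
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans covering runs of consecutive voiced frames."""
    slices = []
    start = None  # index of the first frame in the current voiced run
    for i, frame in enumerate(frames):
        if is_speech(frame):
            if start is None:
                start = i
        elif start is not None:
            # The voiced run just ended; keep it only if it is long enough.
            if i - start >= min_voiced_frames:
                slices.append((start * frame_ms, i * frame_ms))
            start = None
    # Close a run that lasts until the end of the audio.
    if start is not None and len(frames) - start >= min_voiced_frames:
        slices.append((start * frame_ms, len(frames) * frame_ms))
    return slices


def toy_is_speech(frame: bytes) -> bool:
    """Toy stand-in for a real VAD: any non-zero byte counts as voice."""
    return any(frame)


silence, voice = bytes(4), b"\x01\x02\x03\x04"
frames = [silence, voice, voice, voice, silence, voice, silence, voice, voice, voice]
spans = slice_by_voice_activity(frames, toy_is_speech)
print(spans)  # [(30, 120), (210, 300)]
```

The single voiced frame in the middle is dropped because it is shorter than the minimum run. Each remaining span would then be written out as its own wave file and passed to the classifier.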


The algorithm takes in MFCC features with 39 cepstral coefficients. To obtain them, I used a 20 ms sliding window with 10 ms of overlap. Yet just 20 ms of data is not enough, so 10 frames were concatenated into one instance (one input of the neural network) with a hop size of 3. This idea comes from the paper Neural Network Based Speaker Classification and Verification Systems with Enhanced Features by Zhenhao Ge. (The diagram on 3 shows the concatenation very well.)
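The concatenation step can be sketched with NumPy as follows. This is an illustration rather than the project's code: `stack_frames` is a hypothetical helper, the hop is assumed to be counted in frames, and random numbers stand in for real MFCCs. A 20 ms window with a 10 ms step gives 99 frames for one second of audio, and stacking 10 frames of 39 coefficients gives 390 values per instance.

```python
import numpy as np


def stack_frames(mfcc: np.ndarray, context: int = 10, hop: int = 3) -> np.ndarray:
    """Concatenate `context` consecutive frames into one row, advancing `hop` frames each time."""
    n_frames, n_coeffs = mfcc.shape
    if n_frames < context:
        # Too short to form even one instance.
        return np.empty((0, context * n_coeffs), dtype=mfcc.dtype)
    starts = range(0, n_frames - context + 1, hop)
    return np.stack([mfcc[s:s + context].reshape(-1) for s in starts])


# One second of audio, 20 ms window, 10 ms step: (1000 - 20) // 10 + 1 = 99 frames.
n_frames = (1000 - 20) // 10 + 1
mfcc = np.random.default_rng(0).normal(size=(n_frames, 39))  # placeholder features

instances = stack_frames(mfcc)
print(instances.shape)  # (30, 390)
```

Each row now covers roughly 110 ms of audio (ten 20 ms windows stepped by 10 ms), which gives the network far more context than a single frame.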

Deep Neural Network

The neural network is just a simple 4-layer (3 hidden layers and 1 output layer) feed-forward neural network. All hidden layers use ReLU, and the output layer uses a sigmoid to give the probability that the input is the target speaker. The implementation is again done in TensorFlow, so the code is almost identical to my previous attempt at using a DNN to predict website credibility.
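For readers who want to see the shape of such a network, here is a forward pass written in plain NumPy instead of TensorFlow. It is only a sketch under assumptions: the hidden width of 256, the He-style initialization, the single output unit and the 390-value input (10 frames of 39 coefficients) are my choices for illustration, not values taken from the project.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(sizes, seed=0):
    """One (weights, bias) pair per layer, with He-style random weights."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, x):
    """ReLU on every hidden layer, sigmoid on the output layer."""
    h = x
    for w, b in params[:-1]:
        h = relu(h @ w + b)
    w, b = params[-1]
    return sigmoid(h @ w + b).squeeze(-1)


sizes = [390, 256, 256, 256, 1]  # input, 3 hidden layers (assumed widths), 1 output
params = init_params(sizes)
batch = np.random.default_rng(1).normal(size=(5, 390))  # 5 placeholder instances
probs = forward(params, batch)
print(probs.shape)  # (5,)
```

Each value in `probs` is the network's score for one instance. With untrained random weights the scores are meaningless, so this only shows the data flow.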

Just a quick note: because the algorithm predicts every 10 ms on a 20 ms frame, even a few words can produce a few hundred instances. To give the final verdict, I averaged the predictions over those instances. This is not a really good way of doing it, because the VAD is actually not that good at separating a conversation (at least for anime). Consequently, some slices may contain more than one speaker, or just some random noise that was somehow classified as voice activity. That being said, averaging a lot of predictions helps with this inconsistency: I’ve only seen it classify samples with noise or another speaker as false, never as true. (It doesn’t hurt if we miss a few slices, as long as other speakers don’t get mixed in.)
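The averaging step is simple enough to show in a few lines. This is a sketch, not the project's code: the 0.5 threshold and the example scores are made up for illustration.

```python
from typing import Sequence


def slice_verdict(instance_probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Average the per-instance scores of one slice and compare to a threshold."""
    if not instance_probs:
        raise ValueError("a slice needs at least one instance")
    mean_prob = sum(instance_probs) / len(instance_probs)
    return mean_prob >= threshold


# A slice where every instance looks like the target speaker.
clean_slice = [0.9, 0.8, 0.95, 0.85]
# A slice that starts with the target speaker but is mostly someone else or noise.
mixed_slice = [0.9, 0.8, 0.1, 0.05, 0.2, 0.1]

print(slice_verdict(clean_slice))  # True
print(slice_verdict(mixed_slice))  # False
```

The mixed slice averages to about 0.36, so it is rejected. That matches the behaviour described above, where a slice containing another speaker or noise ends up as false rather than true.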

Results Waifu!

The algorithm ended up predicting the speaker pretty well. To test it, I used the anime My Youth Romantic Comedy Is Wrong, As I Expected to verify the character Yukino Yukinoshita (雪ノ下 雪乃). The VAD first sliced episode 1 of the anime into 288 wave files. I then manually selected 55 slices and labelled each as either (1,0) for Yukino or (0,1) for not Yukino.

The network ran for about 200 epochs (it had started to converge, but not quite) on batch Adam, and here are the results:

140 is the character Yui Yuigahama.

151 is indeed Yukino Yukinoshita.

155 is Hachiman Hikigaya, a male character who was surprisingly more difficult to distinguish.

Final Thoughts

To be honest, I thought that this would be a much harder task. Yet it turns out that even a simple neural network can classify the speaker just as well.

The next thing in the pipeline for Project Waifu will take a lot longer, as it is much more complicated (or at least I think it is). Although, judging from this project’s experience, I could try an MLP, I am probably going to try a more complicated network such as a CNN or an RNN to get a better understanding of deep learning in general.

That is about all of Project Waifu for now. Hopefully I’ll be able to post the next update in a month or two.

As for anyone who’s wondering, a waifu is basically a female character in visual media (usually from anime, but also from manga and video games) that people consider their wife (thus the name “waifu”). To be extra clear, I’m not a weeb and I don’t have a waifu pillow. No. I like anime, but no, I’m not a weeb. lol
