Over the last few weeks I’ve been working on a deep neural net to predict website credibility (i.e. how “reliable” it is). The features consist of basic website properties, such as its domain, plus a bag-of-words model.
Website credibility is determined by many things, and a lot of the time there isn’t a right or wrong answer. Wikipedia, for example, is a notoriously questionable source because it can be edited by anyone. Nonetheless, Wikipedia does contain a lot of correct information and is still considered unreliable.
Although there is no exact answer, we can often predict credibility through features such as the author, the “purpose” of the text, and even the date. (More can be found here)
To gather data for the website credibility checker, I collected 40 sites that were considered reliable by both students and teachers, and another 40 sites on the same topics that were considered unreliable. I then tried adding sites whose credibility is less clear-cut, such as a Hawaiian vacation package that describes its history and a Reddit post discussing a research paper, hoping that the neural net would learn patterns to solve these harder sites that bag-of-words can fail on. After adding the new harder cases, the network was able to see through a lot of advertisements disguised as academic sources, but failed on some of the original sites. Adding more data may solve this issue, but for now we will stick with the 80 sites.
To turn the 80 URLs into data, I wrote a Python script to scrape the sites using BeautifulSoup. Here is a snippet of the script that counts keyword occurrences:
```python
soup = BeautifulSoup(r.content, "html5lib")

# remove scripts and styles
for script in soup(["script", "style"]):
    script.extract()

text = soup.get_text().lower().split()
for word in keywords:
    urlData.append(text.count(word))
```
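To see the counting step on its own (the keyword list and sample text below are made up for illustration, and the page text is assumed to be already extracted), the same logic works on any word list:

```python
# Hypothetical keywords and text, standing in for the scraped page content
keywords = ["study", "research", "buy", "deal"]

text = "This study builds on earlier research Buy now for the best deal deal".lower().split()

# One count per keyword, in keyword order, just like the snippet above
urlData = [text.count(word) for word in keywords]
print(urlData)  # [1, 1, 1, 2]
```

Each URL thus contributes one row of counts, one column per keyword.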
Deep Neural Network
At the start, I was thinking of using more credibility-determining features, such as the author and the references. Yet, due to the complexity of scraping webpages, these features could not be gathered automatically (regex and other methods were inconsistent). Other than the bag-of-words model, I was only able to obtain the domain, specifically whether it is a .edu/.gov or not, and whether the website has SSL (https). The bag-of-words features were the keywords with the highest ratios of either credible / non-credible or non-credible / credible occurrences, making them the words that accounted for most of the credibility differences in the data.
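The two non-keyword features can be derived from the URL alone. Here is a minimal sketch using Python’s standard library (the function name is mine, not from the original script):

```python
from urllib.parse import urlparse

def url_features(url):
    """Return [is_edu_or_gov, has_ssl] for a URL (illustrative helper)."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    is_edu_gov = int(domain.endswith(".edu") or domain.endswith(".gov"))
    has_ssl = int(parsed.scheme == "https")
    return [is_edu_gov, has_ssl]

print(url_features("https://www.example.edu/article"))  # [1, 1]
print(url_features("http://example.com/deals"))         # [0, 0]
```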
With all features set and a script to automatically scrape sites for them, we now have an [80, 77] CSV dataset for the network.
Quick note: I wrote the network 3 different ways: TensorFlow using sessions, a TensorFlow canned estimator, and sklearn, but for a better understanding, I will be showing code from TensorFlow using sessions.
The neural network consists of 3 hidden layers with [50, 30, 10] neurons. I tried using 2 hidden layers, but after some testing the 3-layer network performs a bit better. The activation function was ReLU, aka the rectified linear unit.
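The forward pass of this architecture can be sketched in NumPy (weights here are randomly initialized stand-ins for the trained parameters, and the input size of 76 is my assumption that one of the 77 columns is the label):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
n_in, hidden, n_out = 76, [50, 30, 10], 2  # 76 assumes 1 of the 77 columns is the label

# Randomly initialized weights/biases, stand-ins for the trained parameters
sizes = [n_in] + hidden + [n_out]
weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def forward(x):
    # ReLU on the three hidden layers, raw logits at the output
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return x @ weights[-1] + biases[-1]

x = rng.normal(size=(80, n_in))  # stand-in for the 80-site dataset
logits = forward(x)
print(logits.shape)  # (80, 2)
```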
In order to prevent overfitting my small dataset, L2 regularization was applied to the cross-entropy cost function, and dropout was also implemented.
```python
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y)
    + reg_rate * tf.nn.l2_loss(weights['hidden1'])
    + reg_rate * tf.nn.l2_loss(weights['hidden2'])
    + reg_rate * tf.nn.l2_loss(weights['hidden3'])
    + reg_rate * tf.nn.l2_loss(weights['output'])
)
```
(It would be slightly more efficient to sum the regularization values first and then multiply the sum by the rate.)
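In NumPy terms (with made-up logits, labels, and weight matrices), the cost amounts to the mean cross-entropy plus `reg_rate` times the summed L2 terms. Note that `tf.nn.l2_loss` computes `sum(w**2) / 2`, which the sketch mirrors:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
logits = rng.normal(size=(4, 2))                       # made-up logits for 4 sites
y = np.eye(2)[[0, 1, 1, 0]]                            # one-hot labels
weights = [rng.normal(size=(3, 3)) for _ in range(4)]  # stand-in weight matrices
reg_rate = 0.1

cross_entropy = -np.sum(y * np.log(softmax(logits)), axis=1)
l2_terms = sum(np.sum(w ** 2) / 2 for w in weights)    # matches tf.nn.l2_loss

# Summing the L2 terms once and then scaling is equivalent to scaling each term
cost = cross_entropy.mean() + reg_rate * l2_terms
```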
After A LOT of testing difference parameters, I decided on the following:
learning rate = 0.01
epochs = 6000
l2 regularization strength = 0.1
dropout rate = 0.2
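One common way to implement the 0.2 dropout rate is “inverted” dropout, sketched here in NumPy (an illustration of the technique, not the original TensorFlow code):

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero out roughly `rate` of the units and rescale
    the survivors, so no adjustment is needed at inference time."""
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones((4, 10))               # stand-in hidden activations
out = dropout(h, rate=0.2, rng=rng)
# surviving units are scaled by 1 / 0.8 = 1.25; dropped units become 0
```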
Despite of the 6000 epochs, the cost of the trained network stayed around 0.5 at the least while the accuracy was around 0.9. The lack of data could be the cause of this as there may not be enough data (or not good enough data) for the network to find a complex pattern.
The network uses a 2-output setup for binary classification because of a weird issue with TensorFlow’s cross-entropy function. To do so, I one-hot encoded the Y array:
```python
y_all = np.eye(2)[y_targets]
```
(where y_targets is the original Y array).
Values of [1, 0, 1], for example, would be encoded into [[0, 1], [1, 0], [0, 1]].
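This encoding is easy to verify directly: indexing the 2×2 identity matrix with the label array picks out row `i` of the identity for class `i`, which is exactly the one-hot vector.

```python
import numpy as np

y_targets = np.array([1, 0, 1])
y_all = np.eye(2)[y_targets]  # row i of the identity matrix = one-hot for class i
print(y_all)
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
```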
The outputs also go through sigmoid to get the probability/confidence of website credibility.
After numerous tweaks and tests, the final neural network has a testing accuracy of around 93.5%. Surprisingly, this network also works on the more complex sites that I mentioned earlier. To show the predictions, here is a list of websites (in no particular order):
The predictions of the network are:
```
[ 0.08280195  0.91719806]
[ 0.0656008   0.93439919]
[ 0.53568166  0.46431834]
[ 0.03955348  0.96044648]
[ 0.78341526  0.21658476]
```
For the first and second websites, the network was pretty confident that they were reliable, which they were. As for the Reddit post, the answer isn’t quite clear, as not many of the keywords appeared in the post. The network nonetheless predicted the answer correctly. Now it gets interesting: the neural net thinks that Wikipedia is a very credible source. In fact, if we look at its data, there are many high-credibility keywords and very few non-credibility ones. Last but not least, the network also predicts the animal removal company’s “fact” page correctly.
Overall, the network performs pretty well and can identify the credibility of most websites. Yet there are many other factors that can affect credibility, such as the tone of the writing and even the information itself, making this network more suitable for a self-check than for a definitive answer. It is not worth risking your grades by relying purely on some complex math equations.
(I’m planning to make a website version of this that will let people test their sites with a single click.)