Detecting news bias with AWS Comprehend
Sun 10 February 2019 by Patrick Pierson
First off, let me say that this is in no way an attempt by me to attack the news. I believe a free press plays a huge part in our freedom. The key takeaway should be that the majority of the headlines (more than 80%) are neutral, but there does seem to be a slight negative bias at each news outlet I pulled (except for The Daily Caller, which leaned positive).
Why did I do this?
We are all guilty of it: judging books by their covers. We read just the headline of a news article and move on without reading the article itself. I wanted to see how news agencies add bias to their headlines. I did this by grabbing the RSS (Really Simple Syndication) feeds of CNN, Fox News, and other news outlets.
How?
To do this, I built a Django app connected to a Postgres database. I used the RSS feeds to pull down the latest articles from each site, and stored only the headline, date and feed ID (that I assigned) for each article. As each headline was pulled, I passed it to AWS Comprehend for sentiment analysis to determine whether the headline was positive, negative, neutral or mixed.
I used the following libraries to do all of the work:
xmltodict - The RSS feeds are in XML. This library converts the XML to Python dictionaries to make it easier for me to parse the data.
requests - The RSS feeds needed to be pulled down off of the internet. Requests made that really easy to do.
boto3 - To interact with the Comprehend service
Django - As a simple framework to glue all of the pieces together and provide an Admin UI to add new Feeds as needed
psycopg2 - To interact with Postgres
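One detail of the xmltodict step is worth a quick illustration. A channel with several item elements is parsed into a list of dictionaries, but a channel with exactly one item comes back as a single dictionary, and an empty channel has no item key at all. The sketch below is a hypothetical helper (extract_titles is a name made up for this example, not part of xmltodict) that accepts the dictionary xmltodict would return and handles all three shapes. The sample dictionaries are hand-written stand-ins for parsed feeds, so the snippet runs without any network access.

```python
def extract_titles(feed_data):
    """Return the headline titles from an xmltodict-style RSS dictionary."""
    channel = (feed_data.get('rss') or {}).get('channel') or {}
    items = channel.get('item') or []
    # A single <item> is parsed as one dict rather than a one-element list.
    if isinstance(items, dict):
        items = [items]
    return [item.get('title') for item in items if item.get('title')]


# Hand-written stand-ins for what xmltodict.parse() returns (assumed shapes).
many = {'rss': {'channel': {'item': [{'title': 'A'}, {'title': 'B'}]}}}
one = {'rss': {'channel': {'item': {'title': 'Only story'}}}}
empty = {'rss': {'channel': {'title': 'Quiet feed'}}}

print(extract_titles(many))
print(extract_titles(one))
print(extract_titles(empty))
```

Normalising the shape in one place keeps the rest of the parsing loop free of special cases.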
Lastly, I had a cron job run every 30 minutes and curl the feeds endpoint of the Django app.
There are tons of great articles that will show you how to set up a Django project. Below is the code I wrote to pull the headlines and pass them to the AWS Comprehend service.
import xmltodict
import requests
import boto3
from datetime import datetime
from .models import Feed, Headline


class News(object):
    def __init__(self):
        self.headlines = []
        self.feeds = self.get_feeds_list()
        for feed in self.feeds:
            self.parse_feed_data(self.get_feed(feed), feed)
        # Stored under a different name so it does not shadow count().
        self.added = self.save_headlines()
        self.client = boto3.client('comprehend')
        self.get_sentiment()

    def get_feeds_list(self):
        # URLs of every feed registered in the database.
        urls = []
        for i in Feed.objects.all():
            urls.append(i.url)
        return urls

    def get_feed(self, feed):
        # Download the RSS XML and convert it to nested dictionaries.
        res = requests.get(feed)
        return xmltodict.parse(res.text)

    def parse_feed_data(self, feed_data, feed):
        # Collect (title, feed URL) pairs for every item in the feed.
        for item in feed_data.get('rss').get('channel').get('item'):
            data = (item.get('title'), feed)
            self.headlines.append(data)

    def save_headlines(self):
        count = 0
        for headline_data in self.headlines:
            if headline_data[1]:
                feed_name = Feed.objects.get(url=headline_data[1])
            else:
                print('Feed not found')
                print(headline_data[1])
                exit()
            # created is True only when no matching row already existed.
            headline, created = Headline.objects.get_or_create(
                title=headline_data[0],
                feed_name=feed_name,
                date=datetime.now()
            )
            if created:
                count += 1
                print('Headline added/updated')
            else:
                print('Headline existed')
        return count

    def get_sentiment(self):
        # Only call Comprehend for headlines that have no sentiment yet.
        for headline in Headline.objects.all():
            if headline.sentiment:
                print('Sentiment for headline: %s is %s' % (headline.title, headline.sentiment))
            else:
                print('No sentiment for headline: %s' % headline.title)
                response = self.client.detect_sentiment(
                    Text=headline.title,
                    LanguageCode='en'
                )
                headline.sentiment = response.get('Sentiment')
                headline.save()
                print('Sentiment for headline: %s is %s' % (headline.title, headline.sentiment))

    def count(self):
        # Number of headlines newly stored during this run.
        return self.added


if __name__ == "__main__":
    news = News()
    print(news.count())
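The class above makes one detect_sentiment call per headline. Comprehend also provides batch_detect_sentiment, which accepts up to 25 documents per request. What follows is a minimal sketch of how batching could look, not the code that produced the results in this post. The names chunks, batch_sentiments and FakeComprehend are made up for the example, and the fake client exists only so the snippet runs without AWS credentials. In real use you would pass boto3.client('comprehend') instead.

```python
def chunks(seq, size):
    """Yield consecutive slices of seq with at most size elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def batch_sentiments(client, titles, batch_size=25):
    """Map each title to its sentiment label, one API call per batch.

    client is expected to behave like boto3.client('comprehend'); it is
    passed in so the function can be tried with a stand-in object.
    """
    results = {}
    for batch in chunks(titles, batch_size):
        response = client.batch_detect_sentiment(
            TextList=batch,
            LanguageCode='en'
        )
        # Index is the position within this batch, not within titles.
        for entry in response.get('ResultList', []):
            results[batch[entry['Index']]] = entry['Sentiment']
    return results


class FakeComprehend(object):
    """Stand-in client that labels every headline NEUTRAL (illustration only)."""

    def __init__(self):
        self.calls = 0

    def batch_detect_sentiment(self, TextList, LanguageCode):
        self.calls += 1
        return {
            'ResultList': [
                {'Index': i, 'Sentiment': 'NEUTRAL'}
                for i in range(len(TextList))
            ],
            'ErrorList': []
        }


fake = FakeComprehend()
titles = ['Headline %d' % n for n in range(60)]
labels = batch_sentiments(fake, titles)
print(len(labels), fake.calls)
```

Documents that Comprehend rejects are reported in the response's ErrorList rather than ResultList, so a fuller version would want to log or retry those entries.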
Results
There seems to be a negative bias on CNN and Fox News.
CNN
For CNN I pulled 7766 headlines. Approximately 9% were negative, 3% were positive and the rest neutral at 88%.
Fox News
For Fox News I pulled 6409 headlines. Approximately 5% were negative, 2% were positive and the rest neutral at 93%.
The Daily Caller
For The Daily Caller I pulled 4169 headlines. Approximately 4% were negative, 5% were positive and the rest neutral at 91%.
National Review
For National Review I pulled 666 headlines. Approximately 7% were negative, 5% were positive and the rest neutral at 88%.
New Republic
For New Republic I pulled 176 headlines. Approximately 11% were negative, 7% were positive, 1% mixed, and the rest neutral at 81%.
NY Times
For NY Times I pulled 4247 headlines. Approximately 4% were negative, 3% were positive and the rest neutral at 93%.
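For anyone wanting to reproduce breakdowns like the ones above, here is one way the percentages could be derived from the stored sentiment labels. It is a sketch under assumptions: sentiment_breakdown and non_neutral_share are hypothetical helper names, the labels are assumed to be the strings Comprehend returns (POSITIVE, NEGATIVE, NEUTRAL, MIXED), and the sample list is made up for illustration rather than taken from my data.

```python
from collections import Counter


def sentiment_breakdown(labels):
    """Return the percentage of headlines per label, rounded to 0.1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


def non_neutral_share(breakdown):
    """Percentage of headlines labelled anything other than NEUTRAL."""
    share = sum(pct for label, pct in breakdown.items() if label != 'NEUTRAL')
    return round(share, 1)


# Made-up labels for illustration; these are not the counts from this post.
sample = ['NEUTRAL'] * 17 + ['NEGATIVE'] * 2 + ['POSITIVE'] * 1
breakdown = sentiment_breakdown(sample)
print(breakdown)
print(non_neutral_share(breakdown))
```

Note that non_neutral_share counts MIXED headlines along with positive and negative ones, which is a choice made for this sketch.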
Closing
I am only trying to make an observation with this project and try out a new AWS service. I could see a follow-up to this project using a comparable Google service, or I could stand up my own sentiment analysis endpoint using OpeNER. My only point is that, in my data, the following percentage of headlines had a positive or negative bias:
CNN: 12.2%
Fox News: 6.8%
The Daily Caller: 8.2%
National Review: 12.2%
New Republic: 18.2%
NY Times: 6.9%