Detecting news bias with AWS Comprehend

Sun 10 February 2019 by Patrick Pierson

First off, let me say that this is in no way an attempt to attack the news. I believe a free press plays a huge part in our freedom. The key takeaway should be that the majority of the headlines (greater than 80%) are neutral, but there does seem to be a slight negative lean at each news outlet I pulled, except for The Daily Caller, which leaned positive.

Why did I do this?

We are all guilty of it: judging books by their cover. We read just the headline of a news article and carry on without reading the article itself. I wanted to see how news agencies add bias to their headlines, so I grabbed the RSS (Really Simple Syndication) feeds of CNN, Fox News, and other news outlets.

How?

To do this, I built a Django app connected to a Postgres database. I used the RSS feeds to pull down the latest articles of each site, and then only stored the headline, date and feed ID (that I assigned) for each article. As the headline was pulled, I passed it to AWS Comprehend for sentiment analysis to show if the headline was positive, negative, neutral or mixed.
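As a rough illustration of that last step, the sketch below sends one headline to Comprehend and reads back the label. `detect_sentiment` with its `Text` and `LanguageCode` parameters is the same boto3 call used in the code further down; the `classify_headline` helper name is hypothetical, and running it for real assumes AWS credentials and a region are already configured.

```python
def classify_headline(client, title, language_code="en"):
    """Return (label, scores) for a single headline.

    `client` is expected to be a boto3 Comprehend client.
    The helper name is hypothetical, not part of the app.
    """
    response = client.detect_sentiment(Text=title, LanguageCode=language_code)
    # Sentiment is one of POSITIVE, NEGATIVE, NEUTRAL or MIXED;
    # SentimentScore holds a confidence value for each of those labels.
    return response["Sentiment"], response.get("SentimentScore", {})
```

With a real client this would be called as `classify_headline(boto3.client('comprehend'), headline)`.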

I used the following libraries to do all of the work:

xmltodict - The RSS feeds are in XML. This library converts the XML to Python dictionaries to make it easier for me to parse the data.
requests - The RSS feeds needed to be pulled down from the internet. Requests made that really easy to do.
boto3 - To interact with the Comprehend service.
Django - As a simple framework to glue all of the pieces together and provide an Admin UI to add new Feeds as needed.
psycopg2 - To interact with Postgres.

Lastly, I had a cron job run every 30 minutes and curl the feeds endpoint of the Django app.

There are tons of great articles that will show you how to set up a Django project. Below is the code I wrote to pull the headlines and pass them to the AWS Comprehend service.

import xmltodict
import requests
import boto3
from datetime import datetime
from .models import Feed, Headline


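class News(object):
    """Pull every feed, store new headlines, then score any unscored ones."""

    def __init__(self):
        self.headlines = []  # (title, feed_url) tuples collected from all feeds
        self.feeds = self.get_feeds_list()
        for feed in self.feeds:
            self.parse_feed_data(self.get_feed(feed), feed)
        # Number of headlines newly written to the database on this run
        self.count = self.save_headlines()
        self.client = boto3.client('comprehend')
        self.get_sentiment()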

    def get_feeds_list(self):
        urls = []
        for i in Feed.objects.all():
            urls.append(i.url)
        return urls

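    def get_feed(self, feed):
        # Time out on slow feeds and raise on HTTP errors rather than parse an error page
        res = requests.get(feed, timeout=30)
        res.raise_for_status()
        return xmltodict.parse(res.text)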

    def parse_feed_data(self, feed_data, feed):
        for item in feed_data.get('rss').get('channel').get('item'):
            data = (item.get('title'), feed)
            self.headlines.append(data)

    def save_headlines(self):
        count = 0
        for headline_data in self.headlines:
            if headline_data[1]:
                feed_name = Feed.objects.get(url=headline_data[1])
            else:
                print('Feed not found')
                print(headline_data[1])
                exit()
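            # Match on title and feed only. The timestamp goes in `defaults` so a
            # fresh datetime.now() on each run doesn't defeat the duplicate check
            # (this assumes `date` is a DateTimeField on the Headline model).
            headline, created = Headline.objects.get_or_create(
                title=headline_data[0],
                feed_name=feed_name,
                defaults={'date': datetime.now()}
            )
            if created:
                count += 1
                print('Headline added')
            else:
                print('Headline existed')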
        return count

    def get_sentiment(self):
        for headline in Headline.objects.all():
            if headline.sentiment:
                print('Sentiment for headline: %s is %s' % (headline.title, headline.sentiment))
            else:
                print('No sentiment for headline: %s' % headline.title)
                response = self.client.detect_sentiment(
                    Text=headline.title,
                    LanguageCode='en'
                )
                headline.sentiment = response.get('Sentiment')
                headline.save()
                print('Sentiment for headline: %s is %s' % (headline.title, headline.sentiment))

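    # Note: no count() method here. __init__ stores the number of new
    # headlines in the `count` attribute, which would shadow such a method.


if __name__ == "__main__":
    news = News()
    print(news.count)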
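
One thing to watch in `parse_feed_data`: xmltodict returns a list when an element repeats but a single dict when it appears only once, so a feed with exactly one `<item>` would be iterated key by key. Below is a minimal sketch of a more defensive version, written as a standalone function over the already-parsed dictionary. The `extract_titles` name is hypothetical and not part of the app above.

```python
def extract_titles(feed_data):
    """Return the item titles from a parsed RSS 2.0 feed dictionary."""
    channel = (feed_data.get("rss") or {}).get("channel") or {}
    items = channel.get("item") or []
    # A lone <item> is parsed as a dict rather than a one-element list.
    if isinstance(items, dict):
        items = [items]
    # Skip items that have no title at all.
    return [item.get("title") for item in items if item.get("title")]
```

Feeds with no items, or documents that are not RSS at all, simply yield an empty list instead of raising.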

Results

There seems to be a negative bias on CNN and Fox News.

CNN

For CNN I pulled 7766 headlines. Approximately 9% were negative, 3% were positive and the rest neutral at 88%.

[Figure: CNN pie chart]

Fox News

For Fox News I pulled 6409 headlines. Approximately 5% were negative, 2% were positive and the rest neutral at 93%.

[Figure: Fox News pie chart]

The Daily Caller

For The Daily Caller I pulled 4169 headlines. Approximately 4% were negative, 5% were positive and the rest neutral at 91%.

[Figure: The Daily Caller pie chart]

National Review

For National Review I pulled 666 headlines. Approximately 7% were negative, 5% were positive and the rest neutral at 88%.

[Figure: National Review pie chart]

New Republic

For New Republic I pulled 176 headlines. Approximately 11% were negative, 7% were positive, 1% mixed, and the rest neutral at 81%.

[Figure: New Republic pie chart]

NY Times

For NY Times I pulled 4247 headlines. Approximately 4% were negative, 3% were positive and the rest neutral at 93%.

[Figure: NY Times pie chart]

Closing

I am only trying to make an observation with this project and try out a new AWS service. I could see a follow-on to this project using a comparable Google service, or I could stand up my own sentiment analysis endpoint using OpeNER. My only point is that, in my data, the following percentage of headlines had a positive or negative bias:

CNN: 12.2%
Fox News: 6.8%
The Daily Caller: 8.2%
National Review: 12.2%
New Republic: 18.2%
NY Times: 6.9%
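
For reference, figures like these can be derived from the stored labels with a simple tally. The sketch below assumes the labels are the uppercase strings Comprehend returns (`POSITIVE`, `NEGATIVE`, `NEUTRAL`, `MIXED`). The function name and the sample list are made up for illustration and are not the data behind the numbers above.

```python
from collections import Counter


def sentiment_breakdown(labels):
    """Return the percentage of each label plus the combined positive/negative share."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    percents = {label: 100.0 * n / total for label, n in counts.items()}
    # "Biased" here means positive or negative, i.e. not neutral or mixed.
    percents["POSITIVE_OR_NEGATIVE"] = 100.0 * (
        counts["POSITIVE"] + counts["NEGATIVE"]
    ) / total
    return percents


# Made-up sample: 8 neutral, 1 negative and 1 positive headline.
sample = ["NEUTRAL"] * 8 + ["NEGATIVE", "POSITIVE"]
print(sentiment_breakdown(sample))
```

Mixed headlines are counted in their own bucket and left out of the combined share, which matches how the percentages in the list above were totalled.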