I struggled to figure out what to start with: do I explain what the project is, or why I got involved? I think it's best to explain what it is first, because I believe it is a pretty important project.

Get Tested COVID 19

This is an Open Source project that is trying to map out all of the COVID-19 testing locations in the United States. The actual site is located at http://gettestedcovid.org/ and the source code is located at https://github.com/Scalabull/get-tested-covid19. I found the project via https://helpwithcovid.com/. So far there are nine contributors, and my part in it is running the Amazon Web Services (AWS) infrastructure that serves the site. I can't stress this enough: the contributors to the project are what make it successful. I think my involvement is providing them a great environment to develop this project on.

Why am I helping

I got involved, to be honest, because I am bored sitting at home. As of writing this I have been working from home for 56 days. This is not my first time working from home, but the last time I worked from home full time, four years ago, I only lasted three weeks before asking my company at the time to pay for https://www.coworkfrederick.com/. My company, IronNet (https://ironnet.com/), closed all of our offices on March 16th and will not open the Frederick office until after the Maryland state of emergency is over. I recently took on a new role at IronNet as Cloud Infrastructure Team Manager, and if I am lucky I get to write code once a week. I am missing that most of all.

Before COVID-19, I attended five to seven tech meetups a month in Frederick, Maryland.

I attended monthly board meetings for the Golden Mile Alliance https://goldenmilealliance.org/, techfrederick https://techfrederick.org/, and the Veterans of Foreign Wars Post 3285 https://www.vfw3285.org/. It is weird for me to type this out, but I have a lot more free time on my hands. So between having a lot of free time and missing out on writing code, I searched for a COVID-19 project that needed someone to help with infrastructure.

What I am doing for the project

I do not really have a defined role in the project. I respond to various questions about AWS, Docker, scaling, and other infrastructure-related things. When I joined the project, the staging and production websites were hosted on Rackspace. I don't have anything against Rackspace; I just know AWS better, and I knew that if the marketing team is going to be successful, a single server in Rackspace is not going to cut it. So I migrated things to AWS.

The website was deployed with Buddy. I also don't have anything against Buddy; I just don't have any experience with it. So I migrated the deployment to CircleCI. Not because I know it better, but because the community behind CircleCI is stronger, which makes it much easier to figure out how to do what I want.

The project organizer, Zach Boldyga, contacted AWS and got us a few thousand dollars in credits. That enabled me to start building the new infrastructure on AWS. The idea I had was to use the AWS Fargate service to run the containers, AWS Application Load Balancers (ALB) to distribute the load to the containers, and AWS Relational Database Service (RDS) to serve up data to the containers. All of this is deployed with Terraform, with CircleCI handling continuous integration. I wanted the site to be easy for any developer to update, and Terraform's documentation is great. I also found that using a few Open Source modules helped get the site up and running faster, even though I had to fork one of the modules to make it work for us.

The architecture of the project is as follows:

All of the project's AWS resources are kept in a single AWS account. This is against AWS best practices, but it seemed fine given that the credits were applied to this account and the architecture is small. I felt I could use environment tags to separate things out well enough, and as long as the Terraform deployments were respected by the team, no changes would accidentally be made to the wrong environment.

Users of the site resolve the location of the ALB from Route 53, AWS's Domain Name System (DNS) service. Once a user is sent to the ALB, it chooses an action based on the hostname being requested. Right now we have two services, web and API. If the hostname is api.get-tested-covid19.org, the user is sent to the API target group. A target group is a collection of endpoints, provided as tasks from the AWS Fargate service. Any other requested hostname uses the default action and goes to the web service, which generates the page. If the user hits the API, it queries the database in RDS and the results are sent back to the user.
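
The routing itself is handled inside the forked Fargate module (more on that below), but to make the idea concrete, a host-based listener rule in plain Terraform looks roughly like this sketch. The resource names here are placeholders, not the module's actual ones:

resource "aws_lb_listener_rule" "api_host" {
  listener_arn = aws_lb_listener.https.arn # hypothetical HTTPS listener on the ALB
  priority     = 10

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn # hypothetical API target group
  }

  condition {
    host_header {
      values = ["api.get-tested-covid19.org"] # only requests for this hostname match this rule
    }
  }
}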

The Virtual Private Cloud (VPC) is set up in three availability zones (AZs) for fault tolerance. This enables us to lose one or two AZs and still serve the site to users. The likelihood of losing two AZs is pretty low. Spread across those AZs are our subnets: we have three public and three private subnets in the VPC so the architecture can span those AZs. Lastly, we have a single Network Address Translation (NAT) gateway. We could pay for three NAT gateways, but if the AZ with the NAT gateway were to go down, that would not take the site down; it would only prevent non-ALB traffic from leaving the VPC, and there shouldn't be much of that. I did this for cost savings.

Docker

We dockerized the environment to enable the developers to use modern development and deployment techniques. Docker Compose is available for developers, and it lets them test out changes like environment variables and API calls to other services locally before pushing them to staging and production. Because they use Docker locally, it is more likely that they will figure out an issue in development, and the fix will translate to staging and production with minimal effort on their part.

Terraform

I chose to use Terraform for two reasons: 1. It has gained a lot of popularity, and the community behind it is great. 2. I need to use it more because of the work being done at my company. You can find the Terraform that deploys the environment at https://github.com/Scalabull/get-tested-covid19/tree/master/infrastructure. I copied IronNet's fork of a Terraform state backend that uses AWS S3 and DynamoDB. This works by saving the Terraform state to an S3 bucket with a filename that matches the environment. DynamoDB is used to "lock" a deployment so that if anyone tries to deploy while another person is deploying, they will have to wait.
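
For anyone who has not used remote state before, a minimal backend configuration of this kind looks roughly like the sketch below. The bucket and table names are placeholders, not the project's real values:

terraform {
  backend "s3" {
    bucket         = "gtcv-terraform-state"      # placeholder S3 bucket that stores the state files
    key            = "staging/terraform.tfstate" # one state file per environment
    region         = "us-east-1"
    dynamodb_table = "gtcv-terraform-locks"      # placeholder DynamoDB table used for state locking
    encrypt        = true
  }
}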

In main.tf all of our AWS resources are created. The infrastructure isn’t big enough in my opinion to warrant multiple Terraform files just yet. We start with defining a few data sources. Data sources allow data to be fetched and stored for use later in the deployment.

data "aws_availability_zones" "available" {
    state = "available" # gets us the availability zonse we can use that are available to our account
}

data "aws_caller_identity" "current" {} # gives us our account information. This helps us build unique resources.

data "aws_region" "current" {} # sets the region we are in.

For locals we just specify a few. Locals are named values that can be reused throughout the configuration.

locals {
    name = "${var.environment}-${var.name}" # sets a local name for use later
    env_dns_prefix = var.environment != "master" ? "${var.environment}." : "" # sets a prefix we use later to distinguish individual dev environments, staging, and production
    image_tag = var.environment == "master" ? "latest" : "staging" # sets the image tag for use later
    common_tags = { # common tags we can use for all of the different resources that support tags
        Name                                  = local.name
        Purpose                               = "gettestedcovid19 infra"
        environment                           = var.environment
    }
}

Next we need a VPC. It would not be horrible to write all of the VPC stuff ourselves, but to get a good start I used a fairly popular module.

module "saas_vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = local.name # using our local name set above
  cidr = var.vpc_cidr # we set a variable for a CIDR range and use it here

  azs             = [data.aws_availability_zones.available.names[0], data.aws_availability_zones.available.names[1], data.aws_availability_zones.available.names[2]] # this is our first use of a data source. In this case we use it to pick the AZs we want to use
  private_subnets = [cidrsubnet(var.vpc_cidr, 3, 0), cidrsubnet(var.vpc_cidr, 3, 1), cidrsubnet(var.vpc_cidr, 3, 2)] # terraform has some built in functions. In this case we are using cidrsubnet to calculate the subnets for our public and private subnets in the vpc. More information found at https://www.terraform.io/docs/configuration/functions/cidrsubnet.html
  public_subnets  = [cidrsubnet(var.vpc_cidr, 3, 3), cidrsubnet(var.vpc_cidr, 3, 4), cidrsubnet(var.vpc_cidr, 3, 5)]

  enable_dns_hostnames = true # this is telling the VPC we want DNS hostnames. AWS can generate the hostnames for instances if this is enabled. I use it because it is a nice feature
  enable_nat_gateway   = true # NAT gateway is a feature AWS added a few years ago to manage the instance that provides internet access to the private subnet.  Previously you had to do this yourself
  single_nat_gateway   = true # Usually you want a NAT gateway in each public subnet to manage its own private subnet but the impact of the NAT going down is minimal

  tags = local.common_tags # sets the common tags we talked about before
}
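
For reference, here is roughly what those cidrsubnet calls work out to, assuming the VPC CIDR is 10.200.0.0/21 (the range referenced later in the RDS allowed_cidr_blocks):

cidrsubnet("10.200.0.0/21", 3, 0) = 10.200.0.0/24 # private subnet, first AZ
cidrsubnet("10.200.0.0/21", 3, 1) = 10.200.1.0/24 # private subnet, second AZ
cidrsubnet("10.200.0.0/21", 3, 2) = 10.200.2.0/24 # private subnet, third AZ
cidrsubnet("10.200.0.0/21", 3, 3) = 10.200.3.0/24 # public subnet, first AZ
cidrsubnet("10.200.0.0/21", 3, 4) = 10.200.4.0/24 # public subnet, second AZ
cidrsubnet("10.200.0.0/21", 3, 5) = 10.200.5.0/24 # public subnet, third AZ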

Now for the meat of the environment. AWS Fargate is a serverless compute engine for containers, and we use it to run ours. I chose Fargate so we do not have to worry about managing the instances that the Docker containers run on; AWS manages that for us. The module I used came with some nice features, but I forked it because of some issues. Namely, it created a bunch of extra resources in AWS CodeBuild that we do not use. Also, each service got its own ALB, which would be costly, so my fork points a single ALB at the correct target group depending on the hostname in the user's request.

module "fargate" {
  source  = "git@github.com:patrickpierson/terraform-aws-fargate.git" # my fork of this module
  name = "gtcv"
  environment = var.environment
  vpc_create = false # this module can create its own vpc but we are not using it
  vpc_external_id = module.saas_vpc.vpc_id # if it doesn't create its own vpc you specify which vpc to use here

  vpc_external_public_subnets_ids = module.saas_vpc.public_subnets # same as above, specify the subnets to use
  vpc_external_private_subnets_ids = module.saas_vpc.private_subnets
  default_action = "${local.env_dns_prefix}get-tested-covid19.org" # I added this in my fork. This was to allow the ALB to have a default action to send it to the correct target

  services = { # defining the services
    www = {
      task_definition = "service-templates/webserver-${local.image_tag}.json" # specifies which service template to use for the task definition
      container_port  = 3000 # the port the container is on
      cpu             = "256" # cpu resources
      memory          = "512" # memory resources
      replicas        = 1 # starting replica count. this changes with alarms later
      logs_retention_days      = 14 # number of days to retain the CloudWatch logs
      health_check_interval = 30 # number of seconds between health checks
      health_check_path     = "/" # path to check for health
      acm_certificate_arn = "arn:aws:acm:us-east-1:${data.aws_caller_identity.current.account_id}:certificate/0713fcea-afdb-4d36-9804-f0ec4a221857" # Amazon Certificate generated outside of Terraform, provides the TLS cert for the ALB to use
      auto_scaling_max_replicas = 50 # max number of replicas for this task. At 4,000 requests per target, 50 replicas covers the 200,000 users we want to be able to handle
      auto_scaling_requests_per_target = 4000 # alarm metric that gets set to scale up the number of replicas
      host = "${local.env_dns_prefix}get-tested-covid19.org" # the host this target group will be sent from the ALB
    }
    api = {
      task_definition = "service-templates/api-${local.image_tag}.json"
      container_port  = 5000
      cpu             = "256"
      memory          = "512"
      replicas        = 1
      registry_retention_count = 15
      logs_retention_days      = 14
      health_check_interval = 30
      health_check_path     = "/ping"
      acm_certificate_arn = "arn:aws:acm:us-east-1:${data.aws_caller_identity.current.account_id}:certificate/0713fcea-afdb-4d36-9804-f0ec4a221857"
      auto_scaling_max_replicas = 50
      auto_scaling_requests_per_target = 4000
      host = "${local.env_dns_prefix}api.get-tested-covid19.org"
    }
  }
}

Next we set up RDS. We went with Aurora with PostgreSQL compatibility because of Aurora's scaling, PostgreSQL's popularity with developers, and its PostGIS capabilities.

module "db" {
  source  = "terraform-aws-modules/rds-aurora/aws" # source is from the Terraform registry
  version = "~> 2.0"

  name                            = local.name

  engine                          = "aurora-postgresql" # sets the RDS engine we want
  engine_version                  = "11.6" # sets the RDS engine version number we want

  vpc_id                          = module.saas_vpc.vpc_id # sets the VPC we created already
  subnets                         = module.saas_vpc.private_subnets # sets the private subnets we want to use

  replica_count                   = 1 # right now we only need one replica
  allowed_cidr_blocks             = ["10.200.0.0/21"] # currently allowing the whole vpc but this could be scaled down
  instance_type                   = "db.t3.medium" # RDS instance size
  storage_encrypted               = true # setting the flag for encryption at rest
  apply_immediately               = true # RDS has a nice feature where changes can be deferred to maintenance windows. We are not using that; changes apply immediately
  monitoring_interval             = 10 # sets the enhanced monitoring granularity, in seconds
  performance_insights_enabled    = true # enabling Performance Insights for additional metrics

  db_parameter_group_name         = local.name
  db_cluster_parameter_group_name = local.name

  enabled_cloudwatch_logs_exports = ["postgresql"]

  tags                            = local.common_tags
}

To access the database I set up a bastion host using a module.

module "bastion" {
  source  = "philips-software/bastion/aws"
  version = "2.0.0"
  enable_bastion = true # this one was weird to me. I am not sure what it does, but I figure we could set it to false and the bastion would go away until needed

  environment = var.environment
  project     = local.name

  aws_region = var.aws_region
  key_name   = "bastion" # outside of Terraform I setup a key name called bastion and saved it to S3 for devs to use
  subnet_id  = element(module.saas_vpc.public_subnets, 0) # this sets the bastion to use only one public subnet
  vpc_id     = module.saas_vpc.vpc_id
  tags = local.common_tags
}

Lastly, I set up another data source and used it to create a local_file resource that gives devs the information they need to connect through the bastion to the RDS database.

data "template_file" "connection_txt" {
  template = file("${path.module}/templates/connection.txt.tpl")

  vars = { # these vars are the various pieces of information that pull from the bastion and rds modules
    environment                     = var.environment
    rds_cluster_endpoint            = module.db.this_rds_cluster_endpoint
    rds_cluster_instance_endpoints  = module.db.this_rds_cluster_instance_endpoints[0]
    rds_cluster_reader_endpoint     = module.db.this_rds_cluster_reader_endpoint
    rds_cluster_master_username     = module.db.this_rds_cluster_master_username
    rds_cluster_master_password     = module.db.this_rds_cluster_master_password
    bastion_public_ip               = module.bastion.public_ip
  }
}

resource "local_file" "connection_txt" {
  sensitive_content = data.template_file.connection_txt.rendered
  filename = "./connection.txt" # generates the local file of connection.txt for the dev to use
}
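
The template itself is not shown in this post, but to give an idea of what it produces, a hypothetical templates/connection.txt.tpl using the variables above might look something like this (the SSH user and key filename are guesses on my part):

# hypothetical sketch of templates/connection.txt.tpl
Environment: ${environment}

SSH to the bastion (the "bastion" key pair, saved in S3 for devs):
  ssh -i bastion.pem ec2-user@${bastion_public_ip}

Postgres endpoints, reachable through the bastion:
  writer:   ${rds_cluster_endpoint}
  reader:   ${rds_cluster_reader_endpoint}
  instance: ${rds_cluster_instance_endpoints}

Credentials:
  username: ${rds_cluster_master_username}
  password: ${rds_cluster_master_password}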

Thanks for reading

I really appreciate people taking the time to read this. The project is only successful because of the people involved. If you have some free time, take a moment to see if you can help out a project like this. Stay safe out there.