Populating an EC2- and Fargate Cross-Compatible ECS Cluster from a Spot Fleet

When Amazon ECS was first released back in April 2015, it left a lot to be desired: tasks and services could only be run on a cluster you managed, clusters had limited support for autoscaling and spot instances, and so on. Amazon filled these gaps over the next couple of years with support for scaling policies, great blog posts on integrating spot fleets with ECS, and now even a wizard cluster builder for EC2-only clusters. But even while the tools improved, you were still managing cluster capacity, which is a pain. ECS really came into its own with the release of Fargate, which allows you to run ECS tasks on Amazon-managed “virtual capacity” in your cluster so you can finally stop counting servers.

While Fargate is great, it still costs more than running on Spot Fleets, and it can take a few minutes to start tasks. This isn’t a problem for most static workloads, but the delay in particular can be irritating for dynamic loads, especially when users are waiting for containers to start and stop. As a result, my team has started keeping a reasonably-sized EC2 Spot Fleet cluster warm, and then using Fargate for overflow. This provides the best possible user experience: most things start quickly and run cheaply, but nothing fails due to insufficient capacity.

The trick with this configuration, though, is that you need to configure your Spot Fleet so that the same tasks can run with LaunchType set to EC2 or FARGATE. It’s not trivial; here are some of our lessons learned.

  • Start by reading this article
  • Both the EC2 and ECS Spot Fleet Builders allow you to export your configuration as a CloudFormation template. The ECS output is an excellent starting point for provisioning Spot Fleet cluster capacity using CloudFormation.
  • Fargate requires tasks to use awsvpc network mode, so we’ll want our cluster to accept tasks run in awsvpc network mode, too. This requires you to run a fairly modern version of the ECS agent, so pick your AMI from this list of the latest releases for each region.
  • This “task networking mode” attaches an Elastic Network Interface (ENI) to each task, and each cluster member will only support a limited number of interfaces. Functionally, this is one more resource you’ll need to track, just like memory and CPU. If you see a failure like RESOURCE:ENI, you’ll know you’ve tried to schedule a task on a box with insufficient networking slots. You’ll need to update either your placement strategy or your cluster.
  • Amazon manages these ENIs and assigns them private IP addresses automatically. That means you’ll need to run them in a private subnet in a VPC. Functionally, this means you’ll need to perform the following steps for the single VPC in which you want your EC2 instances to live:
    • Create a new “public subnet” with “Assign Public IP” set to true
    • Create a new “private subnet” with “Assign Public IP” set to false
    • Create a new NAT Gateway located in the public subnet
    • Create a routing table that routes non-local traffic through the NAT Gateway
    • Assign the routing table to the private subnet
  • Because these instances are in a private subnet, you’ll need to use a “bastion host” to SSH into these hosts to troubleshoot. Essentially, just create a new host in the public subnet from above, SSH into that box, and SSH into your ECS instance from there.
  • Bear in mind that VPC Security Groups include rules for Outgoing traffic as well as Incoming. You’ll need to attach a security group to your ECS instances that opens not only the incoming ports you need on the host (don’t forget SSH!) but also the outgoing ports for your app and the ECS agent. I found this StackOverflow thread to be useful. If you’re having trouble and you suspect networking, it’s worth temporarily allowing all traffic in and out to confirm the source of the issue; if that fixes the problem, you’ll know it’s related to networking, and most likely to ports.
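
The networking steps above can be sketched in CloudFormation roughly as follows. This is a minimal sketch, not the template we run: every parameter and resource name here (PublicRouteTableId, NatEip, EcsInstanceSecurityGroup, and so on) is made up for illustration, the CIDR defaults are placeholders, and it assumes your VPC already has an internet gateway with a public route table pointing at it. The security group assumes the ECS agent and image pulls only need outbound HTTPS; your app may need more.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Sketch - public/private subnet pair with a NAT Gateway for awsvpc tasks.
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  PublicSubnetCidr:
    Type: String
    Default: 10.0.0.0/24   # placeholder; pick a free range in your VPC
  PrivateSubnetCidr:
    Type: String
    Default: 10.0.1.0/24   # placeholder; pick a free range in your VPC
  PublicRouteTableId:
    Type: String
    Description: Assumed existing route table with a default route to the internet gateway
Resources:
  # Public subnet: hosts the NAT Gateway and the bastion host
  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VpcId
      CidrBlock: !Ref PublicSubnetCidr
      MapPublicIpOnLaunch: true
  PublicSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet
      RouteTableId: !Ref PublicRouteTableId
  # Private subnet: hosts the ECS instances and the task ENIs
  PrivateSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VpcId
      CidrBlock: !Ref PrivateSubnetCidr
      MapPublicIpOnLaunch: false
  NatEip:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
  NatGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatEip.AllocationId
      SubnetId: !Ref PublicSubnet
  # Route all non-local traffic from the private subnet through the NAT Gateway
  PrivateRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VpcId
  PrivateDefaultRoute:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGateway
  PrivateSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PrivateSubnet
      RouteTableId: !Ref PrivateRouteTable
  # Both directions are explicit: SSH in from the bastion's subnet, HTTPS out
  EcsInstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: ECS container instances in the private subnet
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: !Ref PublicSubnetCidr
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
Outputs:
  PrivateSubnetId:
    Value: !Ref PrivateSubnet
  EcsInstanceSecurityGroupId:
    Value: !Ref EcsInstanceSecurityGroup
```

The two outputs are the sort of values you would feed into the SubnetIds and SecurityGroupId parameters of a cluster template. Add ingress rules for whatever ports your app listens on.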

Additionally, here is the CloudFormation template we use to set up new stacks. It’s edited from the ECS Spot Fleet wizard builder mentioned above. It allows on-demand ASG cluster building as well as Spot Fleet cluster building.

AWSTemplateFormatVersion: '2010-09-09'
Description: >
  AWS CloudFormation template to create a new VPC
  or use an existing VPC for ECS deployment
  in Create Cluster Wizard. Requires exactly 4
  Instance Types for a Spot Request.
Parameters:
  EcsClusterName:
    Type: String
    Description: >
      Specifies the ECS Cluster Name with which the resources would be
      associated
    Default: default
  EcsAmiId:
    Type: String
    Description: Specifies the AMI ID for your container instances.
  EcsInstanceType:
    Type: CommaDelimitedList
    Description: >
      Specifies the EC2 instance type for your container instances.
      Defaults to m4.large
    Default: m4.large
    ConstraintDescription: must be a valid EC2 instance type.
  KeyName:
    Type: String
    Description: >
      Optional - Specifies the name of an existing Amazon EC2 key pair
      to enable SSH access to the EC2 instances in your cluster.
    Default: ''
  VpcId:
    Type: String
    Description: >
      Optional - Specifies the ID of an existing VPC in which to launch
      your container instances. If you specify a VPC ID, you must specify a list of
      existing subnets in that VPC. If you do not specify a VPC ID, a new VPC is created
      with at least 1 subnet.
    Default: ''
    AllowedPattern: "^(?:vpc-[0-9a-f]{8}|vpc-[0-9a-f]{17}|)$"
    ConstraintDescription: >
      VPC Id must begin with 'vpc-' or leave blank to have a
      new VPC created
  SubnetIds:
    Type: CommaDelimitedList
    Description: >
      Optional - Specifies the Comma separated list of existing VPC Subnet
      Ids where ECS instances will run
    Default: ''
  SecurityGroupId:
    Type: String
    Description: >
      Optional - Specifies the Security Group Id of an existing Security
      Group. Leave blank to have a new Security Group created
    Default: ''
  AsgMaxSize:
    Type: Number
    Description: >
      Specifies the number of instances to launch and register to the cluster.
      Defaults to 1.
    Default: '1'
  IamRoleInstanceProfile:
    Type: String
    Description: >
      Specifies the Name or the Amazon Resource Name (ARN) of the instance
      profile associated with the IAM role for the instance
  EcsEndpoint:
    Type: String
    Description: >
      Optional - Specifies the ECS Endpoint for the ECS Agent to connect to
    Default: ''
  EbsVolumeSize:
    Type: Number
    Description: >
      Optional - Specifies the Size in GBs, of the newly created Amazon
      Elastic Block Store (Amazon EBS) volume
    Default: '0'
  EbsVolumeType:
    Type: String
    Description: Optional - Specifies the Type of (Amazon EBS) volume
    Default: ''
    AllowedValues:
      - ''
      - standard
      - io1
      - gp2
      - sc1
      - st1
    ConstraintDescription: Must be a valid EC2 volume type.
  DeviceName:
    Type: String
    Description: Optional - Specifies the device mapping for the Volume
  UseSpot:
    Type: String
    Default: 'false'
  IamSpotFleetRoleArn:
    Type: String
    Default: ''
  SpotPrice:
    Type: String
    Default: ''
  SpotAllocationStrategy:
    Type: String
    Default: 'diversified'
    AllowedValues:
      - 'lowestPrice'
      - 'diversified'
  UserData:
    Type: String
    Description: Optional base-64 encoded User Data for created instances
    Default: ''
  IsWindows:
    Type: String
    Default: 'false'
Conditions:
  CreateEC2LCWithKeyPair:
    !Not [!Equals [!Ref KeyName, '']]
  SetEndpointToECSAgent:
    !Not [!Equals [!Ref EcsEndpoint, '']]
  CreateWithSpot: !Equals [!Ref UseSpot, 'true']
  CreateWithASG: !Not [!Condition CreateWithSpot]
  CreateWithSpotPrice: !Not [!Equals [!Ref SpotPrice, '']]
  DefaultUserData: !Equals [!Ref UserData, '']
Resources:
  EcsInstanceLc:
    Type: AWS::AutoScaling::LaunchConfiguration
    Condition: CreateWithASG
    Properties:
      ImageId: !Ref EcsAmiId
      InstanceType: !Select [ 0, !Ref EcsInstanceType ]
      AssociatePublicIpAddress: true
      IamInstanceProfile: !Ref IamRoleInstanceProfile
      KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
      SecurityGroups:
        - !Ref SecurityGroupId
      BlockDeviceMappings:
      - DeviceName: !Ref DeviceName
        Ebs:
         VolumeSize: !Ref EbsVolumeSize
         VolumeType: !Ref EbsVolumeType
      UserData: !If
        - DefaultUserData
        - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
        - !Ref UserData
  EcsInstanceAsg:
    Type: AWS::AutoScaling::AutoScalingGroup
    Condition: CreateWithASG
    Properties:
      VPCZoneIdentifier: !Ref SubnetIds
      LaunchConfigurationName: !Ref EcsInstanceLc
      MinSize: '0'
      MaxSize: !Ref AsgMaxSize
      DesiredCapacity: !Ref AsgMaxSize
      Tags:
        -
          Key: Name
          Value: !Sub "ECS Instance - ${AWS::StackName}"
          PropagateAtLaunch: 'true'
        -
          Key: Description
          Value: "This instance is part of the Auto Scaling group which was created through ECS Console"
          PropagateAtLaunch: 'true'
  EcsSpotFleet:
    Condition: CreateWithSpot
    Type: AWS::EC2::SpotFleet
    Properties:
      SpotFleetRequestConfigData:
        AllocationStrategy: !Ref SpotAllocationStrategy
        IamFleetRole: !Ref IamSpotFleetRoleArn
        TargetCapacity: !Ref AsgMaxSize
        SpotPrice: !If [ CreateWithSpotPrice, !Ref SpotPrice, !Ref 'AWS::NoValue' ]
        TerminateInstancesWithExpiration: true
        LaunchSpecifications:
            -
              IamInstanceProfile:
                Arn: !Ref IamRoleInstanceProfile
              ImageId: !Ref EcsAmiId
              InstanceType: !Select [ 0, !Ref EcsInstanceType ]
              KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
              Monitoring:
                Enabled: true
              SecurityGroups:
                - GroupId: !Ref SecurityGroupId
              SubnetId: !Join [ "," , !Ref SubnetIds ]
              BlockDeviceMappings:
                    - DeviceName: !Ref DeviceName
                      Ebs:
                       VolumeSize: !Ref EbsVolumeSize
                       VolumeType: !Ref EbsVolumeType
              UserData: !If
                - DefaultUserData
                - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
                - !Ref UserData
            -
              IamInstanceProfile:
                Arn: !Ref IamRoleInstanceProfile
              ImageId: !Ref EcsAmiId
              InstanceType: !Select [ 1, !Ref EcsInstanceType ]
              KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
              Monitoring:
                Enabled: true
              SecurityGroups:
                - GroupId: !Ref SecurityGroupId
              SubnetId: !Join [ "," , !Ref SubnetIds ]
              BlockDeviceMappings:
                    - DeviceName: !Ref DeviceName
                      Ebs:
                       VolumeSize: !Ref EbsVolumeSize
                       VolumeType: !Ref EbsVolumeType
              UserData: !If
                - DefaultUserData
                - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
                - !Ref UserData
            -
              IamInstanceProfile:
                Arn: !Ref IamRoleInstanceProfile
              ImageId: !Ref EcsAmiId
              InstanceType: !Select [ 2, !Ref EcsInstanceType ]
              KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
              Monitoring:
                Enabled: true
              SecurityGroups:
                - GroupId: !Ref SecurityGroupId
              SubnetId: !Join [ "," , !Ref SubnetIds ]
              BlockDeviceMappings:
                    - DeviceName: !Ref DeviceName
                      Ebs:
                       VolumeSize: !Ref EbsVolumeSize
                       VolumeType: !Ref EbsVolumeType
              UserData: !If
                - DefaultUserData
                - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
                - !Ref UserData
            -
              IamInstanceProfile:
                Arn: !Ref IamRoleInstanceProfile
              ImageId: !Ref EcsAmiId
              InstanceType: !Select [ 3, !Ref EcsInstanceType ]
              KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
              Monitoring:
                Enabled: true
              SecurityGroups:
                - GroupId: !Ref SecurityGroupId
              SubnetId: !Join [ "," , !Ref SubnetIds ]
              BlockDeviceMappings:
                    - DeviceName: !Ref DeviceName
                      Ebs:
                       VolumeSize: !Ref EbsVolumeSize
                       VolumeType: !Ref EbsVolumeType
              UserData: !If
                - DefaultUserData
                - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
                - !Ref UserData
Outputs:
  EcsInstanceAsgName:
    Condition: CreateWithASG
    Description: Auto Scaling Group Name for ECS Instances
    Value: !Ref EcsInstanceAsg
  EcsSpotFleetRequestId:
    Condition: CreateWithSpot
    Description: Spot Fleet Request for ECS Instances
    Value: !Ref EcsSpotFleet
  UsedByECSCreateCluster:
    Description: Flag used by ECS Create Cluster Wizard
    Value: 'true'
  TemplateVersion:
    Description: The version of the template used by Create Cluster Wizard
    Value: '2.0.0'
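
The template above only provisions capacity. On the task side, “cross-compatible” means a task definition that declares both launch types and uses awsvpc networking. Here is a minimal sketch of what that might look like, with a service whose launch type is a parameter. The names (ExampleTaskDefinition, ExampleService, PrivateSubnetIds, TaskSecurityGroupId), the nginx image, and the CPU/memory sizes are placeholders for illustration, not what we run in production.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Sketch - one task definition that runs with LaunchType EC2 or FARGATE.
Parameters:
  EcsClusterName:
    Type: String
    Default: default
  LaunchType:
    Type: String
    Default: EC2
    AllowedValues:
      - EC2
      - FARGATE
  PrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
  TaskSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id
Conditions:
  RunOnEc2: !Equals [!Ref LaunchType, 'EC2']
Resources:
  ExampleTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: cross-compatible-example   # placeholder family name
      RequiresCompatibilities:           # validated against both launch types
        - EC2
        - FARGATE
      NetworkMode: awsvpc                # the only mode Fargate accepts
      Cpu: '256'                         # Fargate needs task-level CPU and memory
      Memory: '512'
      ContainerDefinitions:
        - Name: app
          Image: nginx:latest            # placeholder image
          Essential: true
          PortMappings:
            - ContainerPort: 80
  ExampleService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref EcsClusterName
      TaskDefinition: !Ref ExampleTaskDefinition
      DesiredCount: 1
      LaunchType: !Ref LaunchType
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED       # tasks live in the private subnet
          Subnets: !Ref PrivateSubnetIds
          SecurityGroups:
            - !Ref TaskSecurityGroupId
      # Placement strategies only apply to EC2; spreading across instances
      # helps avoid piling tasks onto one host's ENI slots.
      PlacementStrategies: !If
        - RunOnEc2
        - - Type: spread
            Field: instanceId
        - !Ref 'AWS::NoValue'
```

If you pull images from ECR or use the awslogs log driver, you will also need an ExecutionRoleArn on the task definition. For ad hoc overflow, the same task definition can be passed to RunTask with either launch type.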

Hopefully this example helps others set up cross-compatible EC2-Fargate tasks. It’s not hard, but it requires getting a lot of small things right, as we learned the hard way.

A Year of Mashable — 2012

At work I finally got around to doing a project I’ve been wanting to do for a long time: analyze the sharing behavior of a year’s worth of content at Mashable.

It’s no small project. First, a year’s worth of Mashable content must be collected, which ended up being 13,979 articles in total. Next, the author, publish date, headline, and full text of each post must be extracted from each page, which requires a (fortunately simple) custom scraper to be built. Next, the social resonance data of each article must be collected. For this analysis, I collected share counts for Twitter, Facebook, StumbleUpon, LinkedIn, Google+, and Pinterest, plus clicks from Bitly and per-article submissions from Reddit.

Should I Use an ORM or Not? Sure.

There are a whole lot of strong opinions about ORMs floating around the internet and elsewhere. When you see so many passionate, conflicting opinions in so many different threads, it’s a pretty clear sign you’re looking at a religious argument rather than a rational debate. And, as in any good religious argument (big endian or little endian, butter side up or butter side down, vi or emacs, Team Jacob or Team Edward), this one has two sides, too.

Still a Better Love Story than Twilight.

Complication is What Happens When You Try to Solve a Problem You Don’t Understand

Code should be simple. Code should be butt simple. Code should be so simple that there’s no way it can be misunderstood. Good code has no nooks. Good code has no crannies. Good code is a round room with no corners for bugs to hide in.

We all know this. So why does most code suck?

Because it’s written by people who don’t understand the problem they’re trying to solve.

Learning to Program

Eventually, every programmer blogs about how to become a better programmer. It seems to be the price of admission to the industry. Programmers are a vain lot, and every one of us likes to think he has a unique viewpoint to contribute with insightful advice and meaningful guidance. The reality is that the “learn how to program” post is cliché. There are so many that each new one is nothing more than an echo of some old, vaguely-remembered, proto-learn-how-to-program-post. No one should write another. There’s no point.

So obviously I’m going to write another.

Programming is Exactly Like This

The Origin of Perfect Software

In another post, I claimed that software can’t be written with no bugs at all. Well, it turns out that’s not quite true. What I should have said is that writing bug-free software is not possible within the constraints of most software businesses or open-source projects.

But that just doesn’t have the same pizazz, does it?

The trouble is that software businesses exist to make money, and open source projects exist to give developers interesting things to do and exposure. (Naturally, there are some exceptions in both camps, but if you imagine that’s always true, you won’t be too far off.) And if these are the goals you’re chasing — customers and money, or interesting problems and exposure — you don’t end up with perfect software. You go broke or get bored before you get there.

The Economics of Perfect Software

Ask 100 CEOs of software companies if they want to ship software with bugs. What will they say? 50 won’t answer at all, saying something about how bugs are a huge problem in the industry that needs to be addressed; 40 will say “Of course not!” and promptly call their shark tank in preparation for a lawsuit; 9 will hang their heads and say “we can’t help it”; and that last 1 will look you straight in the eye and say “Absolutely.”

I have no idea what that last guy’s doing heading up a software company, because he studied economics.
