Populating an EC2- and Fargate Cross-Compatible ECS Cluster from a Spot Fleet

When Amazon ECS was first released back in April 2015, it left a lot to be desired: tasks and services could only be run on a cluster you managed, clusters had limited support for autoscaling and spot instances, and so on. Amazon filled these gaps over the next couple of years with support for scaling policies, great blog posts on integrating spot fleets with ECS, and now even a cluster builder wizard for EC2-only clusters. But even as the tools improved, you were still managing cluster capacity, which is a pain. ECS really came into its own with the release of Fargate, which allows you to run ECS tasks on Amazon-managed “virtual capacity” in your cluster so you can finally stop counting servers.

While Fargate is great, it still costs more than running on Spot Fleets, and it can take a few minutes to start tasks. This isn’t a problem for most static workloads, but the delay in particular can be irritating for dynamic loads, especially when users are waiting for containers to start and stop. As a result, my team has started keeping a reasonably-sized EC2 Spot Fleet cluster warm, and then using Fargate for overflow. This provides the best possible user experience: most things start quickly and run cheaply, but nothing fails due to insufficient capacity.

The trick with this configuration, though, is you need to configure your Spot Fleet so that the same tasks can run with LaunchType set to EC2 or FARGATE. It’s not trivial; here are some of our lessons learned.

  • Start by reading this article
  • Both the EC2 and ECS Spot Fleet Builders allow you to export your configuration as a CloudFormation template. The ECS output is an excellent starting point for provisioning Spot Fleet cluster capacity using CloudFormation.
  • Fargate requires tasks to use awsvpc network mode, so we’ll want our cluster to accept tasks run in awsvpc network mode, too. This requires you to run a fairly modern version of the ECS agent, so pick your AMI from this list of the latest releases for each region.
  • This “task networking mode” attaches an Elastic Network Interface (ENI) to each task, and each cluster member will only support a limited number of interfaces. Functionally, this is one more resource you’ll need to track, just like memory and CPU. If you see a failure like RESOURCE:ENI, you’ll know you’ve tried to schedule a task on a box with insufficient networking slots. You’ll need to update either your placement strategy or your cluster.
  • Amazon manages these ENIs and assigns them private IP addresses automatically, so they need to live in a private subnet in a VPC. In practice, you’ll need to perform the following steps for the single VPC in which you want your EC2 instances to live:
    • Create a new “public subnet” with “Assign Public IP” set to true
    • Create a new “private subnet” with “Assign Public IP” set to false
    • Create a new NAT Gateway located in the public subnet
    • Create a routing table that routes non-local content through the NAT Gateway
    • Assign the routing table to the private subnet
  • Because these instances are in a private subnet, you’ll need to use a “bastion host” to SSH into these hosts to troubleshoot. Essentially, just create a new host in the public subnet from above, SSH into that box, and SSH into your ECS instance from there.
  • Bear in mind that VPC Security Groups include rules for outgoing traffic as well as incoming. You’ll need to attach a security group to your ECS instances that opens not only the incoming ports on your host (don’t forget SSH!) but also the outgoing ports for your app and the ECS agent. I found this StackOverflow thread to be useful. If you’re having trouble and you suspect networking, it’s worth temporarily allowing all traffic in and out to confirm the source of the issue; if that fixes the problem, you’ll know it’s related to networking, and most likely to ports.
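
The private-subnet steps and the security group rules from the list above can be sketched as a CloudFormation fragment. This is a minimal sketch, not the template we actually run: the resource names, the CIDR ranges, and the PublicRouteTableId and BastionSecurityGroupId parameters are assumptions made for illustration. It also assumes the VPC already has an internet gateway and a route table that sends public traffic through it.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Sketch of private-subnet networking for awsvpc-mode ECS instances
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  PublicRouteTableId:       # assumption: existing table with a default route to an internet gateway
    Type: String
  BastionSecurityGroupId:   # assumption: security group already attached to your bastion host
    Type: AWS::EC2::SecurityGroup::Id
Resources:
  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VpcId
      CidrBlock: 10.0.0.0/24          # illustrative range; must fit inside your VPC
      MapPublicIpOnLaunch: true       # "Assign Public IP" set to true
  PublicSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet
      RouteTableId: !Ref PublicRouteTableId
  PrivateSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VpcId
      CidrBlock: 10.0.1.0/24          # illustrative range; must fit inside your VPC
      MapPublicIpOnLaunch: false      # "Assign Public IP" set to false
  NatEip:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
  NatGateway:                         # lives in the public subnet
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatEip.AllocationId
      SubnetId: !Ref PublicSubnet
  PrivateRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VpcId
  PrivateDefaultRoute:                # non-local traffic leaves through the NAT Gateway
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: !Ref PrivateRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      NatGatewayId: !Ref NatGateway
  PrivateSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PrivateSubnet
      RouteTableId: !Ref PrivateRouteTable
  EcsInstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: ECS instances in the private subnet
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp             # SSH, but only from the bastion host
          FromPort: 22
          ToPort: 22
          SourceSecurityGroupId: !Ref BastionSecurityGroupId
        # add the incoming ports your app listens on here
      SecurityGroupEgress:
        - IpProtocol: tcp             # HTTPS out, which the ECS agent uses to reach the ECS endpoints
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
        # add the outgoing ports your app needs here
```

Note that once you list any egress rule explicitly, CloudFormation no longer adds the default allow-all outbound rule, which is exactly the situation where a missing outgoing port will bite you.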

Additionally, here is the CloudFormation template we use to set up new stacks. It’s edited from the ECS Spot Fleet wizard builder mentioned above. It allows on-demand ASG cluster building as well as Spot Fleet cluster building.

AWSTemplateFormatVersion: '2010-09-09'
Description: >
  AWS CloudFormation template to create a new VPC
  or use an existing VPC for ECS deployment
  in Create Cluster Wizard. Requires exactly 4
  Instance Types for a Spot Request.
Parameters:
  EcsClusterName:
    Type: String
    Description: >
      Specifies the ECS Cluster Name with which the resources would be
      associated
    Default: default
  EcsAmiId:
    Type: String
    Description: Specifies the AMI ID for your container instances.
  EcsInstanceType:
    Type: CommaDelimitedList
    Description: >
      Specifies the EC2 instance type for your container instances.
      Defaults to m4.large
    Default: m4.large
    ConstraintDescription: must be a valid EC2 instance type.
  KeyName:
    Type: String
    Description: >
      Optional - Specifies the name of an existing Amazon EC2 key pair
      to enable SSH access to the EC2 instances in your cluster.
    Default: ''
  VpcId:
    Type: String
    Description: >
      Optional - Specifies the ID of an existing VPC in which to launch
      your container instances. If you specify a VPC ID, you must specify a list of
      existing subnets in that VPC. If you do not specify a VPC ID, a new VPC is created
      with at least 1 subnet.
    Default: ''
    AllowedPattern: "^(?:vpc-[0-9a-f]{8}|)$"
    ConstraintDescription: >
      VPC Id must begin with 'vpc-' or leave blank to have a
      new VPC created
  SubnetIds:
    Type: CommaDelimitedList
    Description: >
      Optional - Specifies the Comma separated list of existing VPC Subnet
      Ids where ECS instances will run
    Default: ''
  SecurityGroupId:
    Type: String
    Description: >
      Optional - Specifies the Security Group Id of an existing Security
      Group. Leave blank to have a new Security Group created
    Default: ''
  AsgMaxSize:
    Type: Number
    Description: >
      Specifies the number of instances to launch and register to the cluster.
      Defaults to 1.
    Default: '1'
  IamRoleInstanceProfile:
    Type: String
    Description: >
      Specifies the Name or the Amazon Resource Name (ARN) of the instance
      profile associated with the IAM role for the instance
  EcsEndpoint:
    Type: String
    Description: >
      Optional - Specifies the ECS Endpoint for the ECS Agent to connect to
    Default: ''
  EbsVolumeSize:
    Type: Number
    Description: >
      Optional - Specifies the Size in GBs, of the newly created Amazon
      Elastic Block Store (Amazon EBS) volume
    Default: '0'
  EbsVolumeType:
    Type: String
    Description: Optional - Specifies the Type of (Amazon EBS) volume
    Default: ''
    AllowedValues:
      - ''
      - standard
      - io1
      - gp2
      - sc1
      - st1
    ConstraintDescription: Must be a valid EC2 volume type.
  DeviceName:
    Type: String
    Description: Optional - Specifies the device mapping for the Volume
  UseSpot:
    Type: String
    Default: 'false'
  IamSpotFleetRoleArn:
    Type: String
    Default: ''
  SpotPrice:
    Type: String
    Default: ''
  SpotAllocationStrategy:
    Type: String
    Default: 'diversified'
    AllowedValues:
      - 'lowestPrice'
      - 'diversified'
  UserData:
    Type: String
    Description: Optional base-64 encoded User Data for created instances
    Default: ''
  IsWindows:
    Type: String
    Default: 'false'
Conditions:
  CreateEC2LCWithKeyPair:
    !Not [!Equals [!Ref KeyName, '']]
  SetEndpointToECSAgent:
    !Not [!Equals [!Ref EcsEndpoint, '']]
  CreateWithSpot: !Equals [!Ref UseSpot, 'true']
  CreateWithASG: !Not [!Condition CreateWithSpot]
  CreateWithSpotPrice: !Not [!Equals [!Ref SpotPrice, '']]
  DefaultUserData: !Equals [!Ref UserData, '']
Resources:
  EcsInstanceLc:
    Type: AWS::AutoScaling::LaunchConfiguration
    Condition: CreateWithASG
    Properties:
      ImageId: !Ref EcsAmiId
      InstanceType: !Select [ 0, !Ref EcsInstanceType ]
      AssociatePublicIpAddress: true
      IamInstanceProfile: !Ref IamRoleInstanceProfile
      KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
      SecurityGroups: [ !Ref SecurityGroupId ]
      BlockDeviceMappings:
      - DeviceName: !Ref DeviceName
        Ebs:
         VolumeSize: !Ref EbsVolumeSize
         VolumeType: !Ref EbsVolumeType
      UserData: !If
        - DefaultUserData
        - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
        - !Ref UserData
  EcsInstanceAsg:
    Type: AWS::AutoScaling::AutoScalingGroup
    Condition: CreateWithASG
    Properties:
      VPCZoneIdentifier: !Ref SubnetIds
      LaunchConfigurationName: !Ref EcsInstanceLc
      MinSize: '0'
      MaxSize: !Ref AsgMaxSize
      DesiredCapacity: !Ref AsgMaxSize
      Tags:
        -
          Key: Name
          Value: !Sub "ECS Instance - ${AWS::StackName}"
          PropagateAtLaunch: 'true'
        -
          Key: Description
          Value: "This instance is part of the Auto Scaling group which was created through ECS Console"
          PropagateAtLaunch: 'true'
  EcsSpotFleet:
    Condition: CreateWithSpot
    Type: AWS::EC2::SpotFleet
    Properties:
      SpotFleetRequestConfigData:
        AllocationStrategy: !Ref SpotAllocationStrategy
        IamFleetRole: !Ref IamSpotFleetRoleArn
        TargetCapacity: !Ref AsgMaxSize
        SpotPrice: !If [ CreateWithSpotPrice, !Ref SpotPrice, !Ref 'AWS::NoValue' ]
        TerminateInstancesWithExpiration: true
        LaunchSpecifications:
            -
              IamInstanceProfile:
                Arn: !Ref IamRoleInstanceProfile
              ImageId: !Ref EcsAmiId
              InstanceType: !Select [ 0, !Ref EcsInstanceType ]
              KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
              Monitoring:
                Enabled: true
              SecurityGroups:
                - GroupId: !Ref SecurityGroupId
              SubnetId: !Join [ "," , !Ref SubnetIds ]
              BlockDeviceMappings:
                    - DeviceName: !Ref DeviceName
                      Ebs:
                       VolumeSize: !Ref EbsVolumeSize
                       VolumeType: !Ref EbsVolumeType
              UserData: !If
                - DefaultUserData
                - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
                - !Ref UserData
            -
              IamInstanceProfile:
                Arn: !Ref IamRoleInstanceProfile
              ImageId: !Ref EcsAmiId
              InstanceType: !Select [ 1, !Ref EcsInstanceType ]
              KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
              Monitoring:
                Enabled: true
              SecurityGroups:
                - GroupId: !Ref SecurityGroupId
              SubnetId: !Join [ "," , !Ref SubnetIds ]
              BlockDeviceMappings:
                    - DeviceName: !Ref DeviceName
                      Ebs:
                       VolumeSize: !Ref EbsVolumeSize
                       VolumeType: !Ref EbsVolumeType
              UserData: !If
                - DefaultUserData
                - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
                - !Ref UserData
            -
              IamInstanceProfile:
                Arn: !Ref IamRoleInstanceProfile
              ImageId: !Ref EcsAmiId
              InstanceType: !Select [ 2, !Ref EcsInstanceType ]
              KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
              Monitoring:
                Enabled: true
              SecurityGroups:
                - GroupId: !Ref SecurityGroupId
              SubnetId: !Join [ "," , !Ref SubnetIds ]
              BlockDeviceMappings:
                    - DeviceName: !Ref DeviceName
                      Ebs:
                       VolumeSize: !Ref EbsVolumeSize
                       VolumeType: !Ref EbsVolumeType
              UserData: !If
                - DefaultUserData
                - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
                - !Ref UserData
            -
              IamInstanceProfile:
                Arn: !Ref IamRoleInstanceProfile
              ImageId: !Ref EcsAmiId
              InstanceType: !Select [ 3, !Ref EcsInstanceType ]
              KeyName: !If [ CreateEC2LCWithKeyPair, !Ref KeyName, !Ref "AWS::NoValue" ]
              Monitoring:
                Enabled: true
              SecurityGroups:
                - GroupId: !Ref SecurityGroupId
              SubnetId: !Join [ "," , !Ref SubnetIds ]
              BlockDeviceMappings:
                    - DeviceName: !Ref DeviceName
                      Ebs:
                       VolumeSize: !Ref EbsVolumeSize
                       VolumeType: !Ref EbsVolumeType
              UserData: !If
                - DefaultUserData
                - Fn::Base64: !Sub "#!/bin/bash\necho \"ECS_CLUSTER=${EcsClusterName}\" >> /etc/ecs/ecs.config"
                - !Ref UserData
Outputs:
  EcsInstanceAsgName:
    Condition: CreateWithASG
    Description: Auto Scaling Group Name for ECS Instances
    Value: !Ref EcsInstanceAsg
  EcsSpotFleetRequestId:
    Condition: CreateWithSpot
    Description: Spot Fleet Request for ECS Instances
    Value: !Ref EcsSpotFleet
  UsedByECSCreateCluster:
    Description: Flag used by ECS Create Cluster Wizard
    Value: 'true'
  TemplateVersion:
    Description: The version of the template used by Create Cluster Wizard
    Value: '2.0.0'
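
To go with the cluster template above, here is a rough sketch of what a task definition and service that can run under either launch type might look like. It is illustrative only: the LaunchType, ContainerImage, and TaskExecutionRoleArn parameters, the resource names, the container port, and the CPU and memory sizes are all assumptions, while EcsClusterName, SubnetIds, and SecurityGroupId are meant to carry the same values you pass to the template above.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Sketch of a task and service that run on either EC2 or Fargate
Parameters:
  EcsClusterName:
    Type: String
  SubnetIds:                  # the private subnets your cluster instances live in
    Type: CommaDelimitedList
  SecurityGroupId:
    Type: String
  LaunchType:                 # assumption: flip this per deployment to pick the capacity
    Type: String
    Default: EC2
    AllowedValues:
      - EC2
      - FARGATE
  ContainerImage:             # assumption: whatever image your app uses
    Type: String
  TaskExecutionRoleArn:       # assumption: a role that lets ECS pull the image and write logs
    Type: String
Resources:
  ExampleTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: example-cross-compatible
      RequiresCompatibilities:   # validate the task against both launch types
        - EC2
        - FARGATE
      NetworkMode: awsvpc        # required by Fargate, so use it on EC2 as well
      Cpu: '256'                 # Fargate needs task-level CPU and memory
      Memory: '512'
      ExecutionRoleArn: !Ref TaskExecutionRoleArn
      ContainerDefinitions:
        - Name: app
          Image: !Ref ContainerImage
          Essential: true
          PortMappings:
            - ContainerPort: 8080   # illustrative port
  ExampleService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref EcsClusterName
      TaskDefinition: !Ref ExampleTaskDefinition
      LaunchType: !Ref LaunchType
      DesiredCount: 1
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED   # tasks sit in the private subnet behind the NAT Gateway
          Subnets: !Ref SubnetIds
          SecurityGroups:
            - !Ref SecurityGroupId
```

Because the task definition declares both compatibilities, the same definition can be started with LaunchType set to EC2 or FARGATE without any other changes.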

Hopefully this helps others set up cross-compatible EC2-Fargate tasks. It’s not hard, but it requires getting a lot of small things right, as we learned the hard way. With luck, this example makes things simpler for folks.