AWS CDK (v2): Build an AWS Batch environment for FARGATE_SPOT

We will use AWS CDK v2 to build an AWS Batch environment for FARGATE_SPOT. To create an AWS Batch environment, we first create a

  • VPC
  • ComputeEnvironment
  • JobQueue

once for the environment, and a

  • JobDefinition

for each task you want to execute. I was not very familiar with this setup and stumbled a lot, so it took me about a day the first time. Next time I can probably do it in an hour or so.

Preparation

The following preparations are assumed to have been made

Version

  • aws-cdk: 2.20.0

Key Points

What you write depends on the type of ComputeEnvironment

The types of ComputeEnvironment are described in the AWS Batch documentation. Currently there are four types: EC2 | FARGATE | FARGATE_SPOT | SPOT. Which properties can be specified, and which must be specified, seems to change depending on the type of ComputeEnvironment.

In this example, FARGATE_SPOT is used.
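To illustrate how the allowed properties shift with the type, here is a minimal sketch that checks a plain `computeResources` object. The `checkComputeResources` helper and the `ComputeResources` type are hypothetical names made up for this illustration. The rules reflect my reading of the CloudFormation `AWS::Batch::ComputeEnvironment` reference and cover only a few Fargate-related properties, so treat them as an assumption, not a complete list.

```typescript
// The four compute resource types accepted by AWS Batch.
type ComputeResourceType = "EC2" | "SPOT" | "FARGATE" | "FARGATE_SPOT";

// Loose stand-in for the `computeResources` property (illustration only).
type ComputeResources = {
  type: ComputeResourceType;
  maxvCpus: number;
  subnets: string[];
  securityGroupIds?: string[];
  instanceTypes?: string[];
  instanceRole?: string;
  minvCpus?: number;
};

const isFargate = (t: ComputeResourceType): boolean =>
  t === "FARGATE" || t === "FARGATE_SPOT";

// Returns a list of problems; an empty list means these checks passed.
// Assumption: only a handful of Fargate rules are covered; EC2/SPOT rules are omitted.
function checkComputeResources(cr: ComputeResources): string[] {
  const problems: string[] = [];
  if (!isFargate(cr.type)) {
    return problems;
  }
  if (!cr.securityGroupIds || cr.securityGroupIds.length === 0) {
    problems.push("securityGroupIds is required for Fargate types");
  }
  if (cr.instanceTypes !== undefined) {
    problems.push("instanceTypes must not be set for Fargate types");
  }
  if (cr.instanceRole !== undefined) {
    problems.push("instanceRole must not be set for Fargate types");
  }
  if (cr.minvCpus !== undefined) {
    problems.push("minvCpus must not be set for Fargate types");
  }
  return problems;
}
```

Running a check like this before deploying can surface a mismatch earlier than a failed CloudFormation update would.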

ecsTaskExecutionRole

In my case, this role had already been created some time ago; if you do not have one, you need to create it. Instructions are available at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_execution_IAM_role.html.
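For reference, the sketch below writes out the two pieces such a role is made of, as plain data: the trust policy that lets ECS tasks assume the role, and the AWS managed policy that is usually attached to it. The `buildExecutionRoleArn` helper is a hypothetical name for this illustration; it only formats an ARN and does not create anything. In CDK the same role could presumably be declared with `Role` and `ServicePrincipal` from `aws-cdk-lib/aws-iam`, but that is not shown here.

```typescript
// AWS managed policy normally attached to the ECS task execution role.
const EXECUTION_POLICY_ARN =
  "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy";

// Trust policy that allows ECS tasks (Fargate jobs run as ECS tasks) to assume the role.
const executionRoleTrustPolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Principal: { Service: "ecs-tasks.amazonaws.com" },
      Action: "sts:AssumeRole",
    },
  ],
};

// Hypothetical helper: formats the role ARN for a given account.
// The default role name matches the one used in this article.
function buildExecutionRoleArn(
  accountId: string,
  roleName: string = "ecsTaskExecutionRole"
): string {
  if (!/^\d{12}$/.test(accountId)) {
    throw new Error("accountId must be a 12-digit AWS account ID");
  }
  return `arn:aws:iam::${accountId}:role/${roleName}`;
}
```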

VPC

You can either create a new one or use an existing one. The code below shows both approaches.
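When an existing VPC is looked up with `Vpc.fromLookup()`, the stack needs an explicit account and region. The sketch below shows one way to collect them from environment variables. `resolveStackEnv` and `StackEnv` are hypothetical names for this illustration; `CDK_DEFAULT_ACCOUNT` and `CDK_DEFAULT_REGION` are the variables the CDK CLI sets when it runs the app.

```typescript
// Shape of the `env` stack property (account and region).
type StackEnv = { account: string; region: string };

// Reads account and region from a map of environment variables and
// fails early if either is missing, since the VPC lookup cannot work without them.
function resolveStackEnv(vars: Record<string, string | undefined>): StackEnv {
  const account = vars["CDK_DEFAULT_ACCOUNT"];
  const region = vars["CDK_DEFAULT_REGION"];
  if (!account || !region) {
    throw new Error("Vpc.fromLookup() needs an explicit account and region");
  }
  return { account, region };
}

// In bin/awsbatch.ts this would be passed to the stack roughly as:
//   new SampleAWSBatchStack(app, "SampleAWSBatchStack", {
//     env: resolveStackEnv(process.env),
//   });
```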

executionRoleArn and jobRoleArn

  • executionRoleArn is the minimum Role required to start the Batch job (e.g. to pull an image).
  • jobRoleArn is the Role used by the code running inside the container when it needs to call AWS services itself (e.g. to read from S3).
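The split between the two roles can be sketched as a small lookup. The action names and the `roleFor` and `roleProperties` helpers are hypothetical, chosen only for this illustration; the point is that `jobRoleArn` can be left out entirely when the container never calls AWS itself.

```typescript
// Things that need AWS permissions during a job's lifetime (illustrative names).
type JobAction = "pull-image" | "write-container-logs" | "call-aws-from-app-code";

// Which role each action relies on.
function roleFor(action: JobAction): "executionRoleArn" | "jobRoleArn" {
  // Only calls made by the application code inside the container use the job role.
  return action === "call-aws-from-app-code" ? "jobRoleArn" : "executionRoleArn";
}

type RoleProperties = { executionRoleArn: string; jobRoleArn?: string };

// Builds the role-related keys of containerProperties.
// The jobRoleArn key is omitted when no job role is given.
function roleProperties(executionRoleArn: string, jobRoleArn?: string): RoleProperties {
  return jobRoleArn === undefined
    ? { executionRoleArn }
    : { executionRoleArn, jobRoleArn };
}
```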

assignPublicIp

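If the job runs in a subnet with no other route to the registry (for example, a public subnet without a NAT gateway), you will get an error unless you set this to ENABLED, because the container image cannot be pulled. The error message differs depending on where the image is hosted. If the image is on docker.io, it looks like this: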

CannotPullContainerError: inspect image has been retried 5 time(s):
failed to resolve ref "docker.io/library/busybox:latest": failed to do request:
Head https://registry-1.docker.io/v2/library/busybox/manifests/latest: dial tcp 54.85.133.123:443: i/o t...

If the image is in ECR, it looks like this:

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed:
unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError:
send request failed caused by: Post https://api.ecr....

It took me quite a while to solve the problem because I could not work out the cause from these messages.
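Since neither message mentions the public IP, a rough triage helper can at least point in the right direction. This is a heuristic sketch only: `looksLikeMissingInternetRoute` and `assignPublicIpFor` are hypothetical names, the matching is based solely on the two messages quoted above, and a `CannotPullContainerError` can also come from other causes such as a wrong image name.

```typescript
// Heuristic: does the job's failure reason resemble the two messages quoted above?
function looksLikeMissingInternetRoute(statusReason: string): boolean {
  return (
    statusReason.indexOf("CannotPullContainerError") === 0 ||
    statusReason.indexOf("unable to retrieve ecr registry auth") !== -1
  );
}

type SubnetKind = "PUBLIC" | "PRIVATE_WITH_NAT";

// Suggested assignPublicIp value per subnet kind.
// A task in a public subnet reaches the internet only through its own public IP;
// behind a NAT gateway the public IP is not needed.
function assignPublicIpFor(kind: SubnetKind): "ENABLED" | "DISABLED" {
  return kind === "PUBLIC" ? "ENABLED" : "DISABLED";
}
```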

platformCapabilities

If you do not specify this, you will be stuck with a JobDefinition that cannot be executed on a ComputeEnvironment of type FARGATE_SPOT (i.e., its jobs cannot be submitted to the JobQueue).
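The reason is that a job definition without `platformCapabilities` is treated as an EC2 one. The following sketch models that default; `canRunOn` and `effectivePlatforms` are hypothetical helpers written for this illustration, not part of any AWS library.

```typescript
type Platform = "EC2" | "FARGATE";
type ComputeType = "EC2" | "SPOT" | "FARGATE" | "FARGATE_SPOT";

// A job definition with no platformCapabilities defaults to EC2.
function effectivePlatforms(platformCapabilities?: Platform[]): Platform[] {
  return platformCapabilities && platformCapabilities.length > 0
    ? platformCapabilities
    : ["EC2"];
}

// True when a job definition with these capabilities matches the compute environment type.
function canRunOn(computeType: ComputeType, platformCapabilities?: Platform[]): boolean {
  const needed: Platform =
    computeType === "FARGATE" || computeType === "FARGATE_SPOT" ? "FARGATE" : "EC2";
  return effectivePlatforms(platformCapabilities).indexOf(needed) !== -1;
}
```

With this model, `canRunOn("FARGATE_SPOT")` is false while `canRunOn("FARGATE_SPOT", ["FARGATE"])` is true, which matches the behavior described above.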

Code

// lib/awsbatch-stack.ts
import { aws_batch, Stack, StackProps } from "aws-cdk-lib";
import { IVpc, SecurityGroup, SubnetType, Vpc } from "aws-cdk-lib/aws-ec2";
import { Construct } from "constructs";

// Base name for the resources created in this stack
const STACK_BASE_NAME = "SampleAWSBatch";

// Specify if using an existing VPC
const VPC_ID = "vpc-12345678";

// If there is no `ecsTaskExecutionRole`, it must be created.
const DEFAULT_EXEC_ROLE_ARN =
  "arn:aws:iam::<<AWS_ACCOUNT_ID>>:role/ecsTaskExecutionRole";

export class SampleAWSBatchStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // https://docs.aws.amazon.com/cdk/api/v1/docs/aws-batch-readme.html

    ///////////////////////////////////////////////////////////////////
    // Prepare VPC
    ///////////////////////////////////////////////////////////////////
    let vpc: IVpc;

    if (!VPC_ID) {
      // When creating a new VPC
      vpc = new Vpc(this, `${STACK_BASE_NAME}VPC`, {
        cidr: "10.9.0.0/16", // 172.16.0.0/16 or whatever.
        subnetConfiguration: [
          {
            name: `${STACK_BASE_NAME}Subnet`,
            subnetType: SubnetType.PUBLIC,
            cidrMask: 18,
          },
        ],
      });
    } else {
      // If you are using an existing VPC.
      // Note: `Vpc.fromLookup()` requires the account and region to be set,
      // either via `env` in `bin/awsbatch.ts` or via environment variables when running cdk.
      vpc = Vpc.fromLookup(this, "VPC", {
        vpcId: VPC_ID,
      });
    }

    ///////////////////////////////////////////////////////////////////
    // Security Group in the VPC
    ///////////////////////////////////////////////////////////////////
    const securityGroup = new SecurityGroup(this, `${STACK_BASE_NAME}SG`, {
      vpc: vpc,
    });

    ///////////////////////////////////////////////////////////////////
    // ComputeEnvironment type=FARGATE_SPOT
    ///////////////////////////////////////////////////////////////////
    const fargateSpotEnvironment = new aws_batch.CfnComputeEnvironment(
      this,
      `${STACK_BASE_NAME}ComputeEnvironment`,
      {
        type: "MANAGED",
        computeEnvironmentName: STACK_BASE_NAME,
        computeResources: {
          type: "FARGATE_SPOT",
          maxvCpus: 64,
          subnets: vpc.publicSubnets.map((x) => x.subnetId), // List of SubnetId
          securityGroupIds: [securityGroup.securityGroupId],
        },
      }
    );

    ///////////////////////////////////////////////////////////////////
    // Create JobQueue
    ///////////////////////////////////////////////////////////////////
    const jobQueue = new aws_batch.CfnJobQueue(
      this,
      `${STACK_BASE_NAME}JobQueue`,
      {
        jobQueueName: STACK_BASE_NAME,
        computeEnvironmentOrder: [
          {
            computeEnvironment:
              fargateSpotEnvironment.attrComputeEnvironmentArn,
            order: 1,
          },
        ],
        priority: 1,
      }
    );

    ///////////////////////////////////////////////////////////////////
    // Create JobDefinitions
    ///////////////////////////////////////////////////////////////////
    const jobs: { [key: string]: string } = {}; // repoUri -> JobDefArn
    for (const setting of CONTAINER_JOB_SETTINGS) {
      const jobDef = new aws_batch.CfnJobDefinition(
        this,
        `${setting.jobName}JobDef`,
        {
          type: "container",
          jobDefinitionName: setting.jobName,
          platformCapabilities: ["FARGATE"], // Note: If FARGATE is not specified, it will not run in a FARGATE environment.
          containerProperties: {
            image: setting.imageUri,
            executionRoleArn: DEFAULT_EXEC_ROLE_ARN,
            jobRoleArn: setting.jobRoleArn,
            resourceRequirements: [
              { type: "MEMORY", value: String(setting.memory) },
              { type: "VCPU", value: String(setting.vcpu) },
            ],
            networkConfiguration: {
              assignPublicIp: "ENABLED", // Note: Without it, you cannot access ECR.
            },
          },
          retryStrategy: {
            attempts: 1,
          },
        }
      );
      jobs[setting.imageUri] = jobDef.ref;
    }
  }
}

export type ContainerJobSetting = {
  imageUri: string;
  jobName: string;
  jobRoleArn?: string;
  memory: number; // in MiB
  vcpu: number;
};

/**
 * JobDefinition information
 * Note that the combinations of Memory and CPU that can be specified are limited to the following.
 * https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-batch-jobdefinition-resourcerequirement.html
 */
const CONTAINER_JOB_SETTINGS: ContainerJobSetting[] = [
  {
    imageUri: "busybox",
    jobName: "HelloWorld",
    memory: 512,
    vcpu: 0.25,
  },
];
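Because only certain memory and vCPU combinations are accepted, it can help to check the settings before deploying. The sketch below is a minimal example of such a check. `isValidFargateSize` and the table are hypothetical names for this illustration, and the table lists only the combinations I am aware of up to 4 vCPU, so it may be incomplete; the linked CloudFormation page is the authority.

```typescript
// Inclusive range helper: range(2048, 4096, 1024) gives [2048, 3072, 4096].
function range(from: number, to: number, step: number): number[] {
  const values: number[] = [];
  for (let v = from; v <= to; v += step) {
    values.push(v);
  }
  return values;
}

// Memory values (MiB) allowed for each vCPU value on Fargate.
// Assumption: partial table, limited to 4 vCPU; check the linked documentation.
const FARGATE_MEMORY_BY_VCPU: Record<string, number[]> = {
  "0.25": [512, 1024, 2048],
  "0.5": [1024, 2048, 3072, 4096],
  "1": range(2048, 8192, 1024),
  "2": range(4096, 16384, 1024),
  "4": range(8192, 30720, 1024),
};

// True when the (vcpu, memory) pair appears in the table above.
function isValidFargateSize(vcpu: number, memory: number): boolean {
  const allowed = FARGATE_MEMORY_BY_VCPU[String(vcpu)];
  return allowed !== undefined && allowed.indexOf(memory) !== -1;
}
```

The HelloWorld setting above (0.25 vCPU, 512 MiB) passes this check, while 0.25 vCPU with 4096 MiB does not.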

Afterword

It’s not a big deal once you figure it out, but it’s quite a challenge to get there.