AWS Batch

AWS Batch is a set of batch management capabilities that enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU- or memory-optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters. AWS Batch plans, schedules, and executes customers' batch computing workloads using Amazon EC2 and Spot Instances.

Batch computing is the execution of a series of programs (“jobs”) on one or more computers without manual intervention. Input parameters are pre-defined through scripts, command-line arguments, control files, or job control language. A given batch job may depend on the completion of preceding jobs, or on the availability of certain inputs, making the sequencing and scheduling of multiple jobs important, and incompatible with interactive processing.

Batch computing is used by developers, scientists, and engineers to access large amounts of compute resources. AWS Batch can efficiently provision resources in response to submitted jobs in order to eliminate capacity constraints, reduce compute costs, and deliver results quickly. It automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads.

  • AWS Batch dynamically provisions the optimal quantity and type of compute resources (such as CPU- or memory-optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.
  • AWS Batch plans, schedules, and executes clients' batch computing workloads across the full range of AWS compute services, such as Amazon EC2 and Spot Instances.

AWS Batch Benefits

AWS Batch handles job execution and compute resource management, allowing customers to focus on developing applications or analyzing results instead of setting up and managing infrastructure. For anyone considering running or moving batch workloads to AWS, AWS Batch is a strong choice.

AWS Batch is optimized for batch computing and applications that scale through the execution of multiple jobs in parallel. Deep learning, genomics analysis, financial risk models, Monte Carlo simulations, animation rendering, media transcoding, image processing, and engineering simulations are all excellent examples of batch computing applications.

AWS Batch can shift job processing to periods when more capacity is available or capacity is less expensive. It avoids idling compute resources that would otherwise require frequent manual intervention and supervision. It increases efficiency by driving higher utilization of compute resources. It enables the prioritization of jobs, aligning resource allocation with business objectives.

AWS Batch provisions compute resources and optimizes the job distribution based on the volume and resource requirements of the submitted batch jobs. AWS Batch dynamically scales compute resources to any quantity required to run batch jobs, freeing users from the constraints of fixed-capacity clusters. AWS Batch will use Spot Instances on users' behalf, further reducing the cost of running batch jobs.

AWS Batch Components

Job Definitions: AWS Batch job definitions specify how jobs are to be run. While each job must reference a job definition, many of the parameters that are specified in the job definition can be overridden at runtime. The following are some of those attributes (a registration sketch follows the list below):

  • Which Docker image to use with the container in your job
  • How many vCPUs and how much memory to use with the container
  • The command the container should run when it is started
  • What (if any) environment variables should be passed to the container when it starts
  • Any data volumes that should be used with the container
  • What (if any) IAM role the job should use for AWS permissions.
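
As an illustration of these attributes, here is a minimal boto3 sketch of registering a job definition. The job name, container image, IAM role ARN, and resource values are placeholder assumptions, not values from this document.

    import boto3

    batch = boto3.client("batch")

    response = batch.register_job_definition(
        jobDefinitionName="example-render-job",    # hypothetical name
        type="container",
        containerProperties={
            # Docker image the container runs (placeholder registry/repository)
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/render:latest",
            "vcpus": 2,                             # vCPUs for the container
            "memory": 4096,                         # memory in MiB
            # Command the container runs when it is started
            "command": ["python3", "render.py"],
            # Environment variables passed to the container at startup
            "environment": [{"name": "LOG_LEVEL", "value": "INFO"}],
            # IAM role the job uses for AWS permissions (placeholder ARN)
            "jobRoleArn": "arn:aws:iam::123456789012:role/example-batch-job-role",
            # Data volumes used with the container
            "volumes": [{"name": "scratch", "host": {"sourcePath": "/tmp/scratch"}}],
            "mountPoints": [{"sourceVolume": "scratch", "containerPath": "/scratch"}],
        },
        retryStrategy={"attempts": 2},              # optional retry configuration
    )
    print(response["jobDefinitionArn"])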

Job Queue: Jobs are submitted to a job queue, where they reside until they can be scheduled to run in a compute environment. An AWS account can have multiple job queues (a queue-creation sketch follows the list below).

  • Customers can create a queue that uses Amazon EC2 On-Demand instances for high priority jobs and another queue that uses Amazon EC2 Spot Instances for low-priority jobs. 
  • Job queues have a priority that is used by the scheduler to determine which jobs in which queue should be evaluated for execution first.
  • The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue. Jobs run in approximately the order in which they are submitted as long as all dependencies on other jobs have been met.
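
The On-Demand/Spot split described above might look like the following boto3 sketch. The queue names and compute environment ARNs are placeholder assumptions.

    import boto3

    batch = boto3.client("batch")

    # High-priority queue backed by a hypothetical On-Demand compute environment.
    batch.create_job_queue(
        jobQueueName="high-priority",
        state="ENABLED",
        priority=100,   # higher number = evaluated first by the scheduler
        computeEnvironmentOrder=[
            {"order": 1,
             "computeEnvironment": "arn:aws:batch:us-east-1:123456789012:compute-environment/ondemand-ce"},
        ],
    )

    # Low-priority queue backed by a hypothetical Spot compute environment.
    batch.create_job_queue(
        jobQueueName="low-priority-spot",
        state="ENABLED",
        priority=1,
        computeEnvironmentOrder=[
            {"order": 1,
             "computeEnvironment": "arn:aws:batch:us-east-1:123456789012:compute-environment/spot-ce"},
        ],
    )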

Node Group: A node group is an identical group of job nodes, where all nodes share the same container properties. AWS Batch lets customers specify up to five distinct node groups for each job. 

  • Each group can have its own container images, commands, environment variables, and so on.
  • Alternatively, all of the nodes in a job can be used as a single node group, and the application code can differentiate node roles such as main node versus child node.

Array Job: An array job is a job that shares common parameters, such as the job definition, vCPUs, and memory. It runs as a collection of related, yet separate, basic jobs that may be distributed across multiple hosts and may run concurrently. Array jobs are the most efficient way to execute embarrassingly parallel jobs such as Monte Carlo simulations, parametric sweeps, or large rendering jobs (a submission sketch follows the list below).

  • AWS Batch array jobs are submitted just like regular jobs. However, the array size needs to be between 2 and 10,000 to define how many child jobs should run in the array.
  • If the submitted job has an array size of 1000, a single job runs and spawns 1000 child jobs. The array job is a reference or pointer to manage all the child jobs. This allows customers to submit large workloads with a single query.
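
A minimal boto3 sketch of submitting an array job with 1,000 child jobs follows; the job name, queue, and job definition are placeholder assumptions. Each child job receives its index in the AWS_BATCH_JOB_ARRAY_INDEX environment variable, which the application can use to pick its portion of the work.

    import boto3

    batch = boto3.client("batch")

    response = batch.submit_job(
        jobName="monte-carlo-sweep",            # hypothetical job name
        jobQueue="high-priority",               # hypothetical queue
        jobDefinition="example-render-job",     # hypothetical job definition
        arrayProperties={"size": 1000},         # spawns child jobs with indexes 0..999
    )

    # The returned ID refers to the parent array job; the child jobs are
    # addressable as <jobId>:0 through <jobId>:999.
    print(response["jobId"])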

Multi-Node Parallel Jobs: Multi-node parallel jobs are used to run single jobs that span multiple Amazon EC2 instances. Batch multi-node parallel jobs can run large-scale, tightly coupled, high performance computing applications and distributed GPU model training without the need to launch, configure, and manage Amazon EC2 resources directly. An AWS Batch multi-node parallel job is compatible with any framework that supports IP-based, internode communication, such as Apache MXNet, TensorFlow, Caffe2, or Message Passing Interface (MPI). A job definition sketch follows the list below.

  • Multi-node parallel jobs are submitted as a single job; the job definition (or the node overrides supplied at submission) specifies the number of nodes to create for the job and which node groups to create.
  • Each multi-node parallel job contains a main node, which needs to be launched first. Once the main node is up and running, the child nodes will be launched and started.
    • If the main node exits, the job is considered finished, and the child nodes are stopped. For more information, see Node Groups.
  • Multi-node parallel job nodes are single-tenant, meaning that only a single job container is run on each Amazon EC2 instance.
  • Each multi-node parallel job contains a main node, which is a single subtask that AWS Batch monitors to determine the outcome of the submitted multi-node job.
    • The main node is launched first and it moves to the STARTING status.
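
A minimal boto3 sketch of a multi-node parallel job definition with a main-node group and a child-node group follows; the image, commands, and resource values are placeholder assumptions.

    import boto3

    batch = boto3.client("batch")

    batch.register_job_definition(
        jobDefinitionName="mpi-training-job",    # hypothetical name
        type="multinode",
        nodeProperties={
            "numNodes": 4,
            "mainNode": 0,                       # AWS Batch monitors this node for the job outcome
            "nodeRangeProperties": [
                {
                    "targetNodes": "0",          # node group containing only the main node
                    "container": {
                        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/mpi:latest",
                        "vcpus": 8,
                        "memory": 16384,
                        "command": ["mpirun", "--role", "main"],
                    },
                },
                {
                    "targetNodes": "1:3",        # node group for the child nodes
                    "container": {
                        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/mpi:latest",
                        "vcpus": 8,
                        "memory": 16384,
                        "command": ["mpirun", "--role", "worker"],
                    },
                },
            ],
        },
    )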

AWS Batch Features

AWS Batch supports GPU scheduling, which allows customers to specify the number and type of accelerators their jobs require as job definition input variables (a sketch follows the list below).

  • A Graphics Processing Unit (GPU) is a processor designed to handle graphics operations. This includes both 2D and 3D calculations, though GPUs primarily excel at rendering 3D graphics.
  • AWS Batch will scale up instances appropriate for the customers' jobs based on the required number of GPUs and isolate the accelerators according to each job's needs, so only the appropriate containers can access them.
  • All instance types in a compute environment that will run GPU jobs should be from the p2, p3, g3, g3s, or g4 instance families. If this is not done, a GPU job could get stuck in the RUNNABLE status.
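
A minimal boto3 sketch of a job definition that requests one GPU per container follows; the image and resource values are placeholder assumptions, and the compute environment is assumed to use one of the GPU instance families listed above.

    import boto3

    batch = boto3.client("batch")

    batch.register_job_definition(
        jobDefinitionName="gpu-training-job",    # hypothetical name
        type="container",
        containerProperties={
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/train:latest",
            "vcpus": 4,
            "memory": 16384,
            "command": ["python3", "train.py"],
            "resourceRequirements": [
                {"type": "GPU", "value": "1"},   # number of accelerators the job requires
            ],
        },
    )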

AWS Batch allows customers to specify resource requirements, such as vCPU and memory, AWS Identity and Access Management (IAM) roles, volume mount points, container properties, and environment variables, to define how jobs are to be run. AWS Batch executes the jobs as containerized applications running on Amazon ECS. It also enables customers to define dependencies between different jobs (a dependency sketch follows the list below).

  • AWS Batch displays key operational metrics for the batch jobs in the AWS Management Console. You can view metrics related to compute capacity, as well as running, pending, and completed jobs. 
  • AWS Batch uses IAM to control and monitor the AWS resources that your jobs can access, such as Amazon DynamoDB tables.
  • Through IAM, customers can define policies for different users in their organization. For example, admins can be granted full access permissions to any AWS Batch API operation.
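
For the job dependencies mentioned above, a minimal boto3 sketch follows; the job names, queue, and job definition are placeholder assumptions.

    import boto3

    batch = boto3.client("batch")

    # First job: hypothetical preprocessing step.
    prep = batch.submit_job(
        jobName="preprocess-data",
        jobQueue="high-priority",
        jobDefinition="example-render-job",
    )

    # Second job: only becomes RUNNABLE after the first job succeeds.
    batch.submit_job(
        jobName="analyze-data",
        jobQueue="high-priority",
        jobDefinition="example-render-job",
        dependsOn=[{"jobId": prep["jobId"]}],
    )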

AWS Batch efficiently and dynamically provisions and scales Amazon EC2 and Spot Instances based on the requirements of the jobs. Customers who elect to provision and manage their own compute resources within AWS Batch unmanaged compute environments can apply custom configurations, such as larger EBS volumes (a launch template sketch follows the list below).

  • EC2 Launch Templates reduce the number of steps required to configure Batch environments by capturing launch parameters within one resource.
  • AWS Batch enables customers to build customized templates for their compute resources and lets Batch scale instances to meet those requirements.
  • AWS Batch allows customers to specify an EC2 launch template to add storage volumes, specify network interfaces, or configure permissions, among other capabilities.
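
A minimal boto3 sketch of a managed compute environment that references a pre-built launch template follows; the launch template name, subnets, security groups, and role ARNs are placeholder assumptions.

    import boto3

    batch = boto3.client("batch")

    batch.create_compute_environment(
        computeEnvironmentName="custom-storage-ce",   # hypothetical name
        type="MANAGED",
        state="ENABLED",
        computeResources={
            "type": "EC2",
            "minvCpus": 0,
            "maxvCpus": 256,
            "instanceTypes": ["optimal"],
            "subnets": ["subnet-0example"],
            "securityGroupIds": ["sg-0example"],
            "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
            # Hypothetical launch template that, for example, attaches a larger EBS volume.
            "launchTemplate": {
                "launchTemplateName": "batch-large-ebs",
                "version": "$Latest",
            },
        },
        serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
    )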

AWS Batch supports multi-node parallel jobs, which enables users to run single jobs that span multiple EC2 instances. This feature allows AWS customers to use AWS Batch to easily and efficiently run workloads such as large-scale, tightly-coupled High Performance Computing (HPC) applications or distributed GPU model training. 

  • AWS Batch also supports Elastic Fabric Adapter, a network interface that enables users to run applications that require high levels of inter-node communication at scale on AWS.
  • AWS Batch can be integrated with commercial and open-source workflow engines and languages such as Pegasus WMS, Luigi, Nextflow, Metaflow, Apache Airflow, and AWS Step Functions, which enable to use familiar workflow languages to model the batch computing pipelines.

AWS Batch allows customers to choose among three methods (allocation strategies) for allocating compute resources; a configuration sketch follows the three descriptions below.

Best Fit: AWS Batch selects an instance type that best fits the needs of the jobs with a preference for the lowest-cost instance type. If additional instances of the selected instance type are not available, AWS Batch will wait for the additional instances to be available. If there are not enough instances available, then additional jobs will not be run until currently running jobs have completed. 

Best Fit Progressive: AWS Batch will select additional instance types that are large enough to meet the requirements of the jobs in the queue, with a preference for instance types with a lower cost per unit vCPU. If additional instances of the previously selected instance types are not available, AWS Batch will select new instance types.

Spot Capacity Optimized: AWS Batch will select one or more instance types that are large enough to meet the requirements of the jobs in the queue, with a preference for instance types that are less likely to be interrupted. 
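
A minimal boto3 sketch of a Spot compute environment that uses the Spot capacity optimized strategy follows; the other strategies are chosen the same way by changing allocationStrategy. Subnets, security groups, and role ARNs are placeholder assumptions.

    import boto3

    batch = boto3.client("batch")

    batch.create_compute_environment(
        computeEnvironmentName="spot-optimized-ce",   # hypothetical name
        type="MANAGED",
        state="ENABLED",
        computeResources={
            "type": "SPOT",
            # Prefer instance pools that are less likely to be interrupted;
            # "BEST_FIT" or "BEST_FIT_PROGRESSIVE" select the other strategies.
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,
            "maxvCpus": 512,
            "instanceTypes": ["optimal"],
            "subnets": ["subnet-0example"],
            "securityGroupIds": ["sg-0example"],
            "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        },
        serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
    )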

AWS Batch enables clients to set up multiple queues with different priority levels. Batch jobs are stored in the queues until compute resources are available to execute the job.

  • The AWS Batch scheduler evaluates when, where, and how to run jobs that have been submitted to a queue based on the resource requirements of each job.
  • The scheduler evaluates the priority of each queue and runs jobs in priority order on optimal compute resources (e.g., memory vs CPU optimized), as long as those jobs have no outstanding dependencies.

AWS Batch Jobs

Jobs are the units of work executed by AWS Batch; they run as containerized applications on Amazon ECS container instances in an ECS cluster. Containerized jobs can reference a container image, command, and parameters.

When customers submit a job to an AWS Batch job queue, the job enters the SUBMITTED state. It then passes through the following states until it succeeds (exits with code 0) or fails (exits with a non-zero code). AWS Batch jobs can have the following states (a short polling sketch follows the state descriptions):

SUBMITTED: A job that has been submitted to the queue, and has not yet been evaluated by the scheduler. The scheduler evaluates the job to determine if it has any outstanding dependencies on the successful completion of any other jobs. If there are dependencies, the job is moved to PENDING. If there are no dependencies, the job is moved to RUNNABLE.

PENDING: A job that resides in the queue and is not yet able to run due to a dependency on another job or resource. After the dependencies are satisfied, the job is moved to RUNNABLE.

RUNNABLE: A job that resides in the queue, has no outstanding dependencies, and is therefore ready to be scheduled to a host. Jobs in this state are started as soon as sufficient resources are available in one of the compute environments that are mapped to the job’s queue. However, jobs can remain in this state indefinitely when sufficient resources are unavailable.

STARTING: These jobs have been scheduled to a host and the relevant container initiation operations are underway. After the container image is pulled and the container is up and running, the job transitions to RUNNING.

RUNNING: The job is running as a container job on an Amazon ECS container instance within a compute environment. When the job’s container exits, the process exit code determines whether the job succeeded or failed. An exit code of 0 indicates success, and any non-zero exit code indicates failure. If the job associated with a failed attempt has any remaining attempts left in its optional retry strategy configuration, the job is moved to RUNNABLE again.

SUCCEEDED: The job has successfully completed with an exit code of 0. The job state for SUCCEEDED jobs is persisted in AWS Batch for 24 hours.

FAILED: The job has failed all available attempts. The job state for FAILED jobs is persisted in AWS Batch for 24 hours.
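
A minimal boto3 sketch of submitting a job and polling it until it reaches a terminal state follows; the job name, queue, and job definition are placeholder assumptions.

    import time

    import boto3

    batch = boto3.client("batch")

    job = batch.submit_job(
        jobName="state-demo",
        jobQueue="high-priority",
        jobDefinition="example-render-job",
    )

    while True:
        detail = batch.describe_jobs(jobs=[job["jobId"]])["jobs"][0]
        # status is one of SUBMITTED, PENDING, RUNNABLE, STARTING, RUNNING,
        # SUCCEEDED, or FAILED.
        print(detail["status"])
        if detail["status"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(30)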

The diagram shows a VPC that has been configured with subnets in multiple Availability Zones. 1A, 2A, and 3A are instances in the VPC.

AWS Batch Integration

Pegasus WMS

Pegasus WMS is a scientific workflow management system that can manage the execution of complex workflows on distributed resources. It is funded by the National Science Foundation. Scientific workflows allow users to easily express multi-step computational tasks, for example retrieving data from an instrument or a database, reformatting the data, and running an analysis. A scientific workflow describes the dependencies between the tasks, and in most cases the workflow is described as a directed acyclic graph (DAG), where the nodes are tasks and the edges denote the task dependencies.

  • The Pegasus project encompasses a set of technologies that help workflow-based applications execute in a number of different environments including desktops, campus clusters, grids, and clouds. 
  • Pegasus bridges the scientific domain and the execution environment by automatically mapping high-level workflow descriptions onto distributed resources.
  • It automatically locates the necessary input data and computational resources necessary for workflow execution.
  • Pegasus enables scientists to construct workflows in abstract terms without worrying about the details of the underlying execution environment or the particulars of the low-level specifications required by the middleware (Condor, Globus, or Amazon EC2).
  • Pegasus also bridges the current cyberinfrastructure by effectively coordinating multiple distributed resources.
  • Pegasus WMS has been used in a number of scientific domains including astronomy, bioinformatics, earthquake science, gravitational wave physics, ocean science, limnology, and others.
Luigi

Luigi is a workflow management system to efficiently launch a group of tasks with defined dependencies between them. There are two fundamental building blocks of Luigi – the Task class and the Target class. Both are abstract classes and expect a few methods to be implemented. In addition to those two concepts, the Parameter class is an important concept that governs how a Task is run. A small sketch follows the list below.

  • Target: The Target class corresponds to a file on a disk, a file on HDFS or some kind of a checkpoint, like an entry in a database. Actually, the only method that Targets have to implement is the exists method which returns True if and only if the Target exists.

  • Task: The Task class is a bit more conceptually interesting because this is where computation is done. There are a few methods that can be implemented to alter its behavior, most notably run(), output(), and requires().

  • Luigi is not only a Python-based API that builds and executes pipelines of Hadoop jobs; it can also be used to create workflows with external jobs written in R, Scala, or Spark.
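
To illustrate the Target, Task, and Parameter concepts above, here is a minimal Luigi sketch; the file paths and the date parameter value are illustrative assumptions.

    import datetime

    import luigi

    class FetchData(luigi.Task):
        date = luigi.DateParameter()    # Parameter governing how the Task is run

        def output(self):
            # Target: exists() is satisfied once this file is on disk.
            return luigi.LocalTarget(f"data/raw-{self.date}.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("raw,data\n")

    class Summarize(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return FetchData(date=self.date)    # dependency between tasks

        def output(self):
            return luigi.LocalTarget(f"data/summary-{self.date}.txt")

        def run(self):
            with self.input().open() as raw, self.output().open("w") as out:
                out.write(f"{len(raw.readlines())} lines\n")

    if __name__ == "__main__":
        luigi.build([Summarize(date=datetime.date(2024, 1, 1))], local_scheduler=True)
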
Nextflow

Nextflow is a relatively lightweight Java application that a single user can easily manage. A Nextflow workflow makes it easy to run an analysis while transparently managing the issues that tend to crop up when running a shell script: missing dependencies, not enough resources, failures that are hard to trace, and results that are not easily published or transferred to collaborators.

  • Nextflow has robust reporting features, real-time status updates, and workflow failure or completion handling with notifications via a user-definable on-completion process.
Apache Airflow

Apache Airflow, also known simply as Airflow, is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative. A minimal DAG sketch follows the list below.

  • Customers can use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. 
  • The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. 
  • Rich command line utilities make performing complex surgeries on DAGs a snap. 
  • It has a rich user interface that makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
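
A minimal Airflow DAG sketch follows, assuming Airflow 2.x-style imports; the DAG ID, schedule, and commands are illustrative.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="batch_pipeline_example",      # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extracting data")
        transform = BashOperator(task_id="transform", bash_command="echo transforming data")

        # transform runs only after extract succeeds.
        extract >> transform
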
Step Functions

AWS Step Functions enables customers to coordinate multiple AWS services into serverless workflows, so that they can build and update apps quickly. A sketch of a state machine that submits an AWS Batch job follows the list below.

  • Using Step Functions, customers can design and run workflows that stitch together services, such as AWS Lambda, AWS Fargate, and Amazon SageMaker, into feature-rich applications. 
  • Workflows are made up of a series of steps, with the output of one step acting as input into the next.
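
Here is a minimal boto3 sketch of a Step Functions state machine that submits an AWS Batch job and waits for it to complete via the batch:submitJob.sync integration; the state machine name, role ARN, queue, and job definition are placeholder assumptions.

    import json

    import boto3

    sfn = boto3.client("stepfunctions")

    definition = {
        "StartAt": "RunBatchJob",
        "States": {
            "RunBatchJob": {
                "Type": "Task",
                # Submits the Batch job and waits for it to finish.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "nightly-report",
                    "JobQueue": "high-priority",
                    "JobDefinition": "example-render-job",
                },
                "End": True,
            }
        },
    }

    sfn.create_state_machine(
        name="batch-workflow-example",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/example-stepfunctions-role",
    )
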
Metaflow

Metaflow is a human-friendly Python library that helps scientists and engineers build and manage real-life data science projects. A minimal flow sketch follows the list below.

  • Metaflow provides a unified API to the infrastructure stack that is required to execute data science projects, from prototype to production.
  • Metaflow was originally developed at Netflix to boost productivity of data scientists who work on a wide variety of projects from classical statistics to state-of-the-art deep learning.
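
A minimal Metaflow sketch follows; the flow name and steps are illustrative. With Metaflow's AWS integration configured, steps like these can be pushed to AWS Batch (for example with the --with batch run option).

    from metaflow import FlowSpec, step

    class HelloFlow(FlowSpec):

        @step
        def start(self):
            # Produce some data for the next step.
            self.numbers = list(range(10))
            self.next(self.summarize)

        @step
        def summarize(self):
            print("sum:", sum(self.numbers))
            self.next(self.end)

        @step
        def end(self):
            print("done")

    if __name__ == "__main__":
        HelloFlow()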

AWS Batch Use cases

Financial Services

Financial Services organizations, from fintech startups to longstanding enterprises, have been utilizing batch processing in areas such as high performance computing for risk management, end-of-day trade processing, and fraud surveillance. Users can use AWS Batch to minimize human error, increase speed and accuracy, and reduce costs with automation.

High performance computing: The Financial Services industry has advanced the use of high performance computing in areas such as pricing, market positions, and risk management. By taking these compute-intensive workloads onto AWS, organizations have increased speed, scalability, and cost-savings.
  • With AWS Batch, organizations can automate the resourcing and scheduling of these jobs to save costs and accelerate decision-making and go-to-market speeds.
 
Post-trade analytics: Trading desks are constantly looking for opportunities to improve their positions by analyzing the day’s transaction costs, execution reporting, and market performance, among other areas. All of this requires batch processing of large data sets from multiple sources after the trading day closes.
  • AWS Batch enables the automation of these workloads so that users can understand the pertinent risk going into the next day’s trading cycle and make better decisions based on data.
 
Fraud surveillance: Fraud is an ongoing concern impacting all industries, especially Financial Services. Amazon Machine Learning enables more intelligent ways to analyze data using algorithms and models to combat this challenge.
  • When used in conjunction with AWS Batch, organizations can automate the data processing or analysis required to detect irregular patterns in users’ data that could be an indicator of fraudulent activity such as money laundering and payments fraud.

Life sciences

The scientific insight that allows Biopharmaceutical and Genomics companies to bring products to market demands high performance computing environments. AWS Batch can be applied throughout an organization in applications such as computational chemistry, clinical modelling, molecular dynamics, and genomic sequencing testing and analysis.

Drug screening: AWS Batch allows research scientists involved in drug discovery to more efficiently and rapidly search libraries of small molecules in order to identify those structures which are most likely to bind to a drug target, typically a protein receptor or enzyme.
  • By doing this, scientists can capture better data to begin drug design and have a deeper understanding for the role of a particular biochemical process, which could potentially lead to the development of more efficacious drugs and therapies.
DNA sequencing: After bioinformaticians complete their primary analysis of a genomic sequence to produce the raw files, they can use AWS Batch to complete their secondary analysis.
  • With AWS Batch, customers can simplify and automate the assembly of the raw DNA reads into a complete genomic sequence by comparing the multiple overlapping reads and the reference sequence, as well as potentially reduce data errors caused by incorrect alignment between the reference and the sample.

Digital media

Media and Entertainment companies require highly scalable batch computing resources to enable accelerated and automated processing of data as well as the compilation and processing of files, graphics, and visual effects for high-resolution video content. Use AWS Batch to accelerate content creation, dynamically scale media packaging, and automate asynchronous media supply chain workflows.

Rendering: AWS Batch provides content producers and post-production houses with tools to automate content rendering workloads and reduces the need for human intervention due to execution dependencies or resource scheduling. This includes the scaling of compute cores in a render farm, utilizing Spot Instances, and coordinating the execution of disparate steps in the process.
 
Transcoding: AWS Batch accelerates batch and file-based transcoding workloads by automating workflows, overcoming resource bottlenecks, and reducing the number of manual processes by scheduling and monitoring the execution of asynchronous processes, then triggering conditional responses to scale resources for a given workload when necessary.
 
Media supply chain: AWS Batch simplifies complex media supply chain workflows by coordinating the execution of disparate and dependent jobs at different stages of processing, and supports a common framework for managing content preparation for different contributors to the media supply chain.

CloudWatch Events

Users can use the AWS Batch event stream for CloudWatch Events to receive near real-time notifications regarding the current state of jobs that have been submitted to their job queues. Using CloudWatch Events, users can monitor the progress of jobs, build custom AWS Batch workflows with complex dependencies, generate usage reports or metrics around job execution, or build their own custom dashboards. AWS Batch and CloudWatch Events eliminate the scheduling and monitoring code that would otherwise continuously poll AWS Batch for job status changes. Instead, AWS Batch job state changes can be handled asynchronously using any CloudWatch Events target, such as AWS Lambda, Amazon Simple Queue Service, Amazon Simple Notification Service, or Amazon Kinesis Data Streams. A rule-creation sketch follows the list below.

  • Events from the AWS Batch event stream are ensured to be delivered at least one time. In the event that duplicate events are sent, the event provides enough information to identify duplicates.
  • AWS Batch jobs are available as CloudWatch Events targets. Using simple rules that you can quickly set up, you can match events and submit AWS Batch jobs in response to them. 
  • AWS Batch sends job status change events to CloudWatch Events. AWS Batch tracks the state of your jobs. If a previously submitted job’s status changes, an event is triggered, for example, if a job in the RUNNING status moves to the FAILED status. These events are classified as job state change events.
  • Amazon CloudWatch Events delivers a near real-time stream of system events that describe changes in Amazon Web Services resources. AWS Batch jobs are available as CloudWatch Events targets.
  • Users can configure jobs to send log information to CloudWatch Logs, which enables them to view the logs from different jobs in one convenient location.
  • AWS Batch is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in AWS Batch. CloudTrail captures all API calls for AWS Batch as events. The calls captured include calls from the AWS Batch console and code calls to the AWS Batch API operations.
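
A minimal boto3 sketch of a rule that matches Batch job state change events for FAILED jobs and forwards them to an SNS topic follows; the rule name and topic ARN are placeholder assumptions.

    import json

    import boto3

    events = boto3.client("events")

    # Match AWS Batch job state change events where the job moved to FAILED.
    events.put_rule(
        Name="batch-job-failed",
        State="ENABLED",
        EventPattern=json.dumps({
            "source": ["aws.batch"],
            "detail-type": ["Batch Job State Change"],
            "detail": {"status": ["FAILED"]},
        }),
    )

    # Send matching events to a hypothetical SNS topic.
    events.put_targets(
        Rule="batch-job-failed",
        Targets=[{"Id": "notify", "Arn": "arn:aws:sns:us-east-1:123456789012:batch-alerts"}],
    )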
