Amazon Redshift

Amazon Redshift is the most widely used cloud data warehouse. It makes it fast, simple, and cost-effective to analyze all of your data using standard SQL and existing Business Intelligence (BI) tools. It allows users to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. Most results come back in seconds. With Redshift, users can start small for just $0.25 per hour with no commitments and scale out to petabytes of data for $1,000 per terabyte per year, less than a tenth the cost of traditional on-premises solutions. Amazon Redshift also includes Amazon Redshift Spectrum, allowing users to run SQL queries directly against exabytes of unstructured data in Amazon S3 data lakes.

  • In Amazon Redshift, no loading or transformation is required, and customers can use open data formats, including Avro, CSV, Grok, Amazon Ion, JSON, ORC, Parquet, RCFile, RegexSerDe, Sequence, Text, Hudi, Delta and TSV.
  • Redshift Spectrum automatically scales query compute capacity based on the data retrieved, so queries against Amazon S3 run fast, regardless of data set size.
  • Amazon Redshift gives fast querying capabilities over structured data using familiar SQL-based clients and business intelligence (BI) tools over standard ODBC and JDBC connections. Queries are distributed and parallelized across multiple physical resources.

Amazon Redshift Benefits

Redshift Spectrum enables users to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required. When a user issues a query, it goes to the Amazon Redshift SQL endpoint, which generates and optimizes a query plan. Amazon Redshift determines what data is local and what is in Amazon S3, generates a plan to minimize the amount of Amazon S3 data that needs to be read, and requests Redshift Spectrum workers out of a shared resource pool to read and process data from Amazon S3. Redshift Spectrum scales out to thousands of instances if needed, so queries run quickly regardless of data size.
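
As an illustration, the following is a minimal sketch of how a Spectrum query might be set up. The schema, table, bucket, and IAM role names are hypothetical, and the external schema is assumed to be backed by an AWS Glue Data Catalog.

    -- Register an external schema backed by the AWS Glue Data Catalog (names are examples)
    CREATE EXTERNAL SCHEMA spectrum_demo
    FROM DATA CATALOG
    DATABASE 'demo_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Define an external table over Parquet files in Amazon S3
    CREATE EXTERNAL TABLE spectrum_demo.clickstream (
        event_time TIMESTAMP,
        user_id    BIGINT,
        page       VARCHAR(256)
    )
    STORED AS PARQUET
    LOCATION 's3://example-bucket/clickstream/';

    -- Join S3 data with a local Redshift table in a single query
    SELECT u.segment, COUNT(*) AS events
    FROM spectrum_demo.clickstream c
    JOIN users u ON u.user_id = c.user_id
    GROUP BY u.segment;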

AQUA is a new distributed and hardware-accelerated cache that enables Redshift queries to run up to 10x faster than other cloud data warehouses. Existing data warehousing architectures with centralized storage require data to be moved to compute clusters for processing. AQUA takes a new approach to cloud data warehousing: it brings the compute to the storage by doing a substantial share of data processing in-place on the innovative cache. In addition, it uses AWS-designed processors and a scale-out architecture to accelerate data processing beyond anything traditional CPUs can do today.

 

Amazon Redshift integrates with AWS CloudTrail to enable users to audit all Redshift API calls. Redshift logs all SQL operations, including connection attempts, queries, and changes to the data warehouse. Users can access these logs using SQL queries against system tables, or choose to save the logs to a secure location in Amazon S3. CloudTrail provides a record of API operations taken by an IAM user, role, or AWS service in Amazon Redshift. Using the information collected by CloudTrail, users can determine the request that was made to Amazon Redshift, the IP address from which the request was made, who made the request, when it was made, and additional details.
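
For example, recently run statements and connection attempts can be reviewed with SQL against system tables such as STL_QUERY and STL_CONNECTION_LOG. This is a minimal sketch; the columns selected may vary by Redshift version.

    -- Review recent SQL statements recorded by Redshift
    SELECT query, userid, starttime, endtime, TRIM(querytxt) AS sql_text
    FROM stl_query
    ORDER BY starttime DESC
    LIMIT 20;

    -- Review recent connection attempts
    SELECT event, recordtime, username, remotehost
    FROM stl_connection_log
    ORDER BY recordtime DESC
    LIMIT 20;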

Amazon Redshift manages the work needed to set up, operate, and scale a data warehouse: provisioning the infrastructure capacity, automating ongoing administrative tasks such as backups and patching, and monitoring nodes and drives to recover from failures. Redshift also has automatic tuning capabilities, and surfaces recommendations for managing the warehouse in Redshift Advisor. For Redshift Spectrum, Amazon Redshift manages all the computing infrastructure, load balancing, planning, scheduling, and execution of user queries on data stored in Amazon S3.

Amazon Redshift Features

Amazon Redshift lets users quickly and simply work with the data in open formats, and easily integrates with and connects to the AWS ecosystem.

Query and export data to and from your data lake: No other cloud data warehouse makes it as easy to both query data and write data back to a data lake in open formats. Users can query open file formats such as Parquet, ORC, JSON, Avro, CSV, and more directly in S3 using familiar American National Standards Institute (ANSI) SQL. This gives users the flexibility to store highly structured, frequently accessed data in a Redshift data warehouse, while also keeping up to exabytes of structured, semi-structured, and unstructured data in S3.
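
As a sketch of the "write data back to the data lake" direction, UNLOAD can export query results to Amazon S3 in an open format such as Parquet. The bucket, table, and IAM role below are hypothetical.

    -- Export query results back to the data lake in Apache Parquet format
    UNLOAD ('SELECT order_id, order_date, total FROM orders WHERE order_date >= ''2020-01-01''')
    TO 's3://example-bucket/exports/orders_2020_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    FORMAT AS PARQUET;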

Federated Query: With the new federated query capability in Redshift, users can reach into their operational, relational databases. They can query live data across one or more Amazon RDS for PostgreSQL and Aurora PostgreSQL databases (and, in preview, RDS for MySQL and Aurora MySQL) to get instant visibility into end-to-end business operations without requiring data movement. A minimal setup sketch follows the list below.
  • Users can join data from the Redshift data warehouse, data in the data lake, and now data in the operational stores to make better data-driven decisions.
  • Redshift offers sophisticated optimizations to reduce data moved over the network and complements it with its massively parallel data processing for high-performance queries.
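
The following is a minimal sketch of setting up a federated query against an Aurora PostgreSQL database; the host, secret, and role identifiers are placeholders, and the warehouse table dim_customer is assumed to exist.

    -- Map an Aurora PostgreSQL database into Redshift as an external schema
    CREATE EXTERNAL SCHEMA ops_pg
    FROM POSTGRES
    DATABASE 'operations' SCHEMA 'public'
    URI 'aurora-cluster.cluster-example.us-east-1.rds.amazonaws.com' PORT 5432
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
    SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:ops-db-creds';

    -- Join live operational data with warehouse tables in one query
    SELECT o.order_id, o.status, d.customer_segment
    FROM ops_pg.orders o
    JOIN dim_customer d ON d.customer_id = o.customer_id
    WHERE o.status = 'OPEN';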

Redshift ML (preview): Redshift ML is a new capability for Amazon Redshift that makes it easy for data analysts and database developers to create, train, and deploy Amazon SageMaker models using SQL. With Amazon Redshift ML, users can use SQL statements to create and train Amazon SageMaker models on their data in Amazon Redshift and then use those models for predictions such as churn detection and risk scoring directly in their queries and reports. Visit the Redshift documentation to learn how to get started.
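
As a hedged sketch (the feature was in preview at the time of writing), a model might be created and used roughly as follows; the table, column, bucket, and role names are hypothetical.

    -- Train a SageMaker model from warehouse data using SQL
    CREATE MODEL customer_churn
    FROM (SELECT age, plan, monthly_spend, support_calls, churned
          FROM customer_activity
          WHERE signup_date < '2020-01-01')
    TARGET churned
    FUNCTION predict_churn
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
    SETTINGS (S3_BUCKET 'example-redshift-ml-bucket');

    -- Use the generated prediction function directly in queries and reports
    SELECT customer_id,
           predict_churn(age, plan, monthly_spend, support_calls) AS churn_risk
    FROM customer_activity
    WHERE signup_date >= '2020-01-01';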

AWS analytics ecosystem: Native integration with the AWS analytics ecosystem makes it easier to handle end-to-end analytics workflows without friction. 

  • AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. 
  • AWS Glue can extract, transform, and load (ETL) data into Redshift.
  • Amazon Kinesis Data Firehose is the easiest way to capture, transform, and load streaming data into Redshift for near real-time analytics.
  • Users can use Amazon EMR to process data using Hadoop/Spark and load the output into Amazon Redshift for BI and analytics. 
  • Amazon QuickSight is the first BI service with pay-per-session pricing that users can use to create reports, visualizations, and dashboards on Redshift data.
  • With Amazon SageMaker, users can use Redshift to prepare their data for machine learning workloads.
  • Using the AWS Schema Conversion Tool and the AWS Database Migration Service (DMS), users can accelerate migrations to Amazon Redshift.
  • Amazon Redshift is deeply integrated with AWS Key Management Service (KMS) and Amazon CloudWatch for security, monitoring, and compliance.
  • Using Lambda UDFs, users can invoke a Lambda function from SQL queries as if they were invoking a user-defined function in Redshift. Users can write Lambda UDFs to integrate with AWS partner services and to access other popular AWS services such as Amazon DynamoDB or Amazon SageMaker, as sketched after this list.
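
As an illustration of the Lambda UDF integration mentioned in the last bullet, the sketch below registers a hypothetical Lambda function as a scalar UDF and calls it from SQL; the function, table, and role names are placeholders.

    -- Register an existing AWS Lambda function as a scalar UDF
    CREATE EXTERNAL FUNCTION lookup_score (INT)
    RETURNS FLOAT
    VOLATILE
    LAMBDA 'customer-score-function'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLambdaRole';

    -- Invoke the Lambda-backed UDF like any other SQL function
    SELECT customer_id, lookup_score(customer_id) AS score
    FROM customers
    LIMIT 10;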

Redshift partner console integration (preview): Users can accelerate data onboarding and create valuable business insights in minutes by integrating with select partner solutions in the Redshift console. With these solutions, users can bring data from applications like Salesforce, Google Analytics, Facebook Ads, Slack, Jira, Splunk, and Marketo into their Amazon Redshift data warehouse in an efficient and streamlined way. It also enables users to join these disparate datasets and analyze them together to produce actionable insights.

Integrated with third-party tools: There are many options to enhance Amazon Redshift by working with industry-leading tools and experts for loading, transforming, and visualizing data. A large number of AWS Partners have certified their solutions to work with Amazon Redshift.

Amazon Redshift automates common maintenance tasks on the data warehouse.

Automated provisioning: Users can deploy a new data warehouse with just a few clicks in the AWS console, and Amazon Redshift automatically provisions the infrastructure. Most administrative tasks are automated, such as backups and replication. New capabilities are released transparently, eliminating the need to schedule and apply upgrades and patches.

Automated backups: Data in Amazon Redshift is automatically backed up to Amazon S3, and Amazon Redshift can asynchronously replicate the snapshots to S3 in another region for disaster recovery. The cluster is available as soon as the system metadata has been restored, and users can start running queries while data is spooled down in the background.

Automated Table Design: Amazon Redshift continuously monitors user workloads and uses sophisticated algorithms to find ways to improve the physical layout of data to optimize query speeds. Automatic Table Optimization selects the best sort and distribution keys to optimize performance for the cluster’s workload.

  • Automatic Vacuum Delete, Automatic Table Sort, and Automatic Analyze eliminate the need for manual maintenance and tuning of Redshift clusters to get the best performance for new clusters and production workloads.  
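
For new tables, the physical layout can simply be delegated to this automation. A minimal sketch, using a hypothetical table:

    -- Let Redshift choose and evolve distribution and sort keys automatically
    CREATE TABLE sales (
        sale_id   BIGINT,
        sale_date DATE,
        store_id  INT,
        amount    DECIMAL(12,2)
    )
    DISTSTYLE AUTO
    SORTKEY AUTO;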

Fault tolerant: There are multiple features that enhance the reliability of the data warehouse cluster. Amazon Redshift continuously monitors the health of the cluster, and automatically re-replicates data from failed drives and replaces nodes as necessary for fault tolerance. Clusters can also be relocated to alternative Availability Zones (AZs) without any data loss or application changes.

Flexible querying: Amazon Redshift gives users the flexibility to execute queries within the console or connect SQL client tools, libraries, or Business Intelligence tools. The Query Editor on the AWS console provides a powerful interface for executing SQL queries on Amazon Redshift clusters and viewing the query results and query execution plan (for queries executed on compute nodes) adjacent to the queries.

Native support for advanced analytics: Redshift supports standard scalar data types such as NUMBER, VARCHAR, and DATETIME and provides native support for the following advanced analytics processing:

  • Spatial data processing: Amazon Redshift provides a polymorphic data type, GEOMETRY, which supports multiple geometric shapes such as Point, LineString, and Polygon. Redshift also provides spatial SQL functions to construct geometric shapes and to import, export, access, and process spatial data. Users can add GEOMETRY columns to Redshift tables and write SQL queries spanning spatial and non-spatial data (a short sketch follows this list). This capability enables users to store, retrieve, and process spatial data and seamlessly enhance business insights by integrating spatial data into analytical queries. With Redshift’s ability to seamlessly query data lakes, users can extend spatial processing to data lakes by integrating external tables in spatial queries.
  • HyperLogLog sketches: HyperLogLog is a novel algorithm that efficiently estimates the approximate number of distinct values in a data set. An HLL sketch is a construct that encapsulates the information about the distinct values in the data set. Users can use HLL sketches to achieve significant performance benefits for queries that compute approximate cardinality over large data sets, with an average relative error between 0.01–0.6%. Redshift provides a first-class data type, HLLSKETCH, and associated SQL functions to generate, persist, and combine HyperLogLog sketches. Amazon Redshift’s HyperLogLog capability uses bias correction techniques and provides high accuracy with a low memory footprint.
  • DATE & TIME data types: Amazon Redshift provides multiple data types, DATE, TIME, TIMETZ, TIMESTAMP, and TIMESTAMPTZ, to natively store and process date/time data. TIME and TIMESTAMP types store the time data without time zone information, whereas TIMETZ and TIMESTAMPTZ types store the time data including the time zone information. Users can use various date/time SQL functions to process the date and time values in Redshift queries.
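
The following is a minimal sketch of the spatial support described in the first bullet above; the table, shapes, and coordinates are hypothetical.

    -- Store polygons and points in GEOMETRY columns and query them with spatial functions
    CREATE TABLE delivery_zones (
        zone_id  INT,
        boundary GEOMETRY
    );

    INSERT INTO delivery_zones VALUES
        (1, ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'));

    -- Find the zone containing a given point
    SELECT zone_id
    FROM delivery_zones
    WHERE ST_Contains(boundary, ST_Point(5, 5));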

Semi-structured data processing: The Amazon Redshift SUPER data type (preview) natively stores semi-structured data in Redshift tables, and uses the PartiQL query language to seamlessly process the semi-structured data.

  • The SUPER data type is schemaless in nature and allows storage of nested values that may contain Redshift scalar values, nested arrays and nested structures.
  • PartiQL is an extension of SQL and provides powerful querying capabilities such as object and array navigation, unnesting of arrays, dynamic typing, and schemaless semantics. This enables users to achieve advanced analytics that combine the classic structured SQL data with the semi-structured SUPER data with superior performance, flexibility and ease-of-use. 
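
A minimal sketch of the SUPER data type and PartiQL navigation follows; the table and JSON shape are hypothetical, and the feature was in preview at the time of writing.

    -- Store semi-structured JSON in a SUPER column
    CREATE TABLE orders_raw (
        order_id BIGINT,
        payload  SUPER
    );

    INSERT INTO orders_raw VALUES
        (1, JSON_PARSE('{"customer": {"id": 42, "name": "Ann"}, "items": [{"sku": "A1", "qty": 2}]}'));

    -- Navigate nested structures and unnest arrays with PartiQL
    SELECT o.order_id,
           o.payload.customer.name AS customer_name,
           i.sku,
           i.qty
    FROM orders_raw o, o.payload.items AS i;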

Amazon Redshift offers fast, industry-leading performance with flexibility.

RA3 instances: RA3 instances deliver up to 3x better price performance than any other cloud data warehouse service. These Amazon Redshift instances maximize speed for performance-intensive workloads that require large amounts of compute capacity, with the flexibility to pay for compute independently of storage by specifying the number of instances needed.

AQUA (Advanced Query Accelerator): AQUA is a hardware accelerated cache that delivers up to 10x better query performance than other cloud data warehouses. 

Efficient storage and high performance query processing: Amazon Redshift delivers fast query performance on datasets ranging in size from gigabytes to petabytes. Columnar storage, data compression, and zone maps reduce the amount of I/O needed to perform queries. Along with the industry standard encodings such as LZO and Zstandard, Amazon Redshift also offers purpose-built compression encoding, AZ64, for numeric and date/time types to provide both storage savings and optimized query performance.

Materialized views: Amazon Redshift materialized views allow users to achieve significantly faster query performance for analytical workloads such as dashboarding, queries from Business Intelligence (BI) tools, and Extract, Load, Transform (ELT) data processing jobs.

  • Users can use materialized views to cache intermediate results in order to speed up slow-running queries.
  • Amazon Redshift can efficiently maintain the materialized views incrementally to continue to provide the low latency performance benefits. 
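
A minimal sketch of a materialized view over a hypothetical sales table, refreshed incrementally:

    -- Precompute an aggregate that dashboards query repeatedly
    CREATE MATERIALIZED VIEW daily_revenue AS
    SELECT sale_date, store_id, SUM(amount) AS revenue
    FROM sales
    GROUP BY sale_date, store_id;

    -- Apply only the changes made to the base table since the last refresh
    REFRESH MATERIALIZED VIEW daily_revenue;

    -- Dashboards and BI tools read the precomputed results
    SELECT sale_date, SUM(revenue) AS revenue
    FROM daily_revenue
    GROUP BY sale_date;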

Machine learning to maximize throughput and performance: Advanced machine learning capabilities in Amazon Redshift deliver high throughput and performance, even with varying workloads or concurrent user activity. Amazon Redshift utilizes sophisticated algorithms to predict and classify incoming queries based on their run times and resource requirements to dynamically manage performance and concurrency while also helping users to prioritize the business critical workloads.

  • Short query acceleration (SQA) sends short queries from applications such as dashboards to an express queue for immediate processing rather than being starved behind large queries.
  • Automatic workload management (WLM) uses machine learning to dynamically manage memory and concurrency, helping maximize query throughput.

Amazon Redshift is a self-learning system that observes the user workload continuously, determining opportunities to improve performance as usage grows, applying optimizations seamlessly, and making recommendations via Redshift Advisor when an explicit user action is needed to further turbocharge Amazon Redshift performance.

Result caching: Amazon Redshift uses result caching to deliver sub-second response times for repeat queries. Dashboard, visualization, and business intelligence tools that execute repeat queries experience a significant performance boost. 

Using Amazon Redshift as a cloud data warehouse gives users the flexibility to pay for compute and storage separately, the ability to pause and resume a cluster, predictable costs with controls, and options to pay as you go or save up to 75% with a Reserved Instance commitment.

Flexible pricing options: Amazon Redshift is the most cost-effective data warehouse, and users have choices to optimize how they pay for the data warehouse. Customers can start small for just $0.25 per hour with no commitments, and scale out for just $1,000 per terabyte per year.

  • Amazon Redshift is the only cloud data warehouse that offers On-Demand pricing with no up-front costs, Reserved Instance pricing which can save up to 75% by committing to a 1- or 3-year term, and per-query pricing based on the amount of data scanned in the user’s Amazon S3 data lake.
  • Amazon Redshift’s pricing includes built-in security, data compression, backup storage, and data transfer. As the size of data grows, customers can use managed storage in the RA3 instances to store data cost-effectively at $0.024 per GB per month.

Predictable cost, even with unpredictable workloads: Amazon Redshift allows customers to scale with minimal cost impact, as each cluster earns up to one hour of free Concurrency Scaling credits per day. These free credits are sufficient for the concurrency needs of 97% of customers. This provides customers with predictability in month-to-month cost, even during periods of fluctuating analytical demand.

Choose node type to get the best value for the workloads: Customers can select from three instance types to optimize Amazon Redshift for the data warehousing needs.

  • RA3 nodes enable users to scale storage independently of compute. With RA3, users get a high performance data warehouse that stores data in a separate storage layer. Users only need to size the data warehouse for the query performance they need.
  • Dense Compute (DC) nodes allow users to create very high-performance data warehouses using fast CPUs, large amounts of RAM, and solid-state disks (SSDs) and are the best choice for less than 500GB of data.
  • DS2 (Dense Storage) nodes enable users to create large data warehouses using hard disk drives (HDDs) for a low price point when purchasing the 3-year Reserved Instances. Most customers who run on DS2 clusters can migrate their workloads to RA3 clusters and get up to 2x performance and more storage for the same cost as DS2.
    Scaling a cluster or switching between node types requires a single API call or a few clicks in the AWS Console.

Data transfer: There is no charge for data transferred between Amazon Redshift and Amazon S3 within the same AWS Region for backup, restore, load, and unload operations. For all other data transfers into and out of Amazon Redshift, users will be billed at standard AWS data transfer rates.

  • If running an Amazon Redshift cluster in Amazon VPC, users will see standard AWS data transfer charges for data transfers over JDBC/ODBC to their Amazon Redshift cluster endpoint.
  • When using Enhanced VPC Routing and unloading data to Amazon S3 in a different region, users will incur standard AWS data transfer charges.

Whether scaling data, or users, Amazon Redshift is virtually unlimited. 

Petabyte-scale data warehousing: Amazon Redshift is simple to use and quickly scales as users’ needs change. With a few clicks in the console or a simple API call, users can easily change the number or type of nodes in the data warehouse, and scale up or down as requirements evolve.

Petabyte-scale data lake analytics: Users can run queries against petabytes of data in Amazon S3 without having to load or transform any data with the Redshift Spectrum feature. Users can use S3 as a highly available, secure, and cost-effective data lake to store unlimited data in open data formats.

  • Amazon Redshift Spectrum executes queries across thousands of parallelized nodes to deliver fast results, regardless of the complexity of the query or the amount of data.  

Limitless concurrency: Amazon Redshift provides consistently fast performance, even with thousands of concurrent queries, whether they query data in Amazon Redshift data warehouse, or directly in Amazon S3 data lake.

  • Amazon Redshift Concurrency Scaling supports virtually unlimited concurrent users and concurrent queries with consistent service levels by adding transient capacity in seconds as concurrency increases. 

Data sharing: Amazon Redshift data sharing (preview) enables a secure and easy way to scale by sharing live data across Redshift clusters. Data Sharing improves the agility of organizations by giving instant, granular and high-performance access to data inside any Redshift cluster without the need to copy or move it.  

AWS has comprehensive security capabilities to satisfy the most demanding requirements, and Amazon Redshift provides data security out-of-the-box at no extra cost.

End-to-end encryption: With just a couple of parameter settings, users can set up Amazon Redshift to use SSL to secure data in transit, and hardware-accelerated AES-256 encryption for data at rest. If users choose to enable encryption of data at rest, all data written to disk will be encrypted as well as any backups. 

Network isolation: Amazon Redshift enables users to configure firewall rules to control network access to the data warehouse cluster. Users can run Redshift inside Amazon Virtual Private Cloud (VPC) to isolate the data warehouse cluster in their own virtual network and connect it to the existing IT infrastructure using an industry-standard encrypted IPsec VPN.

Audit and compliance: Amazon Redshift integrates with AWS CloudTrail to enable users to audit all Redshift API calls. Redshift logs all SQL operations, including connection attempts, queries, and changes to the data warehouse. Users can access these logs using SQL queries against system tables, or choose to save the logs to a secure location in Amazon S3. Amazon Redshift is compliant with SOC1, SOC2, SOC3, and PCI DSS Level 1 requirements. 

Tokenization: AWS Lambda user-defined functions (UDFs) enable customers to use an AWS Lambda function as a UDF in Amazon Redshift and invoke it from Redshift SQL queries. Users can write Lambda UDFs to enable external tokenization, data masking, and identification or de-identification of data by integrating with vendors like Protegrity, and to protect or unprotect sensitive data based on a user’s permissions and groups, at query time.

Granular access controls: Granular row and column level security controls ensure users see only the data they should have access to. Amazon Redshift is integrated with AWS Lake Formation, ensuring Lake Formation’s column level access controls are also enforced for Redshift queries on the data in the data lake.

Amazon Redshift performance 

Amazon Redshift uses a variety of innovations to achieve up to ten times better performance than traditional databases for data warehousing and analytics workloads. These include the following:

Columnar Data Storage: Instead of storing data as a series of rows, Amazon Redshift organizes the data by column. Unlike row-based systems, which are ideal for transaction processing, column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets.

  • Since only the columns involved in the queries are processed and columnar data is stored sequentially on the storage media, column-based systems require far fewer I/Os, greatly improving query performance.

Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores.

  • When loading data into an empty table, Amazon Redshift automatically samples the data and selects the most appropriate compression scheme.
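
As a small illustration, encodings such as AZ64 can also be chosen explicitly, and ANALYZE COMPRESSION reports recommended encodings for an existing table; the table and columns below are hypothetical.

    -- Explicitly choose column encodings, including AZ64 for numeric and date/time columns
    CREATE TABLE events (
        event_id   BIGINT      ENCODE az64,
        event_time TIMESTAMP   ENCODE az64,
        event_type VARCHAR(64) ENCODE lzo
    );

    -- Report recommended encodings for an existing table
    ANALYZE COMPRESSION events;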

Massively Parallel Processing (MPP): Amazon Redshift automatically distributes data and query load across all nodes. Amazon Redshift makes it easy to add nodes to the data warehouse and enables users to maintain fast query performance as the data warehouse grows.

Redshift Spectrum: Redshift Spectrum enables users to run queries against exabytes of data in Amazon S3. There is no loading or ETL required. Even if users don’t store any of their data in Amazon Redshift, they can still use Redshift Spectrum to query datasets as large as an exabyte in Amazon S3. When users issue a query, it goes to the Amazon Redshift SQL endpoint, which generates the query plan.

  • Amazon Redshift determines what data is local and what is in Amazon S3, generates a plan to minimize the amount of Amazon S3 data that needs to be read, requests Redshift Spectrum workers out of a shared resource pool to read and process data from Amazon S3, and pulls results back into Amazon Redshift cluster for any remaining processing.

Materialized views: Materialized views provide significantly faster query performance for repeated and predictable analytical workloads such as dashboards, queries from business intelligence (BI) tools, and ELT (Extract, Load, Transform) data processing. Using materialized views, users can store the pre-computed results of queries and efficiently maintain them by incrementally processing the latest changes made to the source tables.

  • Subsequent queries referencing the materialized views use the pre-computed results to run much faster, and automatic refresh and query rewrite capabilities simplify and automate the usage of materialized views.
  • Materialized views can be created based on one or more source tables using filters, projections, inner joins, aggregations, grouping, functions, and other SQL constructs.

Scalability: The compute and storage capacity of on-premises data warehouses are limited by the constraints of the on-premises hardware. Redshift gives users the ability to scale compute and storage as needed to meet changing workloads.

Automatic Table Optimization (ATO): ATO is a self-tuning capability that helps users achieve the performance benefits of sort and distribution keys without manual effort. ATO continuously observes how queries interact with tables, and uses machine learning to select the best sort and distribution keys to optimize performance for the cluster’s workload.

  • If Redshift determines that applying a key will improve cluster performance, tables will be automatically altered within hours without requiring administrator intervention.
  • Optimizations made by the ATO feature have been shown to increase cluster performance by 24% and 34% using the 3TB and 30TB TPC-DS benchmarks, respectively, versus a cluster without ATO.
  • Additional features like Automatic Vacuum Delete, Automatic Table Sort, and Automatic Analyze eliminate the need for manual maintenance and tuning of Redshift clusters to get the best performance for new clusters and production workloads.
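
Existing tables can also be handed over to Automatic Table Optimization. A minimal sketch, assuming a pre-existing table named sales:

    -- Convert an existing table to automatic distribution and sort key management
    ALTER TABLE sales ALTER DISTSTYLE AUTO;
    ALTER TABLE sales ALTER SORTKEY AUTO;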

Amazon Redshift Advisor: Develops customized recommendations to increase performance and optimize costs by analyzing the workload and usage metrics for a user’s cluster. Sign in to the Amazon Redshift console to view Advisor recommendations.

Data Warehousing Architecture


A data warehouse is a central repository of information coming from one or more data sources. Data typically flows into a data warehouse from transactional systems and other relational databases, and typically includes structured, semi-structured, and unstructured data. This data is processed, transformed, and ingested at a regular cadence. Users including data scientists, business analysts, and decision-makers access the data through BI tools, SQL clients, and spreadsheets.

Data warehouses and OLTP databases

Data warehouses are optimized for batched write operations and reading high volumes of data, whereas online transaction processing (OLTP) databases are optimized for continuous write operations and high volumes of small read operations. In general, data warehouses employ denormalized schemas like the Star schema and Snowflake schema because of high data throughput requirements, whereas OLTP databases employ highly normalized schemas, which are more suited for high transaction throughput requirements. The Star schema consists of a few large fact tables that reference a number of dimension tables. The Snowflake schema, an extension of the Star schema, consists of dimension tables that are normalized even further.
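
To make the schema discussion concrete, the following is a minimal sketch of a Star schema in SQL, with one fact table referencing two dimension tables; all names are hypothetical. In a Snowflake schema, the dimension tables would themselves be normalized into further tables.

    -- Dimension tables hold descriptive attributes
    CREATE TABLE dim_customer (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(128),
        segment     VARCHAR(64)
    );

    CREATE TABLE dim_date (
        date_id   INT PRIMARY KEY,
        cal_date  DATE,
        cal_month INT,
        cal_year  INT
    );

    -- The fact table stores measures and references the dimensions
    CREATE TABLE fact_sales (
        sale_id     BIGINT,
        customer_id INT REFERENCES dim_customer (customer_id),
        date_id     INT REFERENCES dim_date (date_id),
        amount      DECIMAL(12,2)
    );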

Analytics pipelines are designed to handle large volumes of incoming streams of data from heterogeneous sources such as databases, applications, and devices. A typical analytics pipeline has the following stages:

  1. Collect data
  2. Store the data
  3. Process the data
  4. Analyze and visualize the data

#01

Data Collection

 
 

At the data collection stage, users should consider different types of data, such as transactional data, log data, streaming data, and Internet of Things (IoT) data. AWS provides solutions for data storage for each of these types of data.

Transactional Data

Transactional data, such as e-commerce purchase transactions and financial transactions, is typically stored in relational database management systems (RDBMS) or NoSQL database systems. The choice of database solution depends on the use case and application characteristics. A NoSQL database is suitable when the data is not well-structured to fit into a defined schema, or when the schema changes very often. An RDBMS solution, on the other hand, is suitable when transactions happen across multiple table rows and the queries require complex joins.

  • Amazon DynamoDB is a fully managed NoSQL database service that can be used as an OLTP store for user applications. Amazon RDS allows users to implement a SQL-based relational database solution for their applications.
Log Data

Reliably capturing system-generated logs will help users troubleshoot issues, conduct audits, and perform analytics using the information stored in the logs. Amazon Simple Storage Service (Amazon S3) is a popular storage solution for non-transactional data, such as log data, that is used for analytics. Because it provides 11 9’s of durability (that is, 99.999999999 percent durability), Amazon S3 is also a popular archival solution.

Streaming Data

Web applications, mobile devices, and many software applications and services can generate staggering amounts of streaming data—sometimes terabytes per hour—that need to be collected, stored, and processed continuously. Using Amazon Kinesis services, users can do that simply and at a low cost.

IoT Data

Devices and sensors around the world send messages continuously. Enterprises see a growing need today to capture this data and derive intelligence from it. Using AWS IoT, connected devices interact easily and securely with the AWS cloud. AWS IoT makes it easy to use AWS services like AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, and Amazon DynamoDB to build applications that gather, process, analyze, and act on IoT data, without having to manage any infrastructure.

#02

Data Storage

Users can store data in either a data warehouse or data mart, as discussed in the following.

Data Warehouse

A data warehouse is a central repository of information coming from one or more data sources. Using data warehouses, users can run fast analytics on large volumes of data and unearth patterns hidden in the data by leveraging BI tools. Data scientists query a data warehouse to perform offline analytics and spot trends. Users across the organization consume the data using ad hoc SQL queries, periodic reports, and dashboards to make critical business decisions.

Data Mart

A data mart is a simple form of data warehouse focused on a specific functional area or subject matter, and contains a subset of the data stored in a data warehouse. For example, users can have specific data marts for each division in the organization, or segment data marts based on regions. Users can build data marts from a large data warehouse, operational stores, or a hybrid of the two. Data marts are simple to design, build, and administer. However, because data marts are focused on specific functional areas, querying across functional areas can become complex because of the distribution. There are three types of data marts:

  1. Dependent: Dependent data marts draw their data from an existing central data warehouse.
  2. Independent: Independent data marts are created without the use of a central data warehouse, drawing data directly from operational or external sources, or both.
  3. Hybrid: Hybrid data marts can take data from data warehouses or operational systems.

A data mart usually draws data from only a few sources compared to a data warehouse. Data marts are small in size and are more flexible compared to a data warehouse. The following are benefits of a data mart:

  • A data mart helps to improve response times by reducing the volume of data.
  • It provides easy access to frequently requested data.
  • Data marts are simpler to implement than a corporate data warehouse. At the same time, the cost of implementing a data mart is certainly lower than implementing a full data warehouse.
  • Compared to a data warehouse, a data mart is agile. If the model changes, a data mart can be rebuilt more quickly because of its smaller size.
  • A data mart is typically defined by a single subject matter expert, whereas a data warehouse is defined by interdisciplinary SMEs from a variety of domains. Hence, a data mart is more open to change than a data warehouse.
  • Data is partitioned and allows very granular access control privileges.
  • Data can be segmented and stored on different hardware/software platforms.


#03

Data Processing

 
 

The collection process provides data that potentially has useful information. Users can analyze the extracted information for intelligence that will help grow the business. This intelligence might, for example, tell users about end-user behavior and the relative popularity of products. The best practice to gather this intelligence is to load the raw data into a data warehouse to perform further analysis.

To do so, there are two types of processing workflows: batch and real time. The most common forms of processing, online analytic processing (OLAP) and OLTP, each use one of these types. OLAP processing is generally batch-based. In contrast, OLTP systems are oriented towards real-time processing and are generally not well-suited for batch-based processing. Decoupling data processing from the OLTP system keeps the data processing from affecting the OLTP workload. The following are involved in batch processing.

Extract Transform Load (ETL)

ETL is the process of pulling data from multiple sources to load into data warehousing systems. ETL is normally a continuous ongoing process with a well-defined workflow. During this process, data is initially extracted from one or more sources. The extracted data is then cleansed, enriched, transformed, and loaded into a data warehouse. Hadoop framework tools such as Apache Pig and Apache Hive are commonly used in an ETL pipeline to perform transformations on large volumes of data.

Extract Load Transform (ELT)

ELT is a variant of ETL where the extracted data is loaded into the target system first. Transformations are performed after the data is loaded into the data warehouse. ELT typically works well when the target system is powerful enough to handle transformations. Amazon Redshift is often used in ELT pipelines because it is highly efficient in performing transformations.
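
A minimal sketch of an ELT-style transformation inside Redshift, assuming raw data has already been loaded into a hypothetical staging table:

    -- Transform already-loaded staging data into an analytics table inside the warehouse
    CREATE TABLE sales_clean AS
    SELECT
        CAST(sale_id AS BIGINT)       AS sale_id,
        CAST(sale_ts AS TIMESTAMP)    AS sale_time,
        UPPER(TRIM(store_code))       AS store_code,
        CAST(amount AS DECIMAL(12,2)) AS amount
    FROM staging_sales
    WHERE amount IS NOT NULL;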

Online Analytical Processing (OLAP)

OLAP systems store aggregated historical data in multidimensional schemas. Used widely in data mining, OLAP systems allow users to extract data and spot trends on multiple dimensions. Because it is optimized for fast joins, Amazon Redshift is often used to build OLAP systems.

The following are what’s involved in real-time processing of data.

Real-Time Processing

Amazon Kinesis is a solution for capturing and storing streaming data. Users can process this data sequentially and incrementally on a record-by-record basis or over sliding time windows, and use the processed data for a wide variety of analytics including correlations, aggregations, filtering, and sampling. This type of processing is called real-time processing. Information derived from real-time processing gives companies visibility into many aspects of their business and customer activity—such as service usage (for metering or billing), server activity, website clicks, and geolocation of devices, people, and physical goods—and enables them to respond promptly to emerging situations. Real-time processing requires a highly concurrent and scalable processing layer.

  • To process streaming data in real time, users can use AWS Lambda. Lambda can process the data directly from AWS IoT or Amazon Kinesis Streams. Lambda allows users to run code without provisioning or managing servers.
  • The Amazon Kinesis Client Library (KCL) is another way to process data from Amazon Kinesis Streams. KCL gives users more flexibility than AWS Lambda to batch the incoming data for further processing. Users can also use KCL to apply extensive transformations and customizations in their processing logic.
  • Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture streaming data and automatically load it into Amazon Redshift, enabling near-real-time analytics with existing BI tools and dashboards. Users can define the batching rules with Firehose, and then it takes care of reliably batching the data and delivering it to Amazon Redshift.

 

#04

Analysis and Visualization

After processing the data and making it available for further analysis, users need the right tools to analyze and visualize the processed data. In many cases, users can perform data analysis using the same tools they use for processing data. Users can use tools such as SQL Workbench to analyze data in Amazon Redshift with ANSI SQL. Amazon Redshift also works well with popular third-party BI solutions available on the market.

  • Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. QuickSight lets users easily create and publish interactive BI dashboards that include Machine Learning-powered insights. QuickSight dashboards can be accessed from any device, and seamlessly embedded into the applications, portals, and websites.
  • Amazon QuickSight is a fast, cloud-powered BI service that makes it easy to create visualizations, perform ad hoc analysis, and quickly get business insights from the data. Amazon QuickSight is integrated with Amazon Redshift.
  • QuickSight is serverless and can automatically scale to tens of thousands of users without any infrastructure to manage or capacity to plan for. It is also the first BI service to offer pay-per-session pricing, where users only pay when end users access their dashboards or reports, making it cost-effective for large scale deployments.
  • With QuickSight, users can ask business questions of the data in plain language and receive answers in seconds.
  • If using Amazon S3 as primary storage, a popular way to perform analysis and visualization is to run Apache Spark notebooks on Amazon Elastic MapReduce (Amazon EMR). Using this process, users have the flexibility to run SQL or execute custom code written in languages such as Python and Scala.
  • For another visualization approach, Apache Zeppelin is an open source BI solution that users can run on Amazon EMR to visualize data in Amazon S3 using Spark SQL. Users can also use Apache Zeppelin to visualize data in Amazon Redshift.
Analytics Pipeline with AWS Services 

AWS offers a broad set of services to implement an end-to-end analytics platform. The following figure shows the services discussed preceding and where they fit within the analytics pipeline.



Data Warehouse Technology Options

There are several options available for building a data warehouse: row-oriented databases, column-oriented databases, and massively parallel processing architectures.

Row-Oriented Databases

Row-oriented databases typically store whole rows in a physical block. High performance for read operations is achieved through secondary indexes. Databases such as Oracle Database Server, Microsoft SQL Server, MySQL, and PostgreSQL are row-oriented database systems. These systems have been traditionally used for data warehousing, but they are better suited for transactional processing (OLTP) than for analytics.

  • To optimize performance of a row-based system used as a data warehouse, developers use a number of techniques, including building materialized views, creating pre-aggregated rollup tables, building indexes on every possible predicate combination, implementing data partitioning to leverage partition pruning by the query optimizer, and performing index-based joins.
  • Traditional row-based data stores are limited by the resources available on a single machine. Data marts alleviate the problem to an extent by using functional sharding. Users can split the data warehouse into multiple data marts, each satisfying a specific functional area. However, when data marts grow large over time, data processing slows down.
  • In a row-based data warehouse, every query has to read through all of the columns for all of the rows in the blocks that satisfy the query predicate, including columns that were not selected. This approach creates a significant performance bottleneck in data warehouses, where tables have many columns but queries use only a few.
Column-Oriented Databases

 Column-oriented databases organize each column in its own set of physical blocks instead of packing the whole rows into a block. This functionality allows them to be more I/O efficient for read-only queries because they only have to read those columns accessed by a query from disk (or from memory). This approach makes column-oriented databases a better choice than row-oriented databases for data warehousing.

Row-Oriented vs. Column-Oriented Databases

The figure above illustrates the primary difference between row-oriented and column-oriented databases. Rows are packed into their own blocks in a row-oriented database, and columns are packed into their own blocks in a column-oriented database.

  • After faster I/O, the next biggest benefit to using a column-oriented database is improved compression. Because every column is packed into its own set of blocks, every physical block contains the same data type. When all the data is the same data type, the database can use extremely efficient compression algorithms. As a result, less storage is needed compared to a row-oriented database.
  • This approach results in significantly less I/O because the same data is stored in fewer blocks. Some column-oriented databases that are used for data warehousing include Amazon Redshift, Vertica, Teradata Aster, and Druid.
Massively Parallel Processing Architectures

An MPP architecture allows users to use all of the resources available in the cluster for processing data, thereby dramatically increasing performance of petabyte-scale data warehouses. MPP data warehouses allow users to improve performance by simply adding more nodes to the cluster.

  • Amazon Redshift, Druid, Vertica, GreenPlum, and Teradata Aster are some of the data warehouses built on an MPP architecture. Open source frameworks such as Hadoop and Spark also support MPP.

Migration

 

 

Data is an invaluable resource in today’s world and it is growing in volume and complexity faster than ever before. Traditional data warehousing systems cannot keep up. With rigid architectures that require significant investment to maintain, update, and secure, they do not give organizations the opportunity to make the most of their data.

Moving to a cloud data warehouse like Amazon Redshift liberates analytics from these limitations. Users can run queries across petabytes of data in the data warehouse and extend into the data lake, which is called a “lake house” approach. The following are the benefits of migrating to Amazon Redshift:

  • Amazon Redshift is fully managed and simple to use, enabling users to deploy a new data warehouse in minutes and load virtually any type of data from a range of cloud or on-premises data sources. It automates most of the common administrative tasks to manage, monitor, and scale the data warehouse, and delivers fast query performance, improves I/O efficiency, and scales up or down as the performance and capacity needs change. 
  • With a data warehouse on AWS, users can go beyond data loaded onto local disks and query vast amounts of unstructured data in the Amazon S3 data lake – without having to load or transform any data. Users can query data in open formats, including CSV, TSV, Parquet, Sequence, and RCFile. Having data available in open formats in the data lake ultimately gives users the flexibility to use the right engine for different analytical needs.
  • Redshift is the world’s fastest cloud data warehouse and gets faster every year, so users can derive insights from the data even more quickly. For performance intensive workloads, users can use the new RA3 instances to get up to 3x the performance of any cloud data warehouse.

When deciding to migrate from an existing data warehouse to Amazon Redshift, users should consider the following factors when planning the migration strategy:

  • The size of the database and its tables
  • Network bandwidth between the source server and AWS
  • Whether the migration and switchover to AWS will be done in one step or a sequence of steps over time
  • The data change rate in the source system
  • Transformations during migration
  • The partner tools they plan to use for migration and ETL
One-Step Migration

One-step migration is a good option for small databases that don’t require continuous operation. Users can extract existing databases as comma separated value (CSV) files, then use services such as AWS Import/Export Snowball to deliver datasets to Amazon S3 for loading into Amazon Redshift. Users then test the destination Amazon Redshift database for data consistency with the source. Once all validations have passed, the database is switched over to AWS.
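
Once the extracted CSV files are in Amazon S3, loading them into Amazon Redshift is typically done with a COPY command. A minimal sketch with hypothetical bucket, role, and table names:

    -- Bulk-load extracted CSV files from Amazon S3 into the target table
    COPY customers
    FROM 's3://example-migration-bucket/export/customers/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;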

Two-Step Migration

Two-step migration is commonly used for databases of any size: 

  1. Initial data migration: The data is extracted from the source database, preferably during off-peak usage to minimize the impact. The data is then migrated to Amazon Redshift by following the one-step migration approach described previously.
  2. Changed data migration: Data that changed in the source database after the initial data migration is propagated to the destination before switchover. This step synchronizes the source and destination databases. Once all the changed data is migrated, users can validate the data in the destination database, perform necessary tests, and if all tests are passed, switch over to the Amazon Redshift data warehouse.
Database Migration Tools

Several tools and technologies for data migration are available. Users can use some of these tools interchangeably, or other third-party or open-source tools available in the market.

AWS Database Migration Service (DMS): AWS Database Migration Service (DMS) is a self-service tool users can use to migrate their data from the most widely used commercial data warehouses to Amazon Redshift. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. When migrating databases to Amazon Redshift, users can use DMS free for six months.

  • AWS Database Migration Service supports both the one-step and the two-step migration processes described preceding. To follow the two-step migration process, users enable supplemental logging to capture changes to the source system. Users can enable supplemental logging at the table or database level.

AWS Professional Services: AWS Professional Services is a global team of experts that can help users realize their desired business outcomes when using the AWS Cloud. They work together with the user’s team and their chosen members of the AWS Partner Network (APN) to execute enterprise cloud computing initiatives.

  • The AWS team provides assistance through a collection of offerings that help users achieve specific outcomes related to enterprise cloud adoption, and delivers focused guidance through its global specialty practices, which cover a variety of solutions, technologies, and industries.

Amazon Redshift Partners: Many Amazon Redshift Partners have specific offerings, skills, tools, processes, and references to help users de-risk the migration from legacy on-premises data warehouses to AWS.

  • Amazon Redshift Delivery Partners help users load, transform, and analyze data, and architect and implement analytics platforms. These AWS Consulting Partners are validated by the AWS Service Delivery Program for following AWS best practices to deliver AWS services.
  • Amazon Redshift Ready Partners offer products that integrate with Amazon Redshift to enable users to gather data insights and analytics productively, and at scale. These AWS Technology Partner products are validated by the AWS Service Ready Program for integration with Amazon Redshift.

Schema Conversion Tool (SCT): The AWS Schema Conversion Tool automatically converts the source data warehouse schema and a majority of the data warehouse code objects. It makes heterogeneous database migrations predictable by automatically converting the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target database.

  • Any objects that cannot be automatically converted are clearly marked so that they can be manually converted to complete the migration. SCT can scan users’ application source code for embedded SQL statements and convert them as part of a database schema conversion project.
  • During this process, SCT performs cloud-native code optimization by converting legacy Oracle and SQL Server functions to their equivalent AWS services, thus helping users modernize their applications at the same time as the database migration. Once schema conversion is complete, SCT can help migrate data from a range of data warehouses to Amazon Redshift using built-in data migration agents.

 

Amazon Redshift pricing


Amazon Redshift costs less to operate than any other data warehouse. Start small at $0.25 per hour and scale up to petabytes of data and thousands of concurrent users. Users can choose what is right for their business needs, with the ability to grow storage without over-provisioning compute, and the flexibility to grow compute capacity without increasing storage costs. Amazon Redshift supports the ability to pause and resume a cluster, allowing users to easily suspend on-demand billing while the cluster is not being used. While the cluster is paused, customers are only charged for the cluster’s storage. For steady-state production workloads, customers can get significant discounts over on-demand pricing by switching to Reserved Instances.

Billing commences for a data warehouse cluster as soon as the data warehouse cluster is available. Billing continues until the data warehouse cluster terminates, which would occur upon deletion or in the event of instance failure. Customers are billed based on the following:

  • Compute node hours: Compute node hours are the total number of hours run across all of a customer’s compute nodes for the billing period. Node usage hours are billed for each hour the data warehouse cluster is running in an available state. Customers who no longer wish to be charged for a data warehouse cluster need to terminate it to avoid being billed for additional node hours. Partial node hours consumed are billed as full hours. Customers are billed for 1 unit per node per hour, so a 3-node data warehouse cluster running persistently for an entire month would incur 2,160 instance hours. Customers are not charged for leader node hours; only compute nodes incur charges.
  • Managed storage: Customers pay for data stored in managed storage at a fixed GB-month rate for their region. Managed storage comes exclusively with RA3 node types, and customers pay the same low rate for Redshift managed storage regardless of data size. Usage of managed storage is calculated hourly based on the total data present in the managed storage. Customers can monitor the amount of data in an RA3 cluster via Amazon CloudWatch or the AWS Management Console. Customers do not pay for any data transfer charges between RA3 nodes and managed storage. Managed storage charges do not include backup storage charges due to automated and manual snapshots. Once the cluster is terminated, customers continue to be charged for the retention of manual backups.
  • Backup Storage: Backup storage is the storage associated with automated and manual snapshots for the data warehouse. Increasing the backup retention period or taking additional snapshots increases the backup storage consumed by the data warehouse. There is no additional charge for backup storage up to 100% of the provisioned storage for an active data warehouse cluster. For example, if there is an active Single Node XL data warehouse cluster with 2TB of local instance storage, AWS will provide up to 2TB a month of backup storage at no additional charge. Backup storage beyond the provisioned storage size, and backups stored after the cluster is terminated, are billed at standard Amazon S3 rates.
  • Data transfer: There is no data transfer charge for data transferred to or from Amazon Redshift and Amazon S3 within the same AWS Region. For all other data transfers into and out of Amazon Redshift, customers will be billed at standard AWS data transfer rates.
  • Data scanned: With Redshift Spectrum, customers are charged for the amount of Amazon S3 data scanned to execute a query. There are no charges for Redshift Spectrum when queries are not running. If the data is stored in a columnar format, such as Parquet or RC, charges will go down, as Redshift Spectrum will only scan the columns needed by the query, rather than processing entire rows. Similarly, if the data is compressed using one of Redshift Spectrum’s supported formats, costs will also go down. Customers pay the standard Amazon S3 rates for data storage and Amazon Redshift instance rates for the cluster used.
  • Concurrency Scaling: With Concurrency Scaling, Redshift automatically adds transient capacity to provide consistently fast performance, even with thousands of concurrent users and queries. There are no resources to manage, no upfront costs, and customers are not charged for the startup or shutdown time of the transient clusters. Customers can accumulate one hour of Concurrency Scaling cluster credits every 24 hours while the main cluster is running. Customers are charged the per-second on-demand rate for a Concurrency Scaling cluster used in excess of the free credits – only when it is serving their queries – with a one-minute minimum charge each time a Concurrency Scaling cluster is activated. The per-second on-demand rate is based on the type and number of nodes in the Amazon Redshift cluster.
