Amazon Neptune

Amazon Neptune is a fast, reliable, and fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. The core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. Supporting both the Apache TinkerPop Gremlin stack and the W3C RDF 1.1 / SPARQL 1.1 standards, Amazon Neptune efficiently stores and navigates highly connected data, allowing developers to create interactive graph applications over billions of relationships.

As a purpose-built cloud service, Amazon Neptune is seamlessly embedded into the Amazon Web Services (AWS) ecosystem and comes with a broad set of enterprise features, including SDKs for deployment and configuration, high availability and read scaling using replication, automatic backup and restore functionality, point-in-time recovery, monitoring, encryption at rest, security using VPCs and integrated access management, and audit logs. Neptune is highly available, with read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across Availability Zones, and it supports encryption both at rest and in transit. A fully managed Neptune DB cluster consists of the following components:

  • Primary DB instance – Supports read and write operations, and performs all of the data modifications to the cluster volume. Each Neptune DB cluster has one primary DB instance that is responsible for writing (that is, loading or modifying) graph database contents.
  • Neptune replica – Connects to the same storage volume as the primary DB instance and supports only read operations. Each Neptune DB cluster can have up to 15 Neptune replicas in addition to the primary DB instance. This provides high availability by locating Neptune replicas in separate Availability Zones and distributing load from reading clients.
  • Cluster volume – Amazon Neptune data is stored in the cluster volume, which is designed for reliability and high availability. A cluster volume consists of copies of the data across multiple Availability Zones in a single AWS Region. Because data is automatically replicated across Availability Zones, it is highly durable, and there is little possibility of data loss. 

Amazon Neptune Benefits

Amazon Neptune supports open graph APIs for both Gremlin and SPARQL, and provides high performance for both of these graph models and their query languages. Users can choose the Property Graph (PG) model and its open source query language, the Apache TinkerPop Gremlin graph traversal language, or the W3C standard Resource Description Framework (RDF) model and its standard SPARQL query language.

Users can use Amazon Neptune to create sophisticated, interactive graph applications that can query billions of relationships in milliseconds. SQL queries for highly connected data are complex and hard to tune for performance. With Neptune, users can use the popular graph query languages TinkerPop Gremlin and SPARQL to execute powerful queries that are easy to write and perform well on connected data. This capability significantly reduces code complexity so that users can quickly create applications that process relationships.
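To make the contrast with SQL concrete, a question that would take several self-joins in a relational schema is a single short traversal in Gremlin. The sketch below only builds the JSON body that Neptune's HTTP Gremlin endpoint accepts; the cluster endpoint and the 'person'/'knows' schema are placeholders:

```python
import json

# Placeholder cluster endpoint -- replace with a real Neptune endpoint.
NEPTUNE_ENDPOINT = "https://my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com:8182/gremlin"

def gremlin_http_payload(query: str) -> str:
    """Build the JSON body for a POST to Neptune's HTTP Gremlin endpoint."""
    return json.dumps({"gremlin": query})

# "Names of friends of friends of Alice" -- several self-joins in SQL,
# a single traversal in Gremlin.
query = "g.V().has('person','name','Alice').out('knows').out('knows').values('name')"
payload = gremlin_http_payload(query)
print(payload)
```

Sending the payload to a live cluster would return the matching names; here only the request construction is shown.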

 

Amazon Neptune is highly available, durable, and ACID (Atomicity, Consistency, Isolation, Durability) compliant. Neptune is designed to provide greater than 99.99% availability. It features fault-tolerant and self-healing storage built for the cloud that replicates six copies of the data across three Availability Zones. Amazon Neptune continuously backs up data to Amazon S3 and transparently recovers from physical storage failures. For high availability, instance failover typically takes less than 30 seconds.

Amazon Neptune is a purpose-built, high-performance graph database optimized for processing graph queries. Neptune supports up to 15 low-latency read replicas across three Availability Zones to scale read capacity and execute more than one hundred thousand graph queries per second. Users can easily scale a deployment up and down between smaller and larger instance types as their needs change.

Amazon Neptune Features

Performance: Amazon Neptune allows users to build interactive graph applications that can query billions of relationships with millisecond latency. It is optimized for in-memory data processing and can be deployed on different instance types, backed by a durable storage service layer that scales to 100B+ triples. Query processing is fully ACID with immediate consistency. A dedicated bulk load API enables efficient loading from S3 buckets in different (possibly compressed) input formats (N-Triples, N-Quads, RDF/XML, and Turtle for RDF; CSV for Property Graphs). To scale query throughput and provide high availability, Amazon Neptune allows users to dynamically spin up as many as 15 read replicas.

Ease of use: As a fully managed cloud service, Amazon Neptune can be deployed and scaled within minutes, significantly easing graph data management operations. Software upgrades are deployed automatically during configurable maintenance windows, and deployments are monitored continuously so that automatic failovers can be initiated when problems are detected. Deployments can be triggered and maintained through a web-based console or using cloud SDKs (such as the AWS CLI or AWS CloudFormation), allowing users to automate Neptune deployments and related infrastructure setup. To provide a high level of security, Amazon Neptune is deployed into VPCs and offers authentication via SigV4 request signing (SigV4 samples for Gremlin and SPARQL clients are available as open source). Orthogonally, encryption at rest is supported using the AWS Key Management Service.
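The SigV4 signing mentioned above can be sketched with the standard library alone. This is a condensed illustration of the signing steps (canonical request, string to sign, derived key), not a production client; the `neptune-db` service name and the credential values are assumptions:

```python
import datetime, hashlib, hmac
from urllib.parse import urlparse

def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def sigv4_headers(method, url, body, access_key, secret_key, region,
                  service="neptune-db", now=None):
    """Headers for a SigV4-signed request (condensed illustration only)."""
    now = now or datetime.datetime.utcnow()
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    parsed = urlparse(url)
    canonical_headers = f"host:{parsed.netloc}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    payload_hash = hashlib.sha256(body.encode()).hexdigest()
    canonical_request = "\n".join([method, parsed.path or "/", parsed.query,
                                   canonical_headers, signed_headers, payload_hash])
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join(["AWS4-HMAC-SHA256", amz_date, scope,
                                hashlib.sha256(canonical_request.encode()).hexdigest()])
    key = _sign(_sign(_sign(_sign(("AWS4" + secret_key).encode(), date_stamp),
                            region), service), "aws4_request")
    signature = hmac.new(key, string_to_sign.encode(), hashlib.sha256).hexdigest()
    return {
        "x-amz-date": amz_date,
        "Authorization": (f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
                          f"SignedHeaders={signed_headers}, Signature={signature}"),
    }

# Dummy credentials and endpoint, purely for illustration.
headers = sigv4_headers("GET",
    "https://my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com:8182/status",
    "", "AKIDEXAMPLE", "wJalrXUtnFEMIEXAMPLEKEY", "us-east-1")
print(headers["Authorization"][:60])
```

In practice the open source SigV4 samples for Gremlin and SPARQL clients, or an AWS SDK, handle this automatically.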

Reliability: AWS uses the concept of Regions, where each Region is a separate geographic area with multiple, isolated locations known as Availability Zones (AZs). Neptune maintains six copies of the data spread across three AZs and, in the rare case of a failure (such as a hardware outage), automatically attempts to recover the database in a healthy AZ without data loss. In addition to automated restore functionality, users can take and restore snapshots of the database using the console or SDKs. A variety of operational metrics are available to the database owner through Amazon CloudWatch and can be used to define custom alerts and automate infrastructure management.

Standards compliance: Amazon Neptune is fully compliant with the Apache TinkerPop Gremlin stack and implements the W3C RDF 1.1 / SPARQL 1.1 standards. This allows AWS customers to leverage the broad ecosystem of open source and commercial tooling built around the Gremlin and SPARQL stacks. For example, on the SPARQL side, existing client libraries such as Apache Jena/ARQ or RDF4J can be used out of the box to interact with Neptune. Complementing the open source projects, the AWS Partner Network (APN) provides value-added services such as graph management, visualization, exploration, search, and application building on top of Amazon Neptune.

Easy to Use
Getting started with Amazon Neptune is easy. Just launch a new Amazon Neptune database instance using the AWS Management Console. Neptune database instances are pre-configured with parameters and settings appropriate for the database instance class that users have selected. Users can launch a database instance and connect an application within minutes without additional configuration. Database Parameter Groups provide granular control and fine-tuning of the database.

Easy to Operate
Amazon Neptune makes it easy to operate a high performance graph database. With Neptune, users do not need to create custom indexes over the graph data. Amazon Neptune provides timeout and memory usage limitations to reduce the impact of queries that consume too many resources.

Monitoring and Metrics
Amazon Neptune provides Amazon CloudWatch metrics for the database instances. Users can use the AWS Management Console to view over 20 key operational metrics for the database instances, including compute, memory, storage, query throughput, and active connections.

Automatic Software Patching
Amazon Neptune will keep the database up-to-date with the latest patches. Users can control if and when the instance is patched via Database Engine Version Management.

Database Event Notifications
Amazon Neptune can notify users via email or SMS of important database events like automated failover. Users can use the AWS Management Console to subscribe to different database events associated with Amazon Neptune databases.

Fast Database Cloning
Amazon Neptune supports quick, efficient cloning operations, where entire multi-terabyte database clusters can be cloned in minutes. Cloning is useful for a number of purposes including application development, testing, database updates, and running analytical queries. Immediate availability of data can significantly accelerate the software development and upgrade projects, and make analytics more accurate.

Users can clone an Amazon Neptune database with just a few clicks in the Management Console, without impacting the production environment. The clone is distributed and replicated across 3 Availability Zones.

High Throughput, Low Latency for Graph Queries
Amazon Neptune is a purpose-built, high-performance graph database engine. Amazon Neptune efficiently stores and navigates graph data, and uses a scale-up, in-memory optimized architecture to allow for fast query evaluation over large graphs. With Neptune, users can use either Gremlin or SPARQL to execute powerful queries that are easy to write and perform well.

Easy Scaling of Database Compute Resources
With a few clicks in the AWS Management Console, users can scale the compute and memory resources powering a production cluster up or down by creating new replica instances of the desired size, or by removing instances. Compute scaling operations typically complete in a few minutes.

Storage that Automatically Scales
Amazon Neptune automatically grows the size of the database volume as storage needs grow. The volume grows in increments of 10 GB up to a maximum of 64 TB, so users don't need to provision excess storage to handle future growth.

Low Latency Read Replicas
Increase read throughput to support high volume application requests by creating up to 15 database read replicas. Amazon Neptune replicas share the same underlying storage as the source instance, lowering costs and avoiding the need to perform writes at the replica nodes. This frees up more processing power to serve read requests and reduces the replica lag time – often down to single digit milliseconds. Neptune also provides a single endpoint for read queries so the application can connect without having to keep track of replicas as they are added and removed.

Network Isolation
Amazon Neptune runs in Amazon VPC, which lets users isolate the database in their own virtual network and connect it to their on-premises IT infrastructure using industry-standard encrypted IPsec VPNs. In addition, using Neptune's VPC configuration, users can configure firewall settings and control network access to the database instances.

Resource-Level Permissions
Amazon Neptune is integrated with AWS Identity and Access Management (IAM) and gives users the ability to control the actions that IAM users and groups can take on specific Neptune resources, including Database Instances, Database Snapshots, Database Parameter Groups, Database Event Subscriptions, and Database Option Groups. In addition, users can tag Neptune resources and control the actions that IAM users and groups can take on groups of resources that have the same tag (and tag value). For example, users can configure IAM rules to ensure that developers are able to modify "Development" database instances, but only database administrators can modify and delete "Production" database instances.
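As a sketch of the tag-based control described above, an IAM policy along these lines could limit a developer group to instances tagged as Development. The action name and tag key are illustrative, not a definitive list of Neptune's management actions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DevelopersModifyDevOnly",
      "Effect": "Allow",
      "Action": ["rds:ModifyDBInstance"],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Environment": "Development" }
      }
    }
  ]
}
```

Attaching a policy like this to the developer group, and a broader one to the database administrators, produces the Development/Production split described above.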

Encryption
Amazon Neptune allows users to encrypt databases using keys that they create and control through AWS Key Management Service (KMS). On a database instance running with Neptune encryption, data stored at rest in the underlying storage is encrypted, as are the automated backups, snapshots, and replicas in the same cluster.

Advanced Auditing
Amazon Neptune allows users to log database events with minimal impact on database performance. Logs can later be analyzed for database management, security, governance, regulatory compliance, and other purposes. Users can also monitor activity by sending audit logs to Amazon CloudWatch.

Supports Property Graph Apache TinkerPop Gremlin
Property Graphs are popular because they are familiar to developers who are used to relational models. The Gremlin traversal language provides a way to quickly traverse Property Graphs. Amazon Neptune supports the Property Graph model using the open source Apache TinkerPop Gremlin traversal language and provides a Gremlin WebSocket server that supports TinkerPop version 3.3. With Neptune, users can quickly build fast Gremlin traversals over property graphs. Existing Gremlin applications can easily use Neptune by changing the Gremlin service configuration to point to a Neptune instance.
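As an illustration of that last point, repointing an existing Gremlin Console or driver at Neptune is typically a matter of editing the remote connection configuration. The host below is a placeholder, and the serializer shown is one common choice:

```yaml
hosts: [my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com]
port: 8182
connectionPool: { enableSsl: true }
serializer:
  className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1
  config: { serializeResultToString: true }
```

With a configuration like this in place, existing traversals run against Neptune unchanged.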

Supports W3C’s Resource Description Framework (RDF) 1.1 and SPARQL 1.1
RDF is popular because it provides flexibility for modeling complex information domains. There are a number of existing free or public datasets available in RDF including Wikidata and PubChem, a database of chemical molecules. Amazon Neptune supports the W3C’s Semantic Web standards of RDF 1.1 and SPARQL 1.1 (Query and Update), and provides an HTTP REST endpoint that implements the SPARQL Protocol 1.1. With Neptune, you can easily use the SPARQL endpoint for both existing and new graph applications.
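For instance, a SPARQL 1.1 Protocol client can be as simple as a form-encoded HTTP POST against the /sparql endpoint. The sketch below only constructs the request; the endpoint and the FOAF-based pattern are illustrative:

```python
from urllib.parse import urlencode

# Placeholder endpoint; Neptune exposes SPARQL at https://<cluster>:8182/sparql.
SPARQL_ENDPOINT = "https://my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com:8182/sparql"

query = """\
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE { ?person foaf:knows ?friend . ?friend foaf:name ?name }
LIMIT 10
"""

# SPARQL 1.1 Protocol: the query travels form-encoded in the POST body.
body = urlencode({"query": query})
headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Accept": "application/sparql-results+json",  # ask for JSON result bindings
}
print(body[:30])
```

Posting this body to a live endpoint would return SPARQL JSON result bindings; only the request construction is shown here.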

Property Graph Bulk Loading
Amazon Neptune supports fast, parallel bulk loading for Property Graph data that is stored in S3. Users can use a REST interface to specify the S3 location of the data, which is loaded from a CSV-delimited format into nodes and edges. See the Neptune Property Graph bulk loading documentation for more details.
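As a sketch of that CSV format, a vertex file and an edge file might look like the following. The `~`-prefixed system columns and type suffixes follow the Neptune loader conventions, while the schema itself is made up:

```csv
~id,~label,name:String,age:Int
v1,person,Alice,34
v2,person,Bob,29
```

```csv
~id,~from,~to,~label
e1,v1,v2,knows
```

The first file creates two `person` vertices; the second creates a `knows` edge from v1 to v2.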

RDF Bulk Loading
Amazon Neptune supports fast, parallel bulk loading for RDF data that is stored in S3. Users can use a REST interface to specify the S3 location of the data. The N-Triples (NT), N-Quads (NQ), RDF/XML, and Turtle RDF 1.1 serializations are supported. See the Neptune RDF bulk loading documentation for more details.
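Kicking off a bulk load (for either RDF or Property Graph data) is a single REST call against the cluster's /loader endpoint. This sketch only assembles the request body; the field names follow the Neptune loader API, while the bucket, role ARN, and region are placeholders:

```python
import json

def loader_request(source_s3_uri: str, fmt: str, iam_role_arn: str, region: str) -> str:
    """JSON body for Neptune's bulk loader endpoint (POST https://<cluster>:8182/loader)."""
    return json.dumps({
        "source": source_s3_uri,     # S3 location holding the data files
        "format": fmt,               # e.g. csv, ntriples, nquads, rdfxml, turtle
        "iamRoleArn": iam_role_arn,  # role that allows Neptune to read from S3
        "region": region,            # region of the S3 bucket
    })

# Placeholder bucket, role, and region.
body = loader_request("s3://my-bucket/rdf/", "turtle",
                      "arn:aws:iam::123456789012:role/NeptuneLoadFromS3", "us-east-1")
print(body)
```

The response to a real load request contains a load ID that can be polled for status; see the loader documentation for the full parameter list.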

Pay Only for What You Use
There is no up-front commitment with Amazon Neptune; users simply pay an hourly charge for each instance that is launched. And, when finished with a Neptune database instance, users can easily delete it. Users do not need to over-provision storage as a safety margin, and users only pay for the storage users actually consume. To see more details, visit the Neptune Pricing page.

Instance Monitoring and Repair
The health of your Amazon Neptune database and its underlying EC2 instance is continuously monitored. If the instance powering the database fails, the database and associated processes are automatically restarted. Amazon Neptune recovery does not require the potentially lengthy replay of database redo logs, so the instance restart times are typically 30 seconds or less. It also isolates the database buffer cache from database processes, allowing the cache to survive a database restart.

Multi-AZ Deployments with Read Replicas
On instance failure, Amazon Neptune automates failover to one of up to 15 Neptune replicas that users have created in any of three Availability Zones. If no Neptune replicas have been provisioned, in the case of a failure Amazon Neptune attempts to create a new database instance automatically.

Fault-tolerant and Self-healing Storage
Each 10GB chunk of a database volume is replicated six ways, across three Availability Zones. Amazon Neptune uses fault-tolerant storage that transparently handles the loss of up to two copies of data without affecting database write availability and up to three copies without affecting read availability. Neptune’s storage is also self-healing; data blocks and disks are continuously scanned for errors and replaced automatically.
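The fault-tolerance rule above can be restated as a small availability check, assuming the six copies are spread two per Availability Zone. This is a sketch of the described behavior, not of the actual replication protocol:

```python
TOTAL_COPIES = 6  # two copies of each 10 GB chunk in each of three AZs

def write_available(lost_copies: int) -> bool:
    """Writes survive the loss of up to two of the six copies."""
    return lost_copies <= 2

def read_available(lost_copies: int) -> bool:
    """Reads survive the loss of up to three of the six copies."""
    return lost_copies <= 3

# Losing a whole AZ (two copies) leaves both reads and writes available;
# losing an AZ plus one more copy still leaves reads available.
print(write_available(2), read_available(3))
```

In other words, the storage layer tolerates the loss of an entire Availability Zone without losing write availability, and an AZ plus one additional copy without losing read availability.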

Automatic, Continuous, Incremental Backups and Point-in-time Restore
Amazon Neptune’s backup capability enables point-in-time recovery for the instance, allowing users to restore the database to any second during the retention period, up until the last five minutes. The automatic backup retention period can be configured up to thirty-five days. Automated backups are stored in Amazon S3, which is designed for 99.999999999% durability. Amazon Neptune backups are automatic, incremental, and continuous, and have no impact on database performance.

Database Snapshots
Database Snapshots are user-initiated backups of the instance stored in Amazon S3 that will be kept until you explicitly delete them. They leverage the automated incremental snapshots to reduce the time and storage required. Users can create a new instance from a Database Snapshot whenever desired.

Amazon Neptune Use cases

Amazon Neptune is designed to offer greater than 99.99 percent availability. It increases database performance and availability by tightly integrating the database engine with an SSD-backed virtualized storage layer that is built for database workloads. Amazon Neptune storage is fault-tolerant and self-healing. Disk failures are repaired in the background without loss of database availability. 

Knowledge Graphs

Amazon Neptune helps build knowledge graph applications. A knowledge graph allows users to store information in a graph model and use graph queries to enable end users to easily navigate highly connected datasets. Amazon Neptune supports open source and open standard APIs that let users quickly leverage existing information resources to build knowledge graphs and host them on a fully managed service. For example, if a user is interested in the Mona Lisa, a knowledge graph application can also help them discover other works of art by Leonardo da Vinci, or other works of art located in the Louvre. Using a knowledge graph, users can add topical information to product catalogs, build and query complex models of regulatory rules, or model general information, like Wikidata.

Identity Graphs

Users can use Amazon Neptune to build identity graphs for any identity resolution solutions, including device and social graphs, personalization and recommendations, and pattern detection. Using a graph database for an identity graph enables users to link identifiers and update profiles easily and query at ultra-low latency — enabling faster updates and up-to-date profile data for ad targeting, personalization, analytics, and ad attribution.

Fraud Detection

With Amazon Neptune, users can use relationships to process financial and purchase transactions in near real time to easily detect fraud patterns. Neptune provides a fully managed service to execute fast graph queries to detect that a potential purchaser is using the same email address and credit card as a known fraud case. If users are building a retail fraud detection application, Neptune can help build graph queries to easily detect relationship patterns like multiple people associated with a personal email address, or multiple people sharing the same IP address but residing in different physical addresses.

Recommendation Engines

Amazon Neptune allows users to store relationships between information such as customer interests, friends, and purchase history in a graph and quickly query it to make recommendations that are personalized and relevant. For example, with Neptune users can use a highly available graph database to make product recommendations to a user based on which products are purchased by others who follow the same sport and have similar purchase history. Or, users can identify people who have a friend in common but don't yet know each other, and make a friendship recommendation.

Social Networking

Amazon Neptune can quickly and easily process large sets of user profiles and interactions to build social networking applications. Neptune enables highly interactive graph queries with high throughput to bring social features into applications. For example, when building a social feed into an application, users can use Neptune to provide results that prioritize showing end users the latest updates from their family, from friends whose updates they ‘Like,’ and from friends who live close to them.

Network / IT Operations

Users can use Amazon Neptune to store a graph of the network and use graph queries to answer questions like how many hosts are running a specific application. Neptune can store and process billions of events to manage and secure the network. When detecting an anomalous event, users can use Neptune to quickly understand how it might affect the network by querying for a graph pattern using the attributes of the event. Users can query Neptune to find other hosts or devices that may be compromised. For example, if users detect a malicious file on a host, Neptune can help find the connections between the hosts that spread the malicious file and enable users to trace it to the original host that downloaded it.

Life Sciences

Amazon Neptune helps users build applications that store and navigate information in the life sciences, and process sensitive data easily using encryption at rest. For example, users can use Neptune to store models of disease and gene interactions, and search for graph patterns within protein pathways to find other genes that may be associated with a disease. Users can model chemical compounds as a graph and query for patterns in molecular structures. Neptune also helps integrate information to tackle challenges in healthcare and life sciences research. Users can use Neptune to create and store data across different systems and topically organize research publications to quickly find relevant information.

Architecture

Graph databases

Graph databases, like Amazon Neptune, are purpose-built to store and navigate relationships. They have advantages over relational databases for use cases like social networking, recommendation engines, and fraud detection, where you need to create relationships between data and quickly query those relationships. There are a number of challenges to building these types of applications with a relational database: users would need multiple tables with multiple foreign keys, the SQL queries needed to navigate this data would require nested queries and complex joins that quickly become unwieldy, and those queries would not perform well as the data size grows over time.

Neptune uses graph structures such as nodes (data entities), edges (relationships), and properties to represent and store data. Relationships are stored as first-class citizens of the data model. This allows data in nodes to be directly linked, dramatically improving the performance of queries that navigate relationships in the data. Neptune’s interactive performance at scale effectively enables a broad set of graph use cases.

  • Graph databases can represent how entities relate by using actions, ownership, parentage, and so on. Whenever connections or relationships between entities are at the core of the data users are trying to model, a graph database is a natural choice. Therefore, graph databases are useful for modeling and querying social networks, business relationships, dependencies, shipping movements, and similar items.
  • Users can use edges to show typed relationships between entities (also called vertices or nodes). Edges can describe parent-child relationships, actions, product recommendations, purchases, and so on. A relationship, or edge, is a connection between two vertices that always has a start node, end node, type, and direction.

Fraud Detection: Another use case for graph databases is detecting fraud. For example, users can track credit card purchases and purchase locations to detect uncharacteristic use. Detecting fraudulent accounts is another example. With Amazon Neptune, users can use relationships to process financial and purchase transactions in near-real time to easily detect fraud patterns. Neptune provides a fully managed service to execute fast graph queries to detect that a potential purchaser is using the same email address and credit card as a known fraud case.

  • When building a retail fraud detection application, Neptune can help build graph queries. These queries can help easily detect relationship patterns, such as multiple people associated with a personal email address or multiple people who share the same IP address but reside in different physical addresses.

With Amazon Neptune, users can store relationships between information categories such as customer interests, friends, and purchase history in a graph. Users can then quickly query it to make recommendations that are personalized and relevant. For example, users can use a highly available graph database to make product recommendations to a user based on which products are purchased by others who follow the same sport and have similar purchase history. Or, users can identify people who have a friend in common, but don’t yet know each other, and make a friendship recommendation.

Knowledge Graphs: Amazon Neptune helps build knowledge graph applications. A knowledge graph lets users store information in a graph model and use graph queries to help end users navigate highly connected datasets more easily. Neptune supports open source and open standard APIs so that users can quickly use existing information resources to build knowledge graphs and host them on a fully managed service. For example, suppose that a user is interested in the Mona Lisa by Leonardo da Vinci. Users can help this user discover other works of art by the same artist or other works located in the Louvre. Using a knowledge graph, users can add topical information to product catalogs, build and query complex models of regulatory rules, or model general information, like Wikidata.

Life Sciences: Amazon Neptune helps build applications that store and navigate information in the life sciences, and process sensitive data easily using encryption at rest. For example, users can use Neptune to store models of disease and gene interactions. Users can search for graph patterns within protein pathways to find other genes that might be associated with a disease. Users can model chemical compounds as a graph and query for patterns in molecular structures. Neptune helps integrate information to tackle challenges in healthcare and life sciences research. Users can use Neptune to create and store patient relationships from medical records across different systems. Users can topically organize research publications to find relevant information quickly.

Network / IT Operations: Users can use Amazon Neptune to store a graph of the network and then use graph queries to answer questions like how many hosts are running a specific application. Neptune can store and process billions of events to manage and secure the network. If users detect an event, they can use Neptune to quickly understand how it might affect the network by querying for a graph pattern using the attributes of the event. Users can issue graph queries to Neptune to find other hosts or devices that may be compromised. For example, if users detect a malicious file on a host, Neptune can help find the connections between the hosts that spread the malicious file and trace it back to the original host that downloaded it.

Neptune supports two different graph query languages: Gremlin (Apache TinkerPop3) and SPARQL (SPARQL 1.1).

  • Gremlin is a graph traversal language and, as such, a query in Gremlin is a traversal made up of discrete steps. Each step follows an edge to a node.
  • SPARQL is a declarative query language based on graph pattern-matching standardized by the W3C.
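To make the contrast concrete, here is the same question, "whom does Alice know?", phrased both ways. The property keys and the ex: prefix are illustrative:

```python
# Gremlin: an imperative traversal, step by step.
gremlin_query = (
    "g.V().has('person','name','Alice')"  # start at Alice's vertex
    ".out('knows')"                       # follow outgoing 'knows' edges
    ".values('name')"                     # project each friend's name
)

# SPARQL: a declarative graph pattern; the engine decides the evaluation order.
sparql_query = """\
PREFIX ex: <http://example.org/>
SELECT ?name
WHERE {
  ?alice  ex:name  "Alice" .
  ?alice  ex:knows ?friend .
  ?friend ex:name  ?name .
}
"""
print(gremlin_query)
```

Both queries would return the same set of names on equivalent data; the difference is purely in how the navigation is expressed.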

Machine Learning

Amazon Neptune ML is a new capability of Neptune that uses Graph Neural Networks (GNNs), a machine learning technique purpose-built for graphs, to make easy, fast, and more accurate predictions using graph data. With Neptune ML, users can improve the accuracy of most predictions for graphs by over 50% when compared to making predictions using non-graph methods.

  • Making accurate predictions on graphs with billions of relationships can be difficult and time consuming. Existing ML approaches such as XGBoost can’t operate effectively on graphs because they are designed for tabular data. As a result, using these methods on graphs can take time, require specialized skills from developers, and produce sub-optimal predictions.
  • Neptune ML uses the Deep Graph Library (DGL), an open-source library to which AWS contributes that makes it easy to apply deep learning to graph data. Neptune ML automates the heavy lifting of selecting and training the best ML model for graph data, and lets users run machine learning on their graph directly using Neptune APIs and queries. As a result, users can create, train, and apply ML on Amazon Neptune data in hours instead of weeks, without needing to learn new tools and ML technologies.

There is often valuable information in large connected datasets that can be hard to extract using queries based on human intuition alone. Machine learning (ML) techniques can help find hidden correlations in graphs with billions of relationships. These correlations can be helpful for recommending products, predicting credit worthiness, identifying fraud, and many other things. 

The Neptune ML feature makes it possible to build and train useful machine learning models on large graphs in hours instead of weeks. To accomplish this, Neptune ML uses graph neural network (GNN) technology powered by Amazon SageMaker and the open-source Deep Graph Library (DGL). Graph neural networks are an emerging field in artificial intelligence (see, for example, A Comprehensive Survey on Graph Neural Networks); for a hands-on tutorial about using GNNs with DGL, see Learning graph neural networks with Deep Graph Library. Neptune ML can train machine learning models to support three different categories of inference:

Node classification      

This task involves predicting the categorical feature of a vertex property. For example, given the movie The Shawshank Redemption, Neptune ML can predict its genre property as story from a candidate set of [story, crime, action, fantasy, drama, family, ...].

There are two types of node-classification tasks:

  • Single-class classification: In this kind of task, each node has only one target feature. For example, the property Place_of_birth of Alan Turing has the value UK.
  • Multi-class classification: In this kind of task, each node can have more than one target feature. For example, the property genre of the film The Godfather has the values crime and story.
Node regression 

This task involves predicting a numerical property of a vertex. For example, given the movie Avengers: Endgame, Neptune ML can predict that its property popularity has a value of 5.0.

Link prediction   

This task involves predicting the most likely destination nodes for a particular source node and outgoing edge, or the most likely source nodes for a given destination node and incoming edge. For example, with a drug-disease knowledge graph, given Aspirin as the source node and treats as the outgoing edge, Neptune ML can predict the most relevant destination nodes as heart disease, fever, and so on.

  • Or, with the Wikimedia knowledge graph, given President-of as the edge or relation and United-States as the destination node, Neptune ML can predict the most relevant source nodes as George Washington, Abraham Lincoln, Franklin D. Roosevelt, and so on.

With Neptune ML, you can use machine learning models that fall in two general categories:

Types of machine learning model currently supported by Neptune ML

  • Knowledge-Graph Embedding (KGE) models – These include TransE, DistMult, and RotatE models. They work only for link prediction.
  • Graph Neural Network (GNN) models – These include Relational Graph Convolutional Networks (R-GCNs). They work for all three task types.
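To make the link-prediction role of KGE models concrete, here is a toy sketch of the TransE scoring idea, where a triple (head, relation, tail) is considered plausible when head + relation lands close to tail in embedding space. The three-dimensional vectors are illustrative values, not trained embeddings.

```python
# Toy sketch of TransE scoring for link prediction. Real models learn
# high-dimensional embeddings; these 3-d vectors are made-up values.

def transe_score(head, relation, tail):
    """Negative L1 distance ||h + r - t||; a higher score means the
    triple (head, relation, tail) is more plausible."""
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))

aspirin = [0.1, 0.2, 0.3]
treats = [0.5, 0.0, -0.1]
fever = [0.6, 0.2, 0.2]        # aspirin + treats lands right on fever
unrelated = [9.0, 9.0, 9.0]    # far away in embedding space

# The plausible triple scores higher than the implausible one.
assert transe_score(aspirin, treats, fever) > transe_score(aspirin, treats, unrelated)
```

Ranking candidate tails by this score is what "predicting the most relevant destination nodes" amounts to for a KGE model.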

 

Neptune Streams

Amazon Neptune now supports Streams, an easy way to capture changes in your graph. When enabled, Neptune Streams logs changes to the graph (change-log data) as they happen. Streams are useful when you want to notify processes (for example, to trigger an AWS Lambda function) as changes occur in your graph. Streams can also be used to maintain a current version of the graph in a different Region or in another service, such as Amazon Elasticsearch Service, Amazon ElastiCache, or Amazon Simple Storage Service (Amazon S3).

Neptune Streams is available in lab mode in this release. Lab mode lets you test a feature before it is generally available. Neptune Streams can be enabled or disabled in lab mode using the DB cluster parameter neptune_lab_mode. Once the feature is enabled, users can access Neptune Streams by making HTTP GET requests to the REST APIs /sparql/streams or /gremlin/streams. The response is a JSON feed of the operations and the changes to the graph.
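As a sketch, such a polling request can be composed as shown below. The cluster endpoint is a placeholder, the default port 8182 and the query parameters (iteratorType, commitNum, opNum, limit) follow the Neptune Streams REST API, but treat the exact names as assumptions to verify against your engine version.

```python
from urllib.parse import urlencode

def build_stream_request(endpoint, graph="gremlin", commit_num=1, op_num=1, limit=10):
    """Build the GET URL for polling a Neptune change-log stream.

    The endpoint argument is a placeholder for a real cluster endpoint;
    parameter names here are assumptions based on the Streams REST API.
    """
    params = urlencode({
        "iteratorType": "AT_SEQUENCE_NUMBER",  # start at an exact (commitNum, opNum)
        "commitNum": commit_num,               # transaction sequence number
        "opNum": op_num,                       # operation within that transaction
        "limit": limit,                        # max change records per response
    })
    return f"https://{endpoint}:8182/{graph}/streams?{params}"

url = build_stream_request("my-cluster.cluster-abc.us-east-1.neptune.amazonaws.com")
```

An HTTP client would then issue a GET against this URL and parse the JSON feed of change records from the response body.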

When Neptune Streams is enabled, users incur I/O and storage charges associated with the change-log data. Change records are purged automatically a week after they are created. Users can try out Neptune Streams today; refer to the Amazon Neptune User Guide on Streams for more details.

Neptune Streams logs every change to your graph as it happens, in the order in which it is made, in a fully managed way. Once Streams is enabled, Neptune takes care of availability, backup, security, and expiry.

The following are some of the many use cases where you might want to capture changes to a graph as they occur:

  • For an application to notify people automatically when certain changes are made to the graph.
  • To maintain a current version of your graph data in another data store, such as Amazon Elasticsearch Service, Amazon ElastiCache, or Amazon Simple Storage Service (Amazon S3).

Neptune uses the same native storage for the change-log stream as for graph data. It writes change log entries synchronously together with the transaction that makes those changes. Users retrieve these change records from the log stream using an HTTP REST API. 

Neptune Streams Guarantees

  • Changes made by a transaction are available for reading from both the writer and the readers as soon as the transaction is complete (aside from any normal replication lag on readers).
  • Change records appear strictly sequentially, in the order in which they occurred (including the changes made within a transaction).
  • The change streams contain no duplicates. Each change is logged only once.
  • The change streams are complete. No changes are lost or omitted.
  • The change streams contain all the information needed to determine the complete state of the database at any point in time, provided that the starting state is known.
  • Streams can be turned on or off at any time.

Amazon Neptune uses two different formats for serializing graph-changes data to log streams, depending on whether the graph was created using Gremlin or SPARQL.

SPARQL federated query

Neptune now supports SPARQL 1.1 federated query. Using the SPARQL 1.1 SERVICE keyword, customers can execute portions of a query against different SPARQL endpoints within their Virtual Private Cloud (VPC), combine the results, and return them to the user.

For example, if a query on the person dataset in one cluster retrieves persons with the first name “John”, and a second query on another cluster retrieves persons with the last name “Abercrombie”, users can federate the queries using SPARQL 1.1 and retrieve persons with the first name “John” and the last name “Abercrombie”.
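A sketch of what such a federated query could look like; the second cluster's endpoint URL and the FOAF property names are illustrative assumptions, not taken from a real dataset.

```python
# Sketch of a SPARQL 1.1 federated query like the one described above,
# held as a string so it can be submitted to the first cluster's SPARQL
# endpoint. The SERVICE URL and the foaf:* property names are assumptions.
FEDERATED_QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person WHERE {
    # Evaluated on the cluster that receives the query: first name "John"
    ?person foaf:firstName "John" .
    # Delegated to a second SPARQL endpoint inside the same VPC
    SERVICE <https://second-cluster.example.com:8182/sparql> {
        ?person foaf:lastName "Abercrombie" .
    }
}
""".strip()
```

The receiving endpoint evaluates the outer pattern, sends the SERVICE block to the remote endpoint, and joins the two result sets on ?person.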

  • Neptune logs changes to SPARQL quads in the graph using the Resource Description Framework (RDF) N-QUADS language defined in the W3C RDF 1.1 N-Quads specification.
Gremlin JSON Change Serialization Format

A Gremlin change record, contained in the data field of a log stream response, has the following fields:

  • id – String, required. The ID of the Gremlin element.
  • type – String, required. The type of this Gremlin element. Must be one of the following:
    • vl – Vertex label.
    • vp – Vertex properties.
    • e – Edge, and also edge label.
    • ep – Edge properties.
  • key – String, required. The property name. For element labels, this is "label".
  • value – Value object, required. This is a JSON object that contains a value field for the value itself, and a datatype field for the JSON data type of that value.
  • from – String, optional. If this is an edge (type="e"), the ID of the corresponding from vertex.
  • to – String, optional. If this is an edge (type="e"), the ID of the corresponding to vertex.
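As a sketch, a stream consumer might validate incoming change records against the field list above before applying them. The sample record is a hypothetical edge change, not output captured from a real stream.

```python
# Validate a Gremlin change record against the documented field list.
REQUIRED_FIELDS = {"id", "type", "key", "value"}
VALID_TYPES = {"vl", "vp", "e", "ep"}  # vertex label/props, edge, edge props

def validate_change_record(record):
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if record["type"] not in VALID_TYPES:
        raise ValueError(f"unknown element type: {record['type']}")
    # "from" and "to" only apply to edge records (type="e")
    if record["type"] == "e" and not {"from", "to"} <= record.keys():
        raise ValueError("edge records should carry from/to vertex IDs")
    return True

# Hypothetical edge change: an Aspirin-treats-fever edge being labeled.
sample = {
    "id": "e-1", "type": "e", "key": "label",
    "value": {"value": "treats", "datatype": "String"},
    "from": "v-aspirin", "to": "v-fever",
}
validate_change_record(sample)
```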

Data Warehouse Technology Options

There are several options available for building a data warehouse: row-oriented databases, column-oriented databases, and massively parallel processing (MPP) architectures.

Row-Oriented Databases

Row-oriented databases typically store whole rows in a physical block. High performance for read operations is achieved through secondary indexes. Databases such as Oracle Database Server, Microsoft SQL Server, MySQL, and PostgreSQL are row-oriented database systems. These systems have been traditionally used for data warehousing, but they are better suited for transactional processing (OLTP) than for analytics.

  • To optimize the performance of a row-based system used as a data warehouse, developers use a number of techniques, including building materialized views, creating pre-aggregated rollup tables, building indexes on every possible predicate combination, partitioning data so the query optimizer can prune partitions, and performing index-based joins.
  • Traditional row-based data stores are limited by the resources available on a single machine. Data marts alleviate the problem to an extent by using functional sharding. Users can split the data warehouse into multiple data marts, each satisfying a specific functional area. However, when data marts grow large over time, data processing slows down.
  • In a row-based data warehouse, every query has to read through all of the columns for all of the rows in the blocks that satisfy the query predicate, including columns the query did not select. This approach creates a significant performance bottleneck in data warehouses, where tables often have many columns but queries touch only a few.
Column-Oriented Databases

Column-oriented databases organize each column in its own set of physical blocks instead of packing whole rows into a block. This functionality allows them to be more I/O efficient for read-only queries, because they only have to read the columns accessed by a query from disk (or from memory). This approach makes column-oriented databases a better choice than row-oriented databases for data warehousing.

Row-Oriented vs. Column-Oriented Databases

The figure above illustrates the primary difference between row-oriented and column-oriented databases: rows are packed into their own blocks in a row-oriented database, and columns are packed into their own blocks in a column-oriented database.

  • After faster I/O, the next biggest benefit to using a column-oriented database is improved compression. Because every column is packed into its own set of blocks, every physical block contains the same data type. When all the data is the same data type, the database can use extremely efficient compression algorithms. As a result, you need less storage compared to a row-oriented database.
  • This approach results in significantly less I/O because the same data is stored in fewer blocks. Some column-oriented databases used for data warehousing include Amazon Redshift, Vertica, Teradata Aster, and Druid.
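A toy illustration of the compression benefit described above (not any real database's storage engine): values in a single column block share one data type and often repeat, so a simple scheme such as run-length encoding (RLE) shrinks them well.

```python
# Run-length encoding, one of the simple schemes that column stores
# can apply because each physical block holds values of a single type.

def run_length_encode(column):
    """Compress a column of values into [value, run_length] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return runs

# A low-cardinality column, as often found in warehouse fact tables.
country = ["US", "US", "US", "DE", "DE", "US", "US"]
encoded = run_length_encode(country)
# 7 stored values collapse to 3 runs: [["US", 3], ["DE", 2], ["US", 2]]
```

In a row-oriented layout the same "US" values would be interleaved with other columns' data, so no comparable run would exist to compress.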
Massively Parallel Processing Architectures

An MPP architecture allows users to use all of the resources available in the cluster for processing data, dramatically increasing the performance of petabyte-scale data warehouses. MPP data warehouses let users improve performance by simply adding more nodes to the cluster.

  • Amazon Redshift, Druid, Vertica, GreenPlum, and Teradata Aster are some of the data warehouses built on an MPP architecture. Open source frameworks such as Hadoop and Spark also support MPP.

Monitoring

Amazon Neptune supports various methods for monitoring performance and usage:

  • Instance status – Check the health of a Neptune cluster's graph database engine, find out which version of the engine is installed, and obtain other instance-related information using the instance status API.
  • Amazon CloudWatch – Neptune automatically sends metrics to CloudWatch and also supports CloudWatch Alarms.
  • Audit log files – View, download, or watch database log files using the Neptune console.
  • Publishing logs to Amazon CloudWatch Logs – Users can configure a Neptune DB cluster to publish audit log data to a log group in Amazon CloudWatch Logs. With CloudWatch Logs, users can perform real-time analysis of the log data, use CloudWatch to create alarms and view metrics, and use CloudWatch Logs to store log records in highly durable storage.
  • AWS CloudTrail – Neptune supports API logging using CloudTrail.
  • Event notification subscriptions – Subscribe to Neptune events to stay informed about what is happening.
  • Tagging – Use tags to add metadata to Neptune resources and track usage based on tags.

Amazon Neptune and Amazon CloudWatch are integrated so that users can gather and analyze performance metrics. Users can monitor these metrics using the CloudWatch console, the AWS Command Line Interface (AWS CLI), or the CloudWatch API.

  • CloudWatch also lets users set alarms so that they can be notified if a metric value breaches a specified threshold. Users can even set up CloudWatch Events to take corrective action if a breach occurs. For more information about using CloudWatch and alarms, see the Amazon CloudWatch documentation.

Log files are in UTF-8 format. Logs are written in multiple files, the number of which varies based on the instance size. To see the latest events, users might have to review all the audit log files.

  • Log entries are not in sequential order. You can use the timestamp value for ordering them.
  • Log files are rotated when they reach 100 MB in aggregate. This limit is not configurable.
  • To audit Amazon Neptune DB cluster activity, enable the collection of audit logs by setting a DB cluster parameter. When audit logs are enabled, they can record any combination of supported events, and users can view or download the audit logs to review them.
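Since log entries are not written sequentially, a reviewer would typically merge the files and order the entries by timestamp. A minimal sketch, assuming a simplified entry layout rather than Neptune's exact audit-log format:

```python
# Order audit-log entries by timestamp before review. The entry shape
# below is a simplified assumption, not Neptune's actual log format.
from datetime import datetime

entries = [
    {"timestamp": "2020-01-01T12:00:05Z", "event": "QUERY"},
    {"timestamp": "2020-01-01T11:59:58Z", "event": "CONNECT"},
    {"timestamp": "2020-01-01T12:00:01Z", "event": "AUTH"},
]

def by_timestamp(entry):
    # ISO-8601 timestamps; strip the trailing "Z" for fromisoformat()
    return datetime.fromisoformat(entry["timestamp"].rstrip("Z"))

ordered = sorted(entries, key=by_timestamp)
# ordered events: CONNECT, AUTH, QUERY
```

In practice the same sort key would be applied after concatenating entries from all rotated log files, since the latest events can appear in any of them.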

Amazon Neptune is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in Neptune. CloudTrail captures API calls for Neptune as events, including calls from the Neptune console and from code calls to the Neptune APIs. CloudTrail only logs events for Neptune Management API calls, such as creating an instance or cluster. 

  • CloudTrail is enabled on the AWS account when you create the account. When activity occurs in Amazon Neptune, that activity is recorded in a CloudTrail event along with other AWS service events in Event history. Users can view, search, and download recent events in the AWS account. 
  • For an ongoing record of events in the AWS account, including events for Neptune, create a trail. A trail enables CloudTrail to deliver log files to an Amazon S3 bucket. By default, when creating a trail in the console, the trail applies to all Regions. The trail logs events from all Regions in the AWS partition and delivers the log files to the Amazon S3 bucket that was specified. Additionally, users can configure other AWS services to further analyze and act upon the event data collected in CloudTrail logs.

Amazon Neptune uses Amazon Simple Notification Service (Amazon SNS) to provide notifications when a Neptune event occurs. These notifications can be in any form that is supported by Amazon SNS for an AWS Region, such as an email, a text message, or a call to an HTTP endpoint.

  • Neptune groups these events into categories that users can subscribe to, so that they are notified when an event in that category occurs. Users can subscribe to an event category for a DB instance, DB cluster, DB snapshot, DB cluster snapshot, or DB parameter group. For example, when subscribing to the Backup category for a given DB instance, you are notified whenever a backup-related event occurs that affects that DB instance. Users also receive a notification when an event notification subscription changes.
 

Amazon Neptune pricing

On-Demand Instance Pricing

On-Demand Instances let you pay for your database by the hour with no long-term commitments or upfront fees. This frees users from the cost and complexity of planning and purchasing database capacity ahead of their needs. On-Demand pricing lets users pay as they go and is ideal for development, test, and other short-lived workloads.

  • Instance pricing applies to both primary instances, used for read-write workloads, and Amazon Neptune replicas, used to scale reads and enhance failover. Neptune uses a Multi-AZ architecture to fail over to one of your replicas if an outage occurs. The cost of a Multi-AZ deployment is simply the cost of the primary instance plus the cost of each Neptune replica. To maximize availability, we recommend placing at least one replica in a different Availability Zone from the primary instance.
Database Storage and IOs

Storage consumed by your Amazon Neptune database is billed in per-GB-month increments, and IOs consumed are billed in per-million-request increments. Users pay only for the storage and IOs the Neptune database consumes and do not need to provision either in advance.
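As a rough sketch of how this metering adds up, the estimate below uses placeholder rates; the actual per-GB-month and per-million-request rates are published on the Neptune pricing page.

```python
# Hedged cost sketch for Neptune storage and I/O metering. The rates
# are illustrative placeholders, not real Neptune pricing.
STORAGE_RATE_PER_GB_MONTH = 0.10   # assumed rate, USD
IO_RATE_PER_MILLION = 0.20         # assumed rate, USD

def monthly_storage_io_cost(storage_gb, io_requests):
    """Estimate one month of storage + I/O charges."""
    storage_cost = storage_gb * STORAGE_RATE_PER_GB_MONTH
    io_cost = (io_requests / 1_000_000) * IO_RATE_PER_MILLION
    return round(storage_cost + io_cost, 2)

# 500 GB stored and 50 million I/O requests in a month:
cost = monthly_storage_io_cost(500, 50_000_000)
# 500 * 0.10 + 50 * 0.20 = 50.0 + 10.0 = 60.0
```

Because billing is metered this way, shrinking either stored data or request volume reduces the bill directly, with no capacity to de-provision.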

Backup storage

Backup storage for Amazon Neptune is the storage associated with automated database backups and any customer-initiated database cluster snapshots. Increasing the backup retention period or taking database cluster snapshots increases the backup storage consumed.

  • Backup storage is allocated by region. Total backup storage space is equivalent to the sum of the storage for all backups in that region.
  • Moving a database snapshot to another region increases allocated backup storage in the destination region.
  • There is no additional charge for backup storage of up to 100% of the total Neptune database storage for each Neptune database cluster. There is also no additional charge for backup storage if the backup retention period is 1 day and you don't have any snapshots beyond the retention period.
  • Backup storage, as well as snapshots kept after the database cluster is deleted, is billed at the backup storage rates.

A Neptune DB cluster also includes a cluster volume, where Amazon Neptune data is stored. The cluster volume is designed for reliability and high availability: it consists of copies of the data across multiple Availability Zones in a single AWS Region. Because data is automatically replicated across Availability Zones, it is highly durable, and there is little possibility of data loss.