Amazon Neptune, Hands On

Arcade is proud to be the first graph visualization tool able to connect to Amazon Neptune!

This article shows how to load a sample dataset into an Amazon Neptune graph database, then connect Arcade to Neptune and dive into the data. The dataset is Rolling Stone's list of the top 500 albums, reshaped to fit a graph database.

Amazon Neptune

What’s Neptune? From the website:

“Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. The core of Amazon Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with milliseconds latency. Amazon Neptune supports popular graph models Property Graph and W3C’s RDF, and their respective query languages Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. Neptune powers graph use cases such as recommendation engines, fraud detection, knowledge graphs, drug discovery, and network security.”

For the time being, Arcade uses the Apache TinkerPop Gremlin language to interact with Neptune. In this post we show how we created a Neptune cluster following the documentation provided by Amazon, loaded a dataset, and connected Arcade to discover insights.

Set up the cluster

To create a Neptune cluster, just follow the documentation to set it up and configure the security groups so that a micro EC2 instance can connect to it. The starting point is the getting started guide in the official documentation. At the end of the setup process you have a Neptune cluster and, in the same VPC, an EC2 instance that can reach it.

As a short to-do list:

  1. Launch a cluster
  2. Launch EC2
  3. Connect to the console of your EC2 instance and test the connection to the cluster:
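A minimal way to do that is to hit the cluster's status endpoint. The sketch below assumes Python 3 with the requests library on the EC2 instance; the endpoint value is a placeholder to replace with your own cluster endpoint.

import requests

# Placeholder: replace with your own Neptune cluster endpoint.
NEPTUNE = "https://your-cluster.cluster-xxxxxxxxxxxx.us-east-1.neptune.amazonaws.com:8182"

# The /status endpoint answers only if the instance can reach the cluster,
# and returns the engine status and version.
print(requests.get(f"{NEPTUNE}/status").json())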

Load data

Loading data into Amazon Neptune can be tricky. A good place to start is the AWS samples repository on GitHub, where the main steps are outlined:

  1. Create one or more files using the CSV format
  2. Upload them to an S3 bucket in the same region, reachable from the cluster's VPC
  3. Call the load entry point on Neptune

In the original dataset, each line contains the number of the album, the year of publication, the title, the name of the artist, the genres and the sub-genres.

Number,Year,Album,Artist,Genre,Subgenre
1,1967,Sgt. Pepper's Lonely Hearts Club Band,The Beatles,Rock,"Rock & Roll, Psychedelic Rock"
2,1966,Pet Sounds,The Beach Boys,Rock,"Pop Rock, Psychedelic Rock"
3,1966,Revolver,The Beatles,Rock,"Psychedelic Rock, Pop Rock"
4,1965,Highway 61 Revisited,Bob Dylan,Rock,"Folk Rock, Blues Rock"

First of all, the file was parsed to extract distinct years, artists, genres and sub-genres.

The output format is the Gremlin load data format.

The id is generated while parsing and splitting the original file, and it is composed of a prefix, derived from the file, plus the content: years have y_<year> as id, artists a_<artist name>, genres g_<genre name>, sub-genres s_<subgenre name>, and so on.
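This is not the exact script we used, but a minimal Python sketch of the transformation: it reads the original CSV (the file name here is hypothetical) and derives the prefixed ids by lower-casing each value and replacing spaces with underscores.

import csv

def make_id(prefix, value):
    # e.g. ("a", "The Beatles") -> "a_the_beatles"
    return f"{prefix}_{value.strip().lower().replace(' ', '_')}"

years, artists, genres, subgenres = set(), set(), set(), set()
with open("albumlist.csv", newline="", encoding="utf-8") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        years.add((f"y_{row['Year']}", row["Year"]))
        artists.add((make_id("a", row["Artist"]), row["Artist"]))
        genres.update((make_id("g", g), g.strip()) for g in row["Genre"].split(","))
        subgenres.update((make_id("s", s), s.strip()) for s in row["Subgenre"].split(","))

# Each set now holds (~id, name) pairs, ready to be written out as
# Year.csv, Artist.csv, Genre.csv and Subgenre.csv in the Gremlin load format.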

Year.csv:

~id,~label,name
y_1967,year,1967
y_1966,year,1966
y_1965,year,1965
...

Artist.csv:

~id,~label,name
a_the_beatles,artist,The Beatles
a_the_beach_boys,artist,The Beach Boys
a_bob_dylan,artist,Bob Dylan
...

Genre.csv:

~id,~label,name
g_rock,genre,Rock
g_pop,genre,Pop
g_funk_/_soul,genre,Funk / Soul
...

Subgenre.csv:

~id,~label,name
s_rock_&_roll,subgenre,Rock & Roll
s_psychedelic_rock,subgenre,Psychedelic Rock
s_pop_rock,subgenre,Pop Rock
s_folk_rock,subgenre,Folk Rock
...

Then there’s the albums file, where the year is also kept as a property value:

~id,~label,title,year,number
a_1,album,Sgt. Pepper's Lonely Hearts Club Band,1967,1
a_2,album,Pet Sounds,1966,2
a_3,album,Revolver,1966,3
a_4,album,Highway 61 Revisited,1965,4
...

The edges file contains the relations from each album to its year, artist, genre and sub-genre nodes:

~id,~from,~to,~label
1_1967,a_1,y_1967,hasYear
1_athe_beatles,a_1,a_the_beatles,hasArtist
1_grock,a_1,g_rock,hasGenre
1_srock_&_roll,a_1,s_rock_&_roll,hasSubgenre
1_spsychedelic_rock,a_1,s_psychedelic_rock,hasSubgenre
2_1966,a_2,y_1966,hasYear
2_athe_beach_boys,a_2,a_the_beach_boys,hasArtist
2_grock,a_2,g_rock,hasGenre
2_spop_rock,a_2,s_pop_rock,hasSubgenre
2_spsychedelic_rock,a_2,s_psychedelic_rock,hasSubgenre
3_1966,a_3,y_1966,hasYear
3_athe_beatles,a_3,a_the_beatles,hasArtist
3_grock,a_3,g_rock,hasGenre
3_spsychedelic_rock,a_3,s_psychedelic_rock,hasSubgenre
3_spop_rock,a_3,s_pop_rock,hasSubgenre
...

The files are available in our area on data.world.

Once the files are ready, Neptune can load them from an S3 bucket. The following operations are executed from the EC2 instance.

Upload the files to an S3 bucket in the same region and call the loader API:
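For reference, this is roughly what the loader call looks like when issued with Python and the requests library from the EC2 instance; the endpoint, bucket and IAM role ARN below are placeholders to replace with your own values.

import requests

NEPTUNE = "https://your-cluster.cluster-xxxxxxxxxxxx.us-east-1.neptune.amazonaws.com:8182"

# Ask Neptune to bulk-load every CSV file found under the S3 prefix.
response = requests.post(f"{NEPTUNE}/loader", json={
    "source": "s3://your-bucket/rolling-stone-top-500/",               # placeholder bucket/prefix
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",  # placeholder role
    "region": "us-east-1",
    "failOnError": "FALSE",
})
print(response.json())  # the response payload contains the loadId of the job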


Neptune will load the files from your bucket. The loadId can be used to check the status of the load process:
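A sketch of that check, with the same assumptions as above (Python, requests, placeholder endpoint):

import requests

NEPTUNE = "https://your-cluster.cluster-xxxxxxxxxxxx.us-east-1.neptune.amazonaws.com:8182"
load_id = "your-load-id"  # placeholder: use the loadId returned by the loader call

# The status moves from LOAD_IN_PROGRESS to LOAD_COMPLETED (or LOAD_FAILED).
print(requests.get(f"{NEPTUNE}/loader/{load_id}").json())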

Let’s execute a query just to count the number of vertices loaded:
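The query can be sent to the HTTP Gremlin endpoint; as before, the endpoint is a placeholder and the sketch assumes Python with requests.

import requests

NEPTUNE = "https://your-cluster.cluster-xxxxxxxxxxxx.us-east-1.neptune.amazonaws.com:8182"

# Count all the vertices loaded so far.
print(requests.post(f"{NEPTUNE}/gremlin", json={"gremlin": "g.V().count()"}).json())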

Connect Arcade

Arcade Analytics can be connected to Amazon Neptune through an SSH tunnel. Just add our public key to the authorized_keys file on the EC2 instance and configure the connection:

 

Play with data

You can easily play with this dataset on Arcade: just ask for a demo account. On the second dashboard of the demo user, there’s a widget connected to our Neptune cluster. The exploration starts from the year 1970 and traverses the hasYear relation to fetch all the albums published in that year.

Then we can select all the albums and traverse the hasArtist relation.
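For readers who prefer the query view, these two exploration steps correspond roughly to the Gremlin traversal below, sent here through the HTTP endpoint (placeholder endpoint, ids as described above).

import requests

NEPTUNE = "https://your-cluster.cluster-xxxxxxxxxxxx.us-east-1.neptune.amazonaws.com:8182"

# Albums published in 1970, then the artists of those albums:
# edges go from album to year (hasYear) and from album to artist (hasArtist).
query = "g.V('y_1970').in('hasYear').out('hasArtist').dedup().values('name')"
print(requests.post(f"{NEPTUNE}/gremlin", json={"gremlin": query}).json())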

From this starting point there are many opportunities. In the image below, we selected Elton John, traversed the incoming hasArtist edges to find his other albums, and then the years of each album. We did the same with Van Morrison, and then expanded the hasYear relation to fetch the 21 albums published in 1968.

 

Request your demo account and start playing with your own database on Amazon Neptune.