Arcade is proud to be the first graph visualization tool able to connect to Amazon Neptune!
This article shows how to load a sample dataset into our Amazon Neptune graph database, then connect Arcade to Neptune and dive into the dataset. The dataset is Rolling Stone’s top 500 albums, adapted so that it can be loaded into a graph database.
What’s Neptune? From the website:
“Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. The core of Amazon Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with milliseconds latency. Amazon Neptune supports popular graph models Property Graph and W3C’s RDF, and their respective query languages Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. Neptune powers graph use cases such as recommendation engines, fraud detection, knowledge graphs, drug discovery, and network security.”
For the time being, Arcade uses the Apache TinkerPop Gremlin language to interact with Neptune. In this post we show how we created a Neptune cluster following the documentation provided by Amazon, loaded a dataset, and connected Arcade to discover insights.
Set up the cluster
To create a Neptune cluster, follow the documentation to set it up and configure security so that you can connect from a micro EC2 instance. The starting point is the getting started guide in the official documentation. At the end of the setup process, what you have is:
As a short todo list:
- Launch a cluster
- Launch an EC2 instance
- Connect to the console of your EC2 instance and test the connection to the cluster:
Loading data into Amazon Neptune can be tricky. A good place to start is the AWS sample repository on GitHub, where the main steps are outlined:
- Create one or more files using the CSV format
- Upload them to an S3 bucket in the same region as the cluster
- Call the load entry point on Neptune
In the original dataset, each line contains the number of the album, the year of publication, the title, the name of the artist, the genres and the sub-genres.
Number,Year,Album,Artist,Genre,Subgenre
1,1967,Sgt. Pepper's Lonely Hearts Club Band,The Beatles,Rock,"Rock & Roll, Psychedelic Rock"
2,1966,Pet Sounds,The Beach Boys,Rock,"Pop Rock, Psychedelic Rock"
3,1966,Revolver,The Beatles,Rock,"Psychedelic Rock, Pop Rock"
4,1965,Highway 61 Revisited,Bob Dylan,Rock,"Folk Rock, Blues Rock"
First of all, the file was parsed to extract distinct years, artists, genres and sub-genres.
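The parsing step could look roughly like the sketch below. It is not the script we actually used: the function names are hypothetical, the sample rows are inlined so the snippet runs on its own, and it assumes that a cell holding several genres or sub-genres separates them with commas.

```python
import csv
import io

# A few rows from the original dataset, inlined so the sketch is self-contained.
SAMPLE = '''Number,Year,Album,Artist,Genre,Subgenre
1,1967,Sgt. Pepper's Lonely Hearts Club Band,The Beatles,Rock,"Rock & Roll, Psychedelic Rock"
2,1966,Pet Sounds,The Beach Boys,Rock,"Pop Rock, Psychedelic Rock"
3,1966,Revolver,The Beatles,Rock,"Psychedelic Rock, Pop Rock"
4,1965,Highway 61 Revisited,Bob Dylan,Rock,"Folk Rock, Blues Rock"
'''


def split_values(cell):
    """Split a multi-valued cell such as "Pop Rock, Psychedelic Rock"."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def distinct_values(rows):
    """Collect distinct years, artists, genres and sub-genres in first-seen order."""
    found = {"year": [], "artist": [], "genre": [], "subgenre": []}

    def remember(kind, value):
        if value not in found[kind]:
            found[kind].append(value)

    for row in rows:
        remember("year", row["Year"])
        remember("artist", row["Artist"])
        for genre in split_values(row["Genre"]):
            remember("genre", genre)
        for subgenre in split_values(row["Subgenre"]):
            remember("subgenre", subgenre)
    return found


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
values = distinct_values(rows)
for kind, names in values.items():
    print(kind, names)
```

Each list then becomes one vertex file, one row per distinct value.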
The output format is the Gremlin load data format.
The id is generated while parsing and splitting the original file, and it is composed of a prefix, derived from the file, and the content. Years have y_<YEAR> as id, genres have g_<genre name>, sub-genres have s_<sub-genre name>, and so on.
~id,~label,name
y_1967,year,1967
y_1966,year,1966
y_1965,year,1965
...

~id,~label,name
a_the_beatles,artist,The Beatles
a_the_beach_boys,artist,The Beach Boys
a_bob_dylan,artist,Bob Dylan
...

~id,~label,name
g_rock,genre,Rock
g_pop,genre,Pop
g_funk_/_soul,genre,Funk / Soul
...

~id,~label,name
s_rock_&_roll,subgenre,Rock & Roll
s_psychedelic_rock,subgenre,Psychedelic Rock
s_pop_rock,subgenre,Pop Rock
s_folk_rock,subgenre,Folk Rock
....
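The id scheme visible in these samples can be reproduced with a small helper. This is a sketch inferred from the sample rows (lowercase the name, replace spaces with underscores, prepend the prefix), not the original script; `make_id` and `vertex_csv` are hypothetical names.

```python
import csv
import io


def make_id(prefix, name):
    """Build an id such as s_rock_&_roll from a prefix and a display name."""
    return f"{prefix}_{name.lower().replace(' ', '_')}"


def vertex_csv(prefix, label, names):
    """Render one vertex file with the ~id, ~label and name columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["~id", "~label", "name"])
    for name in names:
        writer.writerow([make_id(prefix, name), label, name])
    return out.getvalue()


genre_file = vertex_csv("g", "genre", ["Rock", "Pop", "Funk / Soul"])
print(genre_file)
```

Using the `csv` module rather than string concatenation means a name that contains a comma would be quoted automatically.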
Then there’s the albums file, where the year is kept as a property value:
~id,~label,title,year,number
a_1,album,Sgt. Pepper's Lonely Hearts Club Band,1967,1
a_2,album,Pet Sounds,1966,2
a_3,album,Revolver,1966,3
a_4,album,Highway 61 Revisited,1965,4
....
The edges file contains the relations from each album to its year, artist, genre and sub-genre nodes:
~id,~from,~to,~label
1_1967,a_1,y_1967,hasYear
1_athe_beatles,a_1,a_the_beatles,hasArtist
1_grock,a_1,g_rock,hasGenre
1_srock_&_roll,a_1,s_rock_&_roll,hasSubgenre
1_spsychedelic_rock,a_1,s_psychedelic_rock,hasSubgenre
2_1966,a_2,y_1966,hasYear
2_athe_beach_boys,a_2,a_the_beach_boys,hasArtist
2_grock,a_2,g_rock,hasGenre
2_spop_rock,a_2,s_pop_rock,hasSubgenre
2_spsychedelic_rock,a_2,s_psychedelic_rock,hasSubgenre
3_1966,a_3,y_1966,hasYear
3_athe_beatles,a_3,a_the_beatles,hasArtist
3_grock,a_3,g_rock,hasGenre
3_spsychedelic_rock,a_3,s_psychedelic_rock,hasSubgenre
3_spop_rock,a_3,s_pop_rock,hasSubgenre
...
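As an illustration, the edge rows for one album can be derived as in the sketch below. The edge id pattern (album number, underscore, then the prefix letter glued to the name) is inferred from the sample rows above; `slug` and `album_edges` are hypothetical helpers, not the code we ran.

```python
def slug(name):
    """Lowercase a name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


def album_edges(number, year, artist, genres, subgenres):
    """Return (~id, ~from, ~to, ~label) tuples for a single album."""
    album = f"a_{number}"
    edges = [
        (f"{number}_{year}", album, f"y_{year}", "hasYear"),
        (f"{number}_a{slug(artist)}", album, f"a_{slug(artist)}", "hasArtist"),
    ]
    for genre in genres:
        edges.append((f"{number}_g{slug(genre)}", album, f"g_{slug(genre)}", "hasGenre"))
    for subgenre in subgenres:
        edges.append(
            (f"{number}_s{slug(subgenre)}", album, f"s_{slug(subgenre)}", "hasSubgenre")
        )
    return edges


edges = album_edges(1, 1967, "The Beatles", ["Rock"], ["Rock & Roll", "Psychedelic Rock"])
for edge in edges:
    print(",".join(edge))
```

Every edge starts at the album vertex, so albums are reached from a year, artist or genre by following incoming edges.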
The files are available on data.world in our area.
Once the files are ready, Neptune can load them from an S3 bucket. These operations are executed from the EC2 instance.
Upload the files to an S3 bucket in the same region and use the loader API:
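A load request is an HTTP POST to the cluster's `/loader` endpoint. The sketch below only builds the request. The endpoint, bucket URI, role ARN and region are placeholders to replace with your own values, and it assumes the default port 8182 and an HTTPS endpoint.

```python
import json
import urllib.request


def build_load_request(endpoint, bucket_uri, iam_role_arn, region, port=8182):
    """Build (but do not send) a bulk-load request for the Neptune loader."""
    body = {
        "source": bucket_uri,        # S3 folder or file holding the CSV files
        "format": "csv",             # the Gremlin CSV load format
        "iamRoleArn": iam_role_arn,  # role that lets Neptune read the bucket
        "region": region,            # region of the bucket
        "failOnError": "FALSE",      # keep loading when a row is rejected
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:{port}/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: none of these identify a real resource.
load_request = build_load_request(
    endpoint="your-neptune-endpoint",
    bucket_uri="s3://your-bucket/albums/",
    iam_role_arn="arn:aws:iam::123456789012:role/YourNeptuneLoadRole",
    region="us-east-1",
)
print(load_request.get_method(), load_request.full_url)
# From the EC2 instance, urllib.request.urlopen(load_request) would send it.
```

The JSON response to a successful request carries the loadId in its payload.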
Neptune will load the files from your bucket. The loadId can be used to check the status of the load process:
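Checking the status is a GET on `/loader/<loadId>`. The sketch below builds that URL and reads the overall status out of a response body. The endpoint and load id are placeholders, and the example response is a hypothetical, heavily trimmed body that follows the field names in the loader documentation, not output captured from our cluster.

```python
import json


def status_url(endpoint, load_id, port=8182):
    """URL to poll for the status of a bulk-load job."""
    return f"https://{endpoint}:{port}/loader/{load_id}"


def load_state(response_text):
    """Extract the overall status string from a loader status response."""
    payload = json.loads(response_text)["payload"]
    return payload["overallStatus"]["status"]


# Hypothetical, trimmed response used only to exercise the parser.
EXAMPLE_RESPONSE = (
    '{"status": "200 OK", '
    '"payload": {"overallStatus": {"status": "LOAD_COMPLETED"}}}'
)

url = status_url("your-neptune-endpoint", "your-load-id")
state = load_state(EXAMPLE_RESPONSE)
print(url)
print(state)
```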
Let’s execute a query just to count the number of vertices loaded:
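One way to run such a query is Neptune's HTTP endpoint for Gremlin, which accepts a JSON body with a `gremlin` field. The sketch below builds the request for `g.V().count()`. The endpoint name is a placeholder, and the default port 8182 and HTTPS are assumed.

```python
import json
import urllib.request


def gremlin_request(endpoint, query, port=8182):
    """Build (but do not send) a Gremlin query request for the HTTP endpoint."""
    return urllib.request.Request(
        url=f"https://{endpoint}:{port}/gremlin",
        data=json.dumps({"gremlin": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


count_request = gremlin_request("your-neptune-endpoint", "g.V().count()")
print(count_request.full_url)
print(count_request.data.decode("utf-8"))
# From the EC2 instance, urllib.request.urlopen(count_request) would run it.
```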
Arcade Analytics can be connected to Amazon Neptune through an SSH tunnel. Just add our public key to the authorized_keys file on the EC2 instance and configure the connection:
Play with data
You can easily play with this dataset on Arcade: just ask for a demo account. On the second dashboard of the demo user, there’s a widget connected to our Neptune cluster. The exploration starts from the year 1970 and traverses the hasYear relation to fetch all the albums published in that year.
Then we can select all the albums and traverse the hasArtist relation.
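For readers who prefer queries to clicks, these two exploration steps correspond roughly to the Gremlin traversals built below. This is an approximation based on the ids and edge labels of the loaded files, not the queries Arcade issues internally; the helper names are hypothetical.

```python
def albums_of_year(year):
    """Traversal from a year vertex to the albums pointing at it via hasYear."""
    return f"g.V('y_{year}').in('hasYear')"


def artists_of_year(year):
    """Extend the traversal from those albums to their distinct artists."""
    return albums_of_year(year) + ".out('hasArtist').dedup()"


albums_query = albums_of_year(1970)
artists_query = artists_of_year(1970)
print(albums_query)
print(artists_query)
```

The first step uses `in` because every edge in the dataset starts at the album vertex.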
From this starting point, there are many directions to explore. In the image below, we selected Elton John, traversed the incoming hasArtist edges to find his other albums, and then fetched the year of each album. We did the same with Van Morrison and then expanded the hasYear relation to fetch the 21 albums published in 1968.
Request your demo account and start playing with your own database on Amazon Neptune.