Arcade uses ETL to convert the Paradise Papers from Neo4j to OrientDB

Paradise on OrientDB

At Arcade we love data, but sometimes data doesn’t love us. Data comes in various formats and these formats are not always compatible. To help us with that problem we use some custom ETL (Extract, Transform and Load). In this article you can see exactly how we use this process at Arcade!

Here at Arcade we are required to use the ETL process from time to time. To showcase our product we use multiple pre-built databases. For our most recent use case we installed the Neo4j Paradise Papers database. We decided we wanted to transfer this dataset to OrientDB, but we found that this database lacked an OrientDB port. Without a port we had two options: use the OrientDB-Neo4j Importer, able to migrate a Neo4j database to OrientDB, or start from scratch with the CSVs and use ETL.

We choose the second path and want to share with you exactly how we did it!

 

The Sprite

When writing some ETL from scratch a flexible data container is a must. The most flexible data container is a map (e.g. dictionary, associative array), but a map has no logic. Sprite was born to overcome this kind of problem and enables easy and fast data manipulation. It is a map that allows multiple values for each key. It offers methods to rename keys, copy values, apply functions to values, merge multiple values to single ones, and split a single value into more.

Below is a very simple example of what Sprite offers:

We began our ETL process by loading the Paradise Papers CSV files.

Unzip the archive, and explore the files:

The data set is composed of nodes and edges, with a header on the first line. Here are the headers for the other nodes file:

The edge file contains relations between nodes present in other files:

 

Data model

The data model is exactly the same you have in neo4j (https://neo4j.com/blog/depth-graph-analysis-paradise-papers/). The main difference is that a minimal schema is defined upfront:

The schema defines a base vertex class Paradise with a single property node_id. A unique index is then created on this property. The index is useful to speedup the lookups when creating edges linking vertices together.

Before loading the CSVs, the process creates the given schema.

 

Load!

Kotlin offers extension functions, a way to add methods to classes. Diving into a directory to get only some files is a one line statement:

The loadVertices function opens a file reader and uses apache commons-csv to obtain a sequence of lines:

Each line is loaded into a Map, and Sprite offers a way to be loaded from a map:

Once we have a single line loaded inside a Sprite, we can clean it and produce the OSQL line to create that vertex in OrientDB:

This process cleans some fields up:

What we are doing is removing noise from field content. The usual content for the field valid_until is:

“Aruba corporate registry data is current through 2016”

The cleanup function transform it to

“2016”

The same happens to sourceID, from:

“Paradise Papers – Aruba corporate registry”

To

“Aruba”

The last function is applied to all the fields using a regular expression to match all field names and apply a function to the values to escape content.

After the cleanup, a string containing the OSQL command is created and saved in sprite itself:

Here we are using the apply{} from Kotlin standard library: this is the sprite instance. First  the process extracts entries from the Sprite, by filtering empty values and the className field.

The setter string is generated from the set of filtered entries and used to complete the OSQL command.

 

Explore

Thanks to ETL, Arcade’s demo offers the opportunity to explore the Paradise Papers database in the original Neo4j flavour and the OrientDB one. Click here to try it out for yourself!