What is fraud detection?
By fraud detection we mean the process of identifying actual or attempted fraud within an organisation.
Telephone companies, insurance companies, banks and e-commerce platforms are examples of businesses that use large-scale data analysis techniques to prevent fraud.
In this scenario every organisation faces a big challenge: being good at detecting known types of fraud by searching for well-known patterns, while also being able to uncover new patterns and new fraud schemes.
We can usually categorise fraud detection along the following axes:
- Proactive vs. reactive
- Manual vs. automated
Why Fraud Detection is important
According to an economic crime survey performed by PwC in 2018, fraud is a billion-dollar business and it is increasing every year: half (49 percent) of the 7,200 companies they surveyed had experienced fraud of some kind.
Most fraud involves cell phones, tax return claims, insurance claims, credit cards, supply chains, retail networks and purchase dependencies, and it is a big problem for both governments and businesses.
Investing in fraud detection can bring the following benefits:
- Promptly react to fraudulent activities
- Reduce exposure to fraudulent activities
- Reduce economic damages caused by frauds
- Identify the vulnerable accounts most exposed to fraud
- Increase the trust and confidence of the organisation's shareholders
A skilled fraudster can work around basic fraud detection techniques; for this reason, developing new detection strategies is very important for any organisation, and fraud detection must be treated as a complex, always-evolving process.
Phases and Techniques
The fraud detection process starts with a high-level data overview, with the goal of discovering anomalies and suspicious behaviours in the dataset; for example, we could be interested in looking for unusual credit card purchases. Once we have found the anomalies we have to determine their origin, because each of them could be due to fraud, but also to errors in the dataset or simply to missing data.
This fundamental step is called data validation, and it consists of error detection, followed by the correction of incorrect data and the filling in of missing data.
Once the data has been cleaned up, the real data analysis phase can start; after the analysis is complete, all the results must be validated, reported and presented graphically.
To recap, the main steps in the detection process are the following:
- Data collection
- Data preparation
- Data analysis
- Report and presentation of results
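As a minimal sketch of the data preparation step described above, the following Python snippet detects invalid rows and fills in missing values; the record fields and the median-imputation strategy are purely illustrative assumptions:

```python
from statistics import median

# Hypothetical transaction records; the field names are illustrative only.
transactions = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": None},    # missing data
    {"id": 3, "amount": -50.0},   # error: negative amount
    {"id": 4, "amount": 80.0},
]

def validate(records):
    """Error detection, incorrect-data removal and missing-data filling."""
    valid_amounts = [r["amount"] for r in records
                     if r["amount"] is not None and r["amount"] >= 0]
    fill = median(valid_amounts)          # simple imputation strategy
    cleaned, errors = [], []
    for r in records:
        if r["amount"] is None:
            r = {**r, "amount": fill}     # missing-data filling
        elif r["amount"] < 0:
            errors.append(r["id"])        # error detection
            continue
        cleaned.append(r)
    return cleaned, errors

cleaned, errors = validate(transactions)
```

A real pipeline would of course use domain-specific validation rules rather than a single sign check, but the three sub-steps (detect, correct or discard, fill) are the same.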
Arcade Analytics fits the last steps very well, as it is a tool conceived to create captivating and effective reports that make it very easy to share the results of a specific analysis by composing different widgets into complex dashboards.
The main widget is the Graph Widget, which allows users to visually explore the relationships and connections within their datasets and find meaningful patterns. Moreover, all the widgets in the same dashboard can be connected so that they interact with each other. In this way the resulting dashboard offers bidirectional interactions between the graph, data table and traditional chart widgets.
The chart distributions are computed on the partial datasets of the corresponding primary widgets, making the final report dynamic and interactive.
But that is not all: Arcade can also be useful for several techniques in the data analysis step. Let's see how!
Data analysis often relies on automated processes that exploit statistical methods and artificial intelligence techniques, commonly classified as supervised and unsupervised techniques.
Among the statistical methods we can find:
- Data processing
- Calculation of statistical parameters relevant to the specific domain
- Models and probability distributions
- Time series analysis
- Clustering and classification of entities, in order to find associations and patterns in the data
Arcade offers several tools to perform single- and multi-series analyses, exploiting an efficient full-text search engine and inverted indices that ensure good performance when computing statistical parameters and distributions over the whole data source.
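As a tiny illustration of the statistical-parameter approach, the sketch below flags purchases that lie far from a card's usual spending behaviour; the sample amounts and the two-standard-deviation threshold are assumptions for the example:

```python
from statistics import mean, stdev

# Hypothetical purchase amounts for one credit card (illustrative data).
amounts = [23.5, 41.0, 19.9, 35.2, 28.7, 950.0, 30.1, 25.4]

mu, sigma = mean(amounts), stdev(amounts)

# Flag purchases more than 2 standard deviations above the mean.
suspicious = [a for a in amounts if (a - mu) / sigma > 2]
```

On this data only the 950.0 purchase is flagged; in practice the relevant parameters and thresholds depend on the specific domain, as noted above.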
Chart: credit card type distribution
Chart: global transactions/orders type distribution
These methods are good for statistical classification and for inferring rules: the rules can then be used to define rule-based classifiers, supervised learning algorithms built on rules of the form If (certain conditions hold) Then (assign the appropriate category).
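A rule-based classifier of this kind can be sketched in a few lines; the thresholds and field names below are hypothetical, not rules from any real system:

```python
# A minimal rule-based classifier: an ordered list of (condition, category)
# rules, evaluated top to bottom. Thresholds are illustrative assumptions.
rules = [
    (lambda t: t["amount"] > 5000 and t["country"] != t["home_country"],
     "suspicious"),
    (lambda t: t["amount"] > 10000, "suspicious"),
    (lambda t: True, "legitimate"),   # default rule
]

def classify(transaction):
    for condition, category in rules:
        if condition(transaction):    # If (the conditions hold)...
            return category           # ...Then (assign the category)
```

For example, a 6,000 purchase made abroad matches the first rule and is classified as suspicious, while a small domestic purchase falls through to the default rule.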
Moreover, Arcade offers good support for time series analysis: using the timeline feature you can see your data in the form of a graph and watch how it changes over time.
In this way we can analyse when the relationships between specific items or entities appeared, exploiting a time-filtering window to narrow the temporal analysis to a specific, customisable range.
You can then interact with this analysis by zooming in and out: by changing the grain you can perform a simple top-down temporal analysis, starting from a wider view, useful for seeing at a glance how the events are distributed over time, down to each single event if needed.
Obviously in each perspective you can move back and forth through time.
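The two operations behind this kind of timeline (a customisable time window and a variable grain) can be sketched as follows; the event timestamps are invented for the example:

```python
from datetime import datetime

# Hypothetical timestamped events, e.g. transactions between two entities.
events = [
    datetime(2018, 3, 1, 10, 15),
    datetime(2018, 3, 1, 10, 40),
    datetime(2018, 5, 20, 9, 0),
    datetime(2018, 5, 20, 9, 5),
]

def in_window(evts, start, end):
    """Time-filtering window: narrow the analysis to a customisable range."""
    return [e for e in evts if start <= e < end]

def counts_by_grain(evts, fmt):
    """Aggregate events at a chosen grain, e.g. '%Y-%m' (month), '%Y-%m-%d' (day)."""
    counts = {}
    for e in evts:
        key = e.strftime(fmt)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Zooming in then simply means switching to a finer grain (month, day, hour) over a narrower window, down to the single event.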
Besides the statistical methods, it can be helpful to add automated processes such as:
- Data mining
- Pattern recognition
- Machine learning and prediction to implement proactive rules
These unsupervised methods do not require samples of fraudulent transactions, so they turn out to be useful in all those scenarios where there is no prior knowledge of the transaction classes, or when we want to extend those categories in order to recognise previously undiscovered fraud not yet present in historical databases.
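To make the "no labels needed" point concrete, here is a tiny unsupervised sketch: a one-dimensional k-means that separates transaction amounts into groups without ever seeing a fraud/non-fraud label. The data and the choice of k=2 are illustrative assumptions:

```python
from statistics import mean

def kmeans_1d(values, k=2, iters=20):
    """Minimal 1-D k-means: clusters values with no labelled examples."""
    # Seed centroids with evenly spaced sorted values.
    centroids = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical amounts: mostly small purchases plus a few unusually large ones.
amounts = [20, 25, 30, 22, 27, 900, 950]
centroids, clusters = kmeans_1d(amounts, k=2)
```

The small isolated cluster of very large amounts emerges from the data itself and can then be handed to an analyst for validation, which is exactly the scenario where historical fraud labels are missing.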
Importance of human interaction
In these scenarios we often encounter the concept of fraud analytics, commonly conceived as a combination of automated analytics technologies and techniques with human interaction. In fact we cannot do without the interaction of domain-expert users, mainly for two reasons:
- The high number of false positives: not all transactions detected as fraudulent are actually fraud. Detection systems based on even the best algorithms generally produce too many false positives, even though they can identify a high percentage of the actual fraudulent transactions (up to 99%). Thus all the results must be validated in order to exclude the false positives.
- The high time complexity of the algorithms, above all in prediction scenarios: when an algorithm's time complexity is exponential, a monolithic execution is not a good approach, because it can take a very long time on large inputs. A progressive approach is therefore adopted, which reduces the computation time by combining specific resolution models and automated calculation with the interaction of a human expert, called the designer.
Intermediate results are proposed to the designer during the computation, who then decides, step by step, which direction the analysis should take. In this way whole execution branches can be skipped, achieving a good performance gain.
For both these purposes a visual tool is needed, and Arcade Analytics turns out to be very appropriate for these tasks thanks to the features already shown and the expressive power of the graph model.
How the graph perspective can help
A graph perspective can be very useful in fraud detection use cases because, as already said, most of the computation relies on pattern recognition. We can then use these patterns to find and retrieve all the unusual behaviours we are looking for, without needing to write complex join queries.
Specifically, Arcade supports different graph query languages based on:
- the pattern-matching approach: the Cypher query language proposed by Neo4j and the MATCH statement of the OrientDB SQL query language are currently fully supported in Arcade. This is the winning approach when we can rely on several known patterns to detect fraud.
- the graph traversal approach, which makes it very simple to walk the graph and explore the information of actual interest. Gremlin is a good example of this kind of language.
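To illustrate the traversal idea independently of any query language, here is a plain-Python breadth-first walk over a toy property graph; in Arcade you would express the same exploration with Cypher, MATCH or Gremlin, and the vertices below are invented for the example:

```python
# A toy graph as adjacency lists: accounts linked to shared contact data.
edges = {
    "Account:A": ["Phone:555", "Address:Elm St"],
    "Account:B": ["Phone:555"],
    "Account:C": ["Address:Elm St"],
    "Phone:555": ["Account:A", "Account:B"],
    "Address:Elm St": ["Account:A", "Account:C"],
}

def traverse(start, max_depth=2):
    """Breadth-first traversal from a starting vertex, up to max_depth hops."""
    seen, frontier = {start}, [start]
    for _ in range(max_depth):
        frontier = [n for v in frontier
                    for n in edges.get(v, []) if n not in seen]
        seen.update(frontier)
    return seen

# All accounts reachable from Account:A in two hops, via shared contact data.
connected = {v for v in traverse("Account:A") if v.startswith("Account:")}
```

Note that the same question asked of a relational schema would require joining the accounts table against the phone and address tables twice, which is exactly the kind of query the graph perspective lets us avoid writing.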
Moreover, one of the most attractive features of Arcade Analytics is that it allows users to query data in a relational database, easily visualise that data as a graph, and explore the connections inside it without any migration and in a few simple steps.
Now let's have a look at a very common pattern, recognised as potential fraudulent activity and often missed by traditional fraud detection systems, and at how Arcade Analytics can help us analyse all the instances matching this specific schema.
First-party fraud detection
First of all, let's define what first-party fraud is. From definitions.uslegal.com we get the following definition:
“First party fraud refers to fraud that is committed by an individual or group of individuals on their own account by opening an account with no intention of repayment. A first party fraud applicant uses synthetic identification or they generally misrepresent their real identity by lying to creditors on application forms, or by using false or proxy addresses. A first party fraud is different from a third party fraud or identity fraud because in third party fraud the perpetrator of fraud uses another person’s identifying information. First party fraud includes advances fraud, bust out fraud, friendly fraud, application fraud, and sleeper fraud.”
In recent years the number of third-party frauds, based on identity theft, has been decreasing, while cases of misrepresented identities and false personal information are growing.
We can imagine many different scenarios that can be categorised as first-party fraud, from the simplest to the most complicated.
The following can be a simple example:
John Smith opens a new credit card account, maxes out his credit line, defaults and then disappears without leaving any trace.
In this scenario Mr. Smith used his own credentials, with minor variations in his contact data, to deliberately defraud the credit card company.
But we can also encounter groups of two or more people organised into a fraud ring, sharing a subset of legitimate contact information, such as telephone numbers and addresses.
This data is combined to create several synthetic identities that the ring members use to open fraudulent accounts. These new accounts then have access to credit lines, credit cards, overdraft protection, personal loans, etc.
The accounts are used normally, with regular purchases and timely payments, so over time the banks increase the revolving credit lines thanks to the observed responsible credit behaviour.
One day the ring coordinates its activity, maxing out all of the credit lines, and its members disappear.
This ring schema can be detected as suspicious behaviour, and if it is recognised and validated in time, big losses can be avoided.
Here is a sample graph instance in Arcade matching this ring pattern:
This pattern can be searched for in the data with a specific query, loaded into Arcade Analytics and deeply investigated by human experts in order to prevent this kind of fraud.
To conclude, we can state that Arcade Analytics can make a valuable contribution to a complex fraud detection system by covering different roles in the whole process.
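The core of such a query is finding accounts that share contact identifiers. As a language-neutral sketch of that idea (the account data below is invented), the following Python snippet builds candidate ring edges from shared phone numbers and addresses:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical accounts; shared identifiers may reveal synthetic identities.
accounts = {
    "acc1": {"phone": "555-0100", "address": "1 Elm St"},
    "acc2": {"phone": "555-0100", "address": "9 Oak Ave"},
    "acc3": {"phone": "555-0199", "address": "1 Elm St"},
    "acc4": {"phone": "555-0123", "address": "7 Pine Rd"},
}

# Invert the data: index account ids by each contact value.
by_value = defaultdict(set)
for acc, info in accounts.items():
    for value in info.values():
        by_value[value].add(acc)

# Every pair of accounts sharing an identifier is a candidate ring edge;
# the connected components of these edges are the candidate rings.
ring_edges = {pair for shared in by_value.values() if len(shared) > 1
              for pair in combinations(sorted(shared), 2)}
```

Here acc1, acc2 and acc3 end up linked through a shared phone and a shared address, while acc4 stays isolated; the linked subgraph is precisely the ring-shaped instance that a graph query would surface for the analysts to validate.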
I hope this post was helpful and interesting, stay tuned!