- Phishing: a form of social engineering in which attackers persuade people to reveal sensitive information or to install malware.
- Skimming: the use of a small electronic device, called a skimmer, to capture and store victims’ card numbers.
- Identity fraud: the unauthorized use of other people’s personal and financial information.
- Refund fraud: a type of payment fraud in which an individual or a group of people falsely claims a refund or reimbursement from a company, government, or financial institution.
- General Data Protection Regulation (GDPR)
- Payment Card Industry Data Security Standard (PCI DSS)
- Anomaly detection: Identifying unusual patterns and deviations from normal transaction behavior. This includes detecting transactions with unusual amounts, unusual transaction times, or transactions from unusual locations.
- Network analysis: It uses graph analysis to detect unusual relationships between financial entities, helping uncover criminal organization activities and networks of fraudulent actors. This involves analyzing transaction networks to identify clusters of connected entities engaging in suspicious behavior.
- Identity verification: It uses machine learning to verify user identity information, such as identification documents or facial recognition data. This includes techniques like document authentication, biometric verification, and behavioral analysis to ensure the authenticity of user identities.
- Text analysis: It uses unsupervised learning to analyze unstructured data, such as email and message text, to detect keywords indicating fraud. Natural language processing techniques are used to extract meaningful information from text data, such as identifying phishing attempts or fraudulent communications.
- Risk rating: It utilizes machine learning algorithms like logistic regression to assign a risk level to transactions. Transactions exceeding a certain risk threshold are flagged as potentially fraudulent and subjected to human review. Other techniques, such as decision trees or random forests, are also used for risk assessment.
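
As a minimal sketch of the anomaly detection idea listed above, the snippet below flags transaction amounts that lie far from the average. The z-score rule, the function name `flag_unusual_amounts`, the threshold of 3.0 and the sample amounts are all assumptions made for illustration; they are not taken from a specific production system.

```python
from statistics import mean, stdev


def flag_unusual_amounts(amounts, z_threshold=3.0):
    """Return the indices of amounts whose z-score exceeds z_threshold.

    Hypothetical helper: a z-score rule is only one simple way to
    express "unusual amount"; real systems combine many signals.
    """
    mu = mean(amounts)
    sigma = stdev(amounts)  # sample standard deviation
    if sigma == 0:
        # All amounts are identical, so nothing stands out.
        return []
    return [i for i, a in enumerate(amounts) if abs(a - mu) / sigma > z_threshold]


# Hypothetical data: twenty ordinary payments and one very large one.
amounts = [10.0 + (i % 5) for i in range(20)] + [2500.0]
flagged = flag_unusual_amounts(amounts)
print(flagged)
```

The same pattern could be applied to other features mentioned in the list, such as transaction time, by replacing the amount with a numeric encoding of that feature.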
Logistic Regression
A popular way to build a fraud detection system is through logistic regression, a widely used classification algorithm that predicts a binary outcome from a set of independent variables. In this case, the variables are all the information about the payment, such as the amount, time of day, location, purchased good and others, which are evaluated to estimate the probability that the payment is fraudulent. We will now see the steps involved in building a model of this kind, from cleaning the training database to testing and evaluation.
Data Cleaning and database division
The first phase consists of selecting a database to train your model on. We chose a database of transactions made with credit cards by European cardholders during two days of September 2013. Since frauds represent a small minority of transactions, the data is very unbalanced: over the whole sample of 284,807 transactions, only 492 are fraudulent (0.172%). [data available here]
Since the data is often noisy, we must clean it and prepare it to train the classifier. During this procedure we must fill in possible missing values and remove any outliers; failing to do so could lead to poor model performance or errors in the training phase. First, we can fill the missing values with the mean of the corresponding variable. Then, outliers are removed with a clustering method that divides the dataset into three bins: two for the higher and lower outliers and one for the rest. The first two are then eliminated.
Finally, the database is divided into training and testing samples. The goal of the training database is to construct the classifier (model), while the goal of the testing database is to test (evaluate) the built classifier. In this work, the cross-validation method is used to divide the database. As shown in the figure, the database is divided into 10 parts. In the first iteration (𝑘 = 1), the first nine parts are used as the training set, while the last part of the database is used as the testing set. In the second iteration (𝑘 = 2), the first eight parts together with the tenth part are used as the training set, while the ninth part is used as the testing set. This process continues until the last iteration (𝑘 = 10), where the first part is the testing set and the last nine parts are the training set.
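
The mean imputation and the 10-part division described above can be sketched as follows. This is a simplified illustration in plain Python: the function names `fill_missing_with_mean` and `ten_fold_splits` are hypothetical, missing values are assumed to be encoded as `None`, and the outlier-removal step is left out because the text does not specify the clustering method in enough detail.

```python
def fill_missing_with_mean(column):
    """Replace None entries with the mean of the observed values.

    Assumes at least one value in the column is present.
    """
    observed = [v for v in column if v is not None]
    col_mean = sum(observed) / len(observed)
    return [col_mean if v is None else v for v in column]


def ten_fold_splits(n_records, n_folds=10):
    """Yield (train_indices, test_indices) pairs, one per iteration.

    Follows the order described in the text: iteration k = 1 tests on
    the last part, iteration k = n_folds tests on the first part.
    """
    indices = list(range(n_records))
    fold_size, remainder = divmod(n_records, n_folds)
    parts, start = [], 0
    for p in range(n_folds):
        # Spread any leftover records over the first `remainder` parts.
        end = start + fold_size + (1 if p < remainder else 0)
        parts.append(indices[start:end])
        start = end
    for k in range(1, n_folds + 1):
        test_part = n_folds - k
        test = parts[test_part]
        train = [i for p, part in enumerate(parts) if p != test_part for i in part]
        yield train, test


# Hypothetical usage on a tiny column and a 20-record database.
filled = fill_missing_with_mean([10.0, None, 30.0])
splits = list(ten_fold_splits(20))
first_train, first_test = splits[0]
last_train, last_test = splits[-1]
print(filled, first_test, last_test)
```

With data as unbalanced as this, it may also be worth shuffling the records and keeping the fraud ratio similar in every part (stratification), so that each testing set contains some fraudulent transactions; the sketch above does not do this.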
Building the classifier
In the context of building the classifier, logistic regression is preferred over linear regression because its output is bounded between 0 and 1 and can be read as a probability, whereas a linear model produces unbounded values. Logistic regression takes the data as input (interpreted as variables), estimates the probability that it belongs to the fraud category, and returns that probability as output, call it \(y\). The mathematical steps to obtain the logistic equations from linear regression are given below. The equation of the straight line can be written as:
$$y=a_0+a_1\times x_1+a_2\times x_2+\dots+a_k\times x_k$$
where \(x_1,x_2,\dots,x_k\) are the variables and \(a_0,a_1,a_2,\dots,a_k\) are the coefficients that we are going to estimate during training. In logistic regression, \(y\) can only lie between 0 and 1, so instead of modeling \(y\) directly we divide it by \((1-y)\) to obtain the odds, defined as the ratio of favorable to unfavorable outcomes. The odds evaluate as:
$$\frac{y}{1-y}=0\text{ for } y=0\quad\text{and}\quad\frac{y}{1-y}\to\infty\text{ for } y\to 1$$
Taking the logarithm of the odds removes the remaining lower bound, so the logistic regression equation is defined as:
$$\log\left(\frac{y}{1-y}\right)=a_0+a_1\times x_1+a_2\times x_2+\dots+a_k\times x_k$$
Now the logit can take values from negative to positive infinity. To transform it back into a probability, which must be between 0 and 1, we apply the inverse logit function, also known as the sigmoid function:
$$y=\frac{1}{1+e^{-(a_0+a_1x_1+\dots+a_kx_k)}}$$
The fraud class takes the value “1”, while the non-fraud class takes the value “0”. A threshold of 0.5 on \(y\) is used to differentiate between the two classes.
Testing and evaluating
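
The prediction rule from the classifier section, the sigmoid applied to a linear combination of the variables followed by a 0.5 threshold, can be sketched as follows. The coefficient values, the two example features (amount and a foreign-transaction flag) and the function names are hypothetical; in practice the coefficients come out of training. Treating a probability of exactly 0.5 as fraud is also an assumption, since the text does not say how ties are handled.

```python
import math


def predict_fraud_probability(features, coefficients, intercept):
    """Return y = sigmoid(a_0 + a_1*x_1 + ... + a_k*x_k)."""
    z = intercept + sum(a * x for a, x in zip(coefficients, features))
    # Numerically stable sigmoid: avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(probability, threshold=0.5):
    """Return 1 (fraud) if probability >= threshold, else 0 (non-fraud)."""
    return 1 if probability >= threshold else 0


# Hypothetical coefficients: a_0 = -4.0, a_1 = 0.002 (amount), a_2 = 1.5 (foreign flag).
intercept = -4.0
coefficients = [0.002, 1.5]

p_low = predict_fraud_probability([1200.0, 1.0], coefficients, intercept)   # logit = -0.1
p_high = predict_fraud_probability([2000.0, 1.0], coefficients, intercept)  # logit = 1.5

print(p_low, classify(p_low))
print(p_high, classify(p_high))

# The log of the odds recovers the linear combination (the logit).
logit_high = math.log(p_high / (1.0 - p_high))
```

The last line illustrates the relation between the two forms of the model: applying the log-odds to the predicted probability gives back the linear expression in the variables.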
Since the cross-validation method divides the database into 10 parts, there are 10 testing data sets. To determine the accuracy, sensitivity and error rate of each test, we rely on a confusion matrix, which is built from four counts: true positives, true negatives, false positives and false negatives. Accuracy is the percentage of records in the test set that are correctly classified (fraudulent or non-fraudulent), sensitivity is the percentage of fraudulent records that are correctly identified, and error rate is the percentage of records that are misclassified. The final accuracy of the trained classifier is defined as the average accuracy of the model on the ten test sets.
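
A small sketch of how these metrics could be computed from the confusion matrix is given below. The labels and predictions are made-up values for a single test set, and the function names are hypothetical; the formulas are the standard definitions of accuracy, sensitivity and error rate.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), with 1 = fraud and 0 = non-fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(y_true, y_pred):
    """Compute accuracy, sensitivity and error rate for one test set."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Sensitivity is undefined without actual frauds; 0.0 is a placeholder.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "error_rate": (fp + fn) / total,
    }


def average_accuracy(fold_accuracies):
    """Final accuracy: the mean of the per-fold accuracies."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Hypothetical labels and predictions for one test set of ten records.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Because only 0.172% of the transactions in this database are fraudulent, a classifier that labels every transaction as non-fraud would still reach an accuracy of about 99.8%, which is why sensitivity is worth reading alongside accuracy.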
If the model performs according to the target metrics, it can then be deployed in real financial applications; otherwise, further training will be required.
Sources:
- https://www.ravelin.com/insights/machine-learning-for-fraud-detection
- https://www.nomentia.com/blog/ai-machine-learning-in-fraud-detection
- https://www.itransition.com/machine-learning/fraud-detection
- https://thesai.org/Downloads/Volume11No12/Paper_65-Fraud_Detection_in_Credit_Cards.pdf
Authors: Matteo Mello, Riccardo Scibetta