Fraud Detection

Offer Compliance Classification Model

Overview

This document describes the development and evaluation of a classification model designed to automate and streamline the compliance review of offers on the "Pass Culture" platform, which is currently done manually. The model predicts from an offer's attributes whether the offer will be validated or rejected, thereby reducing the manual review workload.

Context

Manually created offers on the "Pass Culture" platform are subjected to a compliance scoring script. Offers that trigger alerts require manual review, leading to a significant volume of offers needing analysis.

Data

Dataset

  • The dataset consists of offers that have undergone manual review.
  • It contains approximately 230,000 offers and is imbalanced: ~90% validated and ~10% rejected.
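
With roughly 90% of reviewed offers validated, a classifier that always predicts "validated" would already reach about 90% accuracy, which is why class-sensitive metrics matter for this dataset. The sketch below illustrates this with the approximate figures above. The function names and the inverse-frequency weighting scheme are illustrative assumptions, not the project's actual code.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count) for label, count in class_counts.items()}


# Approximate counts derived from ~230,000 offers split ~90% / ~10%.
counts = {"validated": 207_000, "rejected": 23_000}

baseline = majority_baseline_accuracy(counts)
weights = inverse_frequency_weights(counts)

print(f"majority baseline accuracy: {baseline:.2f}")
print({label: round(weight, 2) for label, weight in weights.items()})
```

Weights of this kind are one common way to make a classifier pay more attention to the minority class; whether the model described here uses them is not stated.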

Features

The following offer attributes were used as features:

  • Textual:
    • offer_name
    • offer_description
  • Categorical:
    • offer_subcategory_id
    • venue_department_code
  • Numerical:
    • stock
    • stock_price
  • Boolean:
    • outing
    • physical_goods
  • Other:
    • type
    • subType
    • rayon (product category)
    • macro_rayon (broader product category)
    • Offer image

Feature Engineering

  • Textual Features: Processed directly by CatBoost.
  • Categorical Features: One-hot encoded.
  • Numerical Features: Used as is.
  • Image Features: Embedded using a pre-trained model.
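
The tabular part of the steps above can be sketched as follows. This is a minimal illustration using only the standard library: the helper names and sample values are hypothetical, casting booleans to 0/1 is an assumption, and the real pipeline may well rely on a library encoder. Text and image features are left out because they are handled by CatBoost and by a pre-trained embedding model respectively.

```python
def fit_one_hot(values):
    """Learn a stable category -> column index mapping from training values."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def one_hot(value, mapping):
    """Encode one value; categories unseen at fit time give an all-zero vector."""
    vector = [0] * len(mapping)
    if value in mapping:
        vector[mapping[value]] = 1
    return vector


# Hypothetical training values for the two categorical features.
subcategory_map = fit_one_hot(["SUBCAT_A", "SUBCAT_B", "SUBCAT_A"])
department_map = fit_one_hot(["75", "13", "69"])

# Hypothetical offer record using the feature names listed in this document.
offer = {
    "offer_subcategory_id": "SUBCAT_B",
    "venue_department_code": "13",
    "stock": 10,
    "stock_price": 12.5,
    "outing": False,
    "physical_goods": True,
}

row = (
    one_hot(offer["offer_subcategory_id"], subcategory_map)
    + one_hot(offer["venue_department_code"], department_map)
    + [offer["stock"], offer["stock_price"]]  # numerical features used as is
    + [int(offer["outing"]), int(offer["physical_goods"])]  # assumption: booleans as 0/1
)
print(row)
```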

Model

Model Architecture

  • The model uses the CatBoostClassifier from the CatBoost library, which is well-suited for handling a mix of numerical, categorical, and textual features.
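
A minimal sketch of how such a classifier could be configured, assuming the catboost and pandas packages are installed. The column grouping follows the feature list in this document, but the hyperparameter values, the use of auto_class_weights, and the train_classifier helper are illustrative assumptions, not the project's actual configuration. The remaining features (type, subType, rayon, macro_rayon, image embeddings) are omitted for brevity. Here the categorical columns are passed to CatBoost by name; the one-hot encoding mentioned under Feature Engineering could equally be applied beforehand.

```python
TEXT_FEATURES = ["offer_name", "offer_description"]
CAT_FEATURES = ["offer_subcategory_id", "venue_department_code"]
NUMERIC_FEATURES = ["stock", "stock_price"]
BOOLEAN_FEATURES = ["outing", "physical_goods"]
FEATURE_COLUMNS = TEXT_FEATURES + CAT_FEATURES + NUMERIC_FEATURES + BOOLEAN_FEATURES

# Illustrative hyperparameters only; the values used for the real model are not documented here.
PARAMS = {
    "iterations": 500,
    "learning_rate": 0.1,
    "auto_class_weights": "Balanced",  # assumption: reweight classes for the ~90/10 imbalance
    "verbose": False,
}


def train_classifier(train_df, train_labels):
    """Fit a CatBoostClassifier on a pandas DataFrame that holds FEATURE_COLUMNS.

    Text and categorical columns are expected to contain strings without missing values.
    """
    # Imported lazily so the constants above can be reused without catboost installed.
    from catboost import CatBoostClassifier, Pool

    pool = Pool(
        data=train_df[FEATURE_COLUMNS],
        label=train_labels,
        cat_features=CAT_FEATURES,
        text_features=TEXT_FEATURES,
    )
    model = CatBoostClassifier(**PARAMS)
    model.fit(pool)
    return model
```

Once trained, model.predict_proba returns per-class probabilities, which is what a later thresholding step would consume.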

Training

  • The model was trained on the manually reviewed offer dataset.
  • The temporal aspects of training remain to be reviewed as a possible improvement.

Evaluation

  • The model was evaluated using a test set of 22,882 offers (20,187 validated, 2,695 rejected).

Metrics

  • Accuracy: 0.95
  • Balanced Accuracy: 0.81
  • Precision: 0.95
  • Recall: 0.99
  • Balanced Error Rate: 0.18
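
For reference, the metrics above can all be derived from the four cells of a binary confusion matrix. The sketch below shows the standard definitions. This document does not state which class is treated as positive, so the sketch assumes "validated"; the counts are toy values chosen for illustration, not the model's actual results. By definition the balanced error rate is one minus the balanced accuracy, so the two reported values should sum to 1 up to rounding.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard binary metrics; 'validated' is assumed to be the positive class."""
    recall = tp / (tp + fn)        # share of validated offers predicted validated
    specificity = tn / (tn + fp)   # share of rejected offers predicted rejected
    precision = tp / (tp + fp)     # share of predicted-validated offers truly validated
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "precision": precision,
        "recall": recall,
        "balanced_error_rate": 1 - balanced_accuracy,
    }


# Toy counts (900 validated, 100 rejected), not the real confusion matrix.
metrics = binary_metrics(tp=890, fn=10, fp=40, tn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The toy example shows how accuracy and recall can both be high while balanced accuracy stays noticeably lower, because errors on the minority class barely affect the former.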

Confusion Matrix

  • Generated from the test set.
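
As a hedged illustration of how such a matrix is built from true and predicted labels, the sketch below counts label pairs with the standard library. The function name, the row and column ordering, and the five sample labels are assumptions made for this example.

```python
from collections import Counter

LABELS = ("validated", "rejected")


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Return a matrix with rows = true class and columns = predicted class."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(true, pred)] for pred in labels] for true in labels]


# Toy labels, not taken from the real test set.
y_true = ["validated", "validated", "validated", "rejected", "rejected"]
y_pred = ["validated", "validated", "rejected", "validated", "rejected"]

matrix = confusion_matrix(y_true, y_pred)
for label, row in zip(LABELS, matrix):
    print(f"true {label:<9} -> {row}")
```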

Feature Importance

  • SHAP Values: Used to estimate the contribution of each feature to the model's predictions.
  • Feature Analysis: Detailed analysis available in the associated notebook.
  • Probability Density: Visualization of the distribution of predicted probabilities for validated and rejected offers, which supports the choice of a decision threshold.
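
Threshold optimization can be sketched as a simple sweep over candidate cut-offs applied to the predicted probabilities. The example below picks the threshold that maximizes balanced accuracy. That objective, the function names, and the toy scores are assumptions for illustration; the criterion actually used for this model is not stated here.

```python
def balanced_accuracy(y_true, scores, threshold):
    """Balanced accuracy when offers scoring >= threshold are predicted validated (1)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return (tp / positives + tn / negatives) / 2


def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest balanced accuracy."""
    return max(candidates, key=lambda t: balanced_accuracy(y_true, scores, t))


# Toy labels (1 = validated, 0 = rejected) and predicted validation probabilities.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.97, 0.93, 0.91, 0.87, 0.72, 0.63, 0.81, 0.56, 0.31, 0.12]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold = best_threshold(y_true, scores, candidates)
print(f"best threshold: {threshold:.2f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, scores, threshold):.3f}")
```

In practice the objective would reflect the cost of each error type, for example how many non-compliant offers may slip through in exchange for fewer manual reviews.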