Presentation Information
[1Yin-A-05]A Study of End-to-End Learning for Doubly Robust Estimation in Large Action Spaces
〇Natsuki Fukano1, Tianxiang Yang1, Hideo Suzuki1 (1. Keio University)
Keywords:
Off-Policy Evaluation, Doubly Robust, High-Dimensional Data, Selection Bias, Counterfactuals
In marketing and operational decision-making, deploying untested policies directly to users carries significant risks, such as diminished customer satisfaction and safety concerns. To mitigate these risks, Off-Policy Evaluation (OPE) has been proposed as a way to estimate a new policy's performance in advance from historical data. However, in environments with massive action spaces, such as e-commerce platforms, most actions are never recommended to any given user. In these settings, traditional estimators such as Inverse Propensity Scoring (IPS) and Doubly Robust (DR) suffer from extreme variance. Marginalized IPS (MIPS) addresses this by marginalizing over action features, but it has been reported to exacerbate bias and to incur substantial operational costs for feature preparation and method selection. This research proposes a novel method that outperforms MIPS in accuracy with lower implementation overhead. By constructing a differentiable objective function, we achieve end-to-end learning via backpropagation. This approach improves the accuracy of OPE in large-scale environments while making it more accessible for practical applications.
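For readers unfamiliar with the baseline estimators mentioned above, the following is a minimal illustrative sketch of IPS and DR value estimation on logged bandit data. It is not the authors' proposed method; the function names and array-based interface are assumptions for the example. IPS reweights logged rewards by the importance ratio between the evaluation and logging policies; DR adds a reward-model baseline and applies the importance weight only to the model's residual, which is what makes its variance explode when weights are extreme in large action spaces.

```python
import numpy as np

def ips_estimate(rewards, pi_e_probs, pi_b_probs):
    """IPS estimate of the evaluation policy's value.

    rewards: observed rewards for the logged actions.
    pi_e_probs: evaluation-policy probabilities of those actions.
    pi_b_probs: logging-policy (behavior) probabilities of those actions.
    """
    # Importance weights pi_e(a|x) / pi_b(a|x); these can be huge
    # when the logging policy rarely chooses an action.
    w = pi_e_probs / pi_b_probs
    return np.mean(w * rewards)

def dr_estimate(rewards, pi_e_probs, pi_b_probs, q_hat_logged, v_hat):
    """Doubly Robust estimate.

    q_hat_logged: reward-model predictions for the logged (x, a) pairs.
    v_hat: model-based value baseline per sample,
           i.e. E_{a ~ pi_e}[q_hat(x, a)].
    """
    w = pi_e_probs / pi_b_probs
    # Baseline plus importance-weighted correction on the residual:
    # if the reward model is accurate, the correction term is small,
    # which reduces variance relative to plain IPS.
    return np.mean(v_hat + w * (rewards - q_hat_logged))
```

As a sanity check on the DR form: when the reward model fits the logged rewards exactly, the correction term vanishes and the estimate reduces to the model-based baseline alone.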
