Presentation Information
[1Yin-A-05]A Study of End-to-End Learning for Doubly Robust Estimation in Large Action Spaces
〇Natsuki Fukano1, Tianxiang Yang1, Hideo Suzuki1 (1. Keio University)
Keywords:
Off-Policy Evaluation, Doubly Robust, High-Dimensional Data, Selection Bias, Counterfactuals
In marketing and operational decision-making, deploying untested policies directly to users carries significant risks, such as diminished customer satisfaction and safety concerns. To mitigate these risks, Off-Policy Evaluation (OPE) has been proposed as a way to estimate a new policy's performance in advance from historical data. However, in environments with massive action spaces, such as e-commerce platforms, most actions are never recommended to any given user. In these settings, traditional estimators such as Inverse Propensity Scoring (IPS) and Doubly Robust (DR) suffer from extreme variance. Marginalized IPS (MIPS) addresses this by marginalizing over action features, but it has been reported to exacerbate bias and to incur substantial operational costs for feature preparation and method selection. This research proposes a novel method that outperforms MIPS in accuracy with lower implementation overhead. By constructing a differentiable objective function, we achieve end-to-end learning via backpropagation. This approach improves the accuracy of OPE in large-scale environments while making it more accessible for practical applications.
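For readers unfamiliar with the baseline estimators mentioned above, the following is a minimal illustrative sketch of IPS and DR value estimation on logged bandit data. It is not the authors' proposed method; the function names and array-based interface are assumptions for the example. IPS reweights logged rewards by the importance ratio between the evaluation and logging policies; DR adds a reward-model baseline and applies the importance weight only to the model's residual, which is what makes its variance explode when weights are extreme in large action spaces.

```python
import numpy as np

def ips_estimate(rewards, pi_e_probs, pi_b_probs):
    """IPS estimate of the evaluation policy's value.

    rewards: observed rewards for the logged actions.
    pi_e_probs: evaluation-policy probabilities of those actions.
    pi_b_probs: logging-policy (behavior) probabilities of those actions.
    """
    # Importance weights pi_e(a|x) / pi_b(a|x); these can be huge
    # when the logging policy rarely chooses an action.
    w = pi_e_probs / pi_b_probs
    return np.mean(w * rewards)

def dr_estimate(rewards, pi_e_probs, pi_b_probs, q_hat_logged, v_hat):
    """Doubly Robust estimate.

    q_hat_logged: reward-model predictions for the logged (x, a) pairs.
    v_hat: model-based value baseline per sample,
           i.e. E_{a ~ pi_e}[q_hat(x, a)].
    """
    w = pi_e_probs / pi_b_probs
    # Baseline plus importance-weighted correction on the residual:
    # if the reward model is accurate, the correction term is small,
    # which reduces variance relative to plain IPS.
    return np.mean(v_hat + w * (rewards - q_hat_logged))
```

As a sanity check on the DR form: when the reward model fits the logged rewards exactly, the correction term vanishes and the estimate reduces to the model-based baseline alone.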
