Presentation Information
[1K3-GS-3a-04]Dataset Similarity Learning via Multi-View Fusion of Metadata and Tabular Data Sampling
〇Haoyang Cheng1, Teruaki Hayashi1 (1. Univ. of Tokyo)
Keywords: Dataset Similarity, Representation Learning, Dataset Embedding
On modern data platforms, users discover datasets through metadata and small content previews. This study proposes a multi-view framework for learning dataset–dataset similarity that decomposes metadata into three complementary views—Tag, Text, and Behavior—and augments them with a Content view derived from sampled main tables. The Tag and Text views embed dataset–tag and dataset–word bipartite graphs by running type-constrained random walks and training Skip-gram with negative sampling (SGNS) on the resulting walk sequences. The Behavior view captures functional proximity from creator and user interaction signals. The Content view summarizes tables using compact column sketches built from sampled rows and columns, embedded via a sentence-transformer model. These four view-specific similarity graphs are integrated through a reliability-aware extension of Similarity Network Fusion (SNF), which adaptively weights views per dataset and iteratively refines neighborhood structures. Experiments on Meta Kaggle Datasets (up to ~100K datasets), evaluated with proxy ground-truth signals, demonstrate that four-view fusion consistently outperforms single-view and naïve fusion baselines across standard ranking metrics, and remains robust under partial content availability and strict sampling constraints.
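To make the fusion step concrete, the following is a minimal sketch of the *plain* Similarity Network Fusion iteration that the proposed method extends. It does not include the paper's reliability-aware per-dataset weighting; it simply diffuses each view's normalized similarity matrix through the average of the other views, restricted to a k-nearest-neighbor kernel. All function names, the dense-NumPy representation, and the stopping rule (fixed iteration count) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def normalize(W):
    # Full-kernel normalization: off-diagonal mass is halved so that
    # each row sums to 1 with P[i, i] = 0.5 (as in standard SNF).
    off_diag_sum = W.sum(axis=1, keepdims=True) - np.diag(W)[:, None]
    P = W / (2.0 * off_diag_sum)
    np.fill_diagonal(P, 0.5)
    return P

def knn_kernel(W, k):
    # Sparse local kernel: keep each row's k largest similarities,
    # zero out the rest, then row-normalize.
    S = np.zeros_like(W)
    for i in range(W.shape[0]):
        idx = np.argsort(W[i])[-k:]
        S[i, idx] = W[i, idx]
    return S / S.sum(axis=1, keepdims=True)

def snf(views, k=3, iters=10):
    # views: list of symmetric, nonnegative similarity matrices,
    # one per view (e.g., Tag, Text, Behavior, Content).
    P = [normalize(W) for W in views]
    S = [knn_kernel(W, k) for W in views]
    for _ in range(iters):
        P_next = []
        for v in range(len(views)):
            # Diffuse the average of the OTHER views through view v's
            # local kNN kernel, then re-normalize.
            others = [P[u] for u in range(len(views)) if u != v]
            avg = sum(others) / len(others)
            P_next.append(normalize(S[v] @ avg @ S[v].T))
        P = P_next
    # Fused similarity: average of the converged per-view matrices.
    return sum(P) / len(P)
```

A reliability-aware variant, as described in the abstract, would replace the uniform average over the other views with per-dataset weights reflecting how trustworthy each view is for that dataset (e.g., down-weighting the Content view when no table sample is available).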
