Presentation Information
[1K3-GS-3a-04]Dataset Similarity Learning via Multi-View Fusion of Metadata and Tabular Data Sampling
〇Haoyang Cheng1, Teruaki Hayashi1 (1. Univ. of Tokyo)
Keywords: Dataset Similarity, Representation Learning, Dataset Embedding
On modern data platforms, users discover datasets through metadata and small content previews. This study proposes a multi-view framework for learning dataset–dataset similarity that decomposes metadata into three complementary views—Tag, Text, and Behavior—and augments them with a Content view derived from sampled main tables. The Tag and Text views embed dataset–tag and dataset–word bipartite graphs by running type-constrained random walks and training Skip-gram with negative sampling (SGNS) on the resulting walk sequences. The Behavior view captures functional proximity from creator and user interaction signals. The Content view summarizes tables using compact column sketches built from sampled rows and columns, embedded via a sentence-transformer model. These four view-specific similarity graphs are integrated through a reliability-aware extension of Similarity Network Fusion (SNF), which adaptively weights views per dataset and iteratively refines neighborhood structures. Experiments on Meta Kaggle Datasets (up to ~100K datasets), evaluated with proxy ground-truth signals, demonstrate that four-view fusion consistently outperforms single-view and naïve fusion baselines across standard ranking metrics, and remains robust under partial content availability and strict sampling constraints.
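To make the fusion step concrete, the following is a minimal sketch of the *plain* Similarity Network Fusion iteration that the proposed method extends. It does not include the paper's reliability-aware per-dataset weighting; it simply diffuses each view's normalized similarity matrix through the average of the other views, restricted to a k-nearest-neighbor kernel. All function names, the dense-NumPy representation, and the stopping rule (fixed iteration count) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def normalize(W):
    # Full-kernel normalization: off-diagonal mass is halved so that
    # each row sums to 1 with P[i, i] = 0.5 (as in standard SNF).
    off_diag_sum = W.sum(axis=1, keepdims=True) - np.diag(W)[:, None]
    P = W / (2.0 * off_diag_sum)
    np.fill_diagonal(P, 0.5)
    return P

def knn_kernel(W, k):
    # Sparse local kernel: keep each row's k largest similarities,
    # zero out the rest, then row-normalize.
    S = np.zeros_like(W)
    for i in range(W.shape[0]):
        idx = np.argsort(W[i])[-k:]
        S[i, idx] = W[i, idx]
    return S / S.sum(axis=1, keepdims=True)

def snf(views, k=3, iters=10):
    # views: list of symmetric, nonnegative similarity matrices,
    # one per view (e.g., Tag, Text, Behavior, Content).
    P = [normalize(W) for W in views]
    S = [knn_kernel(W, k) for W in views]
    for _ in range(iters):
        P_next = []
        for v in range(len(views)):
            # Diffuse the average of the OTHER views through view v's
            # local kNN kernel, then re-normalize.
            others = [P[u] for u in range(len(views)) if u != v]
            avg = sum(others) / len(others)
            P_next.append(normalize(S[v] @ avg @ S[v].T))
        P = P_next
    # Fused similarity: average of the converged per-view matrices.
    return sum(P) / len(P)
```

A reliability-aware variant, as described in the abstract, would replace the uniform average over the other views with per-dataset weights reflecting how trustworthy each view is for that dataset (e.g., down-weighting the Content view when no table sample is available).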
