Presentation Information

[C10-03] Development of a method for identifying rare antigen-specific T cell clones by integrating a protein language model and optimal transport

*Kyohei Kinoshita (1), Tetsuya J Kobayashi (2) (1. Graduate School of Engineering, The University of Tokyo (Japan); 2. Institute of Industrial Science, The University of Tokyo (Japan))

Keywords:

T cell receptor repertoire, Antigen specificity, Optimal transport, Protein language model

T cells recognize and eliminate a wide range of external threats, including previously unseen pathogens, by maintaining a diverse T cell receptor (TCR) repertoire. Predicting TCR antigen specificity is therefore an important challenge for understanding immune responses and for vaccine development. However, antigen-specific TCRs exist at low frequencies (approximately one in a million) even after infection or vaccination, which makes identifying them difficult both algorithmically and computationally. This study aimed to develop a new method that identifies low-frequency antigen-specific TCRs quickly and accurately.

We developed an analytical pipeline that combines TCR sequence embedding by a protein language model (PLM) with optimal transport theory. Using TCR repertoire data from the COVID-19 infection phase (day 15) and recovery phase (day 85) [1], we converted each TCR sequence into a 64-dimensional vector with the PLM and applied optimal transport on a per-V-gene basis. The resulting transport costs, which serve as an indicator of antigen specificity, were smoothed and normalized, and TCRs with high scores were selected as candidate antigen-specific TCRs. Accuracy was evaluated by matching the candidates against a database of SARS-CoV-2-specific TCRs.
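The embedding and transport steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the cost function, the solver, or how a per-TCR score is derived from the transport plan. The sketch assumes a squared Euclidean cost between embeddings, entropy-regularized optimal transport solved by Sinkhorn iterations, clone counts as the mass of each TCR, and a per-clone score defined as the average cost paid per unit of mass leaving that clone. The function names (`sinkhorn_plan`, `per_clone_transport_cost`) and the random vectors standing in for PLM embeddings are hypothetical.

```python
import numpy as np


def sinkhorn_plan(a, b, cost, reg=0.05, n_iter=500):
    """Entropy-regularized OT plan between histograms a and b (Sinkhorn iterations)."""
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)            # scale columns toward marginal b
        u = a / (K @ v)              # scale rows toward marginal a
    return u[:, None] * K * v[None, :]


def per_clone_transport_cost(emb_src, counts_src, emb_dst, counts_dst, reg=0.05):
    """Average transport cost per unit mass for each source clone.

    Assumption: intended to be called once per V gene group, with
    emb_* of shape (n_clones, dim) and counts_* of shape (n_clones,).
    """
    diff = emb_src[:, None, :] - emb_dst[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)        # squared Euclidean cost (assumed)
    scale = cost.max()
    if scale > 0:
        cost = cost / scale                  # rescale to [0, 1] for numerical stability
    a = counts_src / counts_src.sum()        # clone counts as probability mass
    b = counts_dst / counts_dst.sum()
    plan = sinkhorn_plan(a, b, cost, reg=reg)
    return (plan * cost).sum(axis=1) / plan.sum(axis=1)


# Toy usage: random 64-dimensional vectors stand in for PLM embeddings.
rng = np.random.default_rng(0)
emb_day15 = rng.normal(size=(5, 64))
emb_day85 = rng.normal(size=(6, 64))
counts_day15 = np.array([1, 2, 1, 30, 3], dtype=float)
counts_day85 = np.array([4, 1, 1, 2, 25, 1], dtype=float)
scores = per_clone_transport_cost(emb_day15, counts_day15, emb_day85, counts_day85)
```

In this sketch a higher score means that a clone's mass had to travel farther in the embedding space to match the other time point. The smoothing and normalization steps are omitted because the abstract does not specify them.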

The method identified low-frequency antigen-specific TCRs (frequencies below 10) more accurately than existing frequency-based methods [2, 3] and a similarity-based method [4]. This may be because antigen-specific TCRs tend to cluster at the edges of V gene clusters in the embedding space, which suggests that optimal transport captures these TCRs. In addition, optimizing the transport algorithms for large datasets significantly reduced the computational cost and enabled the identification of antigen-specific TCRs with frequencies of 3 or less, which are difficult to identify with existing methods.

This method makes it possible to identify low-frequency TCRs that conventional analysis methods have difficulty detecting. Applied to other infectious disease models and vaccination studies, it is expected to contribute to a comprehensive understanding of antigen-specific T cell responses.


References:
[1] Minervina, A. A. et al. eLife 10, e63502 (2021).
[2] Robinson, M. D. et al. Bioinformatics 26, 139–140 (2010).
[3] Pogorelyy, M. V. et al. Proc. Natl. Acad. Sci. U. S. A. 115, 12704–12709 (2018).
[4] Olson, B. J. et al. PLoS Comput. Biol. 18, e1010681 (2022).