The growing number of large-scale biobanks and institutional data networks has brought unique opportunities to link patients' genomics, electronic health records, and survey data for studying complex human diseases. In particular, these resources can help address the diminished model performance in minority and disadvantaged groups that results from their low representation in biomedical research. In this talk, I will introduce two statistical learning methods targeting underrepresented populations by integrating data from multiple biobanks, different ancestries, and related health outcomes. These methods protect data privacy by learning from models pre-trained on external data sources, without sharing patient-level data, and they account for potential data heterogeneity. We provide theoretical guarantees for model performance and insights into when the external model can help the target model. We demonstrate that our methods outperform benchmark methods, with examples using data from the UK Biobank and the electronic Medical Records and Genomics (eMERGE) Network.