This dataset contains English word embeddings pre-trained on biomedical texts from MEDLINE®/PubMed® using gensim's Word2Vec implementation. The embeddings of this dataset are an improved version of the Word2Vec embeddings we released in 2014 (http://bioasq.lip6.fr/info/BioASQword2vec/) in the context of the BioASQ challenge (http://www.bioasq.org/).

Two versions of the word embeddings are provided, both in Word2Vec's C binary format:
- 200-dimensional embeddings: file pubmed2018_w2v_200D.bin
- 400-dimensional embeddings: file pubmed2018_w2v_400D.bin
In both versions, the vocabulary size is 2,665,547 types (distinct words).

Additional technical information:

- Papers and code of Mikolov et al.'s original Word2Vec:
  https://code.google.com/archive/p/word2vec/
  https://arxiv.org/pdf/1301.3781.pdf
  https://arxiv.org/pdf/1310.4546.pdf
  https://www.aclweb.org/anthology/N13-1090

- Word2Vec implementation used: gensim's Word2Vec (version 3.3.0).
  https://radimrehurek.com/gensim/models/word2vec.html

- Corpus used: MEDLINE/PubMed Baseline Repository 2018 (January 2018).
  https://www.nlm.nih.gov/databases/download/pubmed_medline.html

- Preprocessing:
  Step 1: From the XML files of the MEDLINE/PubMed Baseline Repository, we extracted and used only the title and abstract of each article.
  Step 2: All the text fields of the abstracts were split into sentences using the sentence splitter (sent_tokenize) of NLTK (version 3.2.3).
          http://www.nltk.org/api/nltk.tokenize.html
  Step 3: All the titles and all the sentences of the abstracts were preprocessed and tokenized using the "bioclean" function, which is included in the toolkit.py script that accompanies the word embeddings of the BioASQ challenge.
  Step 4: gensim's Word2Vec implementation (skip-gram model) was then applied to the preprocessed and tokenized titles and sentences of the abstracts.

- Data statistics:
  Number of articles: 27,836,723
  Number of articles with title and abstract: 17,730,230
  Number of articles with title only (no abstract): 10,106,493
  Number of titles and sentences: 173,755,513
  Number of tokens: 3,580,134,037
  Average sentence length (treating titles as sentences): 20.6 tokens

- Word2Vec settings used:
  min_count=5 (minimum corpus frequency)
  sg=1 (use skip-gram)
  hs=0 (use negative sampling)
  size=200, 400 (embedding dimensions)
  window=5 (maximum distance between the current and the predicted word)
  workers=20
  All other parameters were set to the defaults they have in gensim version 3.3.0.

Terms and conditions:

This dataset (word embeddings) was produced from a dataset (the corpus described above) provided by the National Library of Medicine (NLM). The following Terms and Conditions apply to NLM data:
https://www.nlm.nih.gov/databases/download/terms_and_conditions.html
This dataset does not reflect the most current/accurate data available from NLM.

This dataset was produced and is provided by the Natural Language Processing Group of the Department of Informatics, Athens University of Economics and Business, Greece (http://nlp.cs.aueb.gr/) under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license:
https://creativecommons.org/licenses/by-nc-sa/4.0/
https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode

If you use this dataset or part of it, please cite the following paper:

R. McDonald, G. Brokos and I. Androutsopoulos, "Deep Relevance Ranking Using Enhanced Document-Query Interactions", Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, 2018.
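
Example usage: the following is a minimal sketch of loading and querying the embeddings with gensim's KeyedVectors (it assumes gensim is installed and that the .bin file is in the current directory; the query word "aspirin" is only illustrative).

  # Minimal sketch: load the C-binary-format embeddings with gensim.
  from gensim.models import KeyedVectors

  # Load the 200-dimensional version (use the 400D file for the other version).
  wv = KeyedVectors.load_word2vec_format('pubmed2018_w2v_200D.bin', binary=True)

  print(wv.vector_size)              # 200
  print(wv['aspirin'][:5])           # first 5 dimensions of one word vector
  print(wv.most_similar('aspirin'))  # nearest neighbours by cosine similarity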
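
A rough sketch of preprocessing Steps 2-3 is given below. The clean_tokenize function is only a simplified stand-in for the actual "bioclean" function of toolkit.py; to reproduce the vocabulary exactly, use toolkit.py itself. The sketch assumes NLTK and its "punkt" tokenizer data are installed.

  # Rough sketch of Steps 2-3; clean_tokenize() is a simplified stand-in
  # for the actual "bioclean" function of toolkit.py.
  import re
  from nltk.tokenize import sent_tokenize

  def clean_tokenize(text):
      # Lowercase, replace non-alphanumeric characters with spaces, split
      # on whitespace (a simplification of the real cleaning rules).
      return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()

  abstract = "Aspirin reduces fever. It also relieves mild pain."
  sentences = [clean_tokenize(s) for s in sent_tokenize(abstract)]
  # [['aspirin', 'reduces', 'fever'], ['it', 'also', 'relieves', 'mild', 'pain']]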
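
A sketch of the training call of Step 4 with the settings listed above, using the gensim 3.x API, is shown below. The file name pubmed_sentences.txt is hypothetical: it stands for a file with one preprocessed, tokenized title or abstract sentence per line.

  # Sketch of Step 4 (training) with the settings listed above (gensim 3.x API).
  from gensim.models import Word2Vec
  from gensim.models.word2vec import LineSentence

  # Hypothetical file: one preprocessed, whitespace-tokenized sentence per line.
  sentences = LineSentence('pubmed_sentences.txt')

  model = Word2Vec(
      sentences,        # preprocessed titles and abstract sentences
      size=200,         # embedding dimensions (400 for the second version)
      window=5,         # max distance between current and predicted word
      min_count=5,      # minimum corpus frequency
      sg=1,             # skip-gram
      hs=0,             # negative sampling (gensim's default negative=5)
      workers=20,       # training threads
  )
  model.wv.save_word2vec_format('pubmed2018_w2v_200D.bin', binary=True)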
George Brokos and Ion Androutsopoulos
August 20, 2018