    Download Pre-trained Word Vectors

    Oscova has a built-in Word Vector loader that can load word vectors from large vector data files generated by GloVe, Word2Vec, or fastText models.

    If you do not have domain-specific data to train on during development, you can download any of the following pre-trained models. We have gathered them from various external sources and are making the links available here. All of the word vector data below that is in text file format should be compatible with Syn.Bot framework version 2.6.0 and above.

    Note

    The Word Vector loader in Oscova has been heavily optimized for performance and can load 10,000 Word Vectors per second or more depending on your machine configuration.
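
    As a rough illustration of what this rate means in practice, the sketch below estimates upper-bound load times for some of the vocabulary sizes listed in the Word2Vec and GloVe table on this page. It assumes the quoted 10,000 vectors per second as a lower bound on throughput; the function name estimated_load_seconds is hypothetical and is not part of Oscova.

    ```python
    # Vocabulary sizes taken from the Word2Vec and GloVe table on this page.
    VOCABULARY_SIZES = {
        "Google News": 3_000_000,
        "Wikipedia+Gigaword 5": 400_000,
        "Common Crawl 42B": 1_900_000,
        "Common Crawl 840B": 2_200_000,
    }

    # Assumption: the quoted rate is a lower bound; real throughput
    # depends on the machine and may be higher.
    VECTORS_PER_SECOND = 10_000


    def estimated_load_seconds(vocabulary_size, vectors_per_second=VECTORS_PER_SECOND):
        """Return an upper-bound load time in seconds at the given rate."""
        return vocabulary_size / vectors_per_second


    for name, size in VOCABULARY_SIZES.items():
        seconds = estimated_load_seconds(size)
        print(f"{name}: up to about {seconds:.0f} s ({seconds / 60:.1f} min)")
    ```

    At this rate the largest files take a few minutes to load, while the 400,000-word GloVe files load in well under a minute.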

    FastText Models

    Below are pre-trained word vectors for 294 languages, trained on Wikipedia using fastText, a word vector model developed by the Facebook research team. The 300-dimensional vectors were obtained using the skip-gram model described in Bojanowski et al. (2016) with default parameters.

    Note

    We recommend loading these word vectors only in production and not during development, as it may take several minutes for the entire vector data to load, which can hinder fast prototyping.

    All word vector data loaded in Oscova is used in both Standard and Deep processing modes.
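
    One possible way to keep prototyping fast is to work with a smaller copy of a vector file during development. Because the words in these files are ordered by descending frequency (see Format), keeping only the first lines retains the most common words. The sketch below is a minimal example under stated assumptions: the helper truncate_vectors is hypothetical, it assumes the file starts with a "<word count> <dimension>" header line as fastText text output normally does, and whether Oscova accepts the truncated file should be verified separately.

    ```python
    from itertools import islice


    def truncate_vectors(source_path, target_path, max_words):
        """Copy the header and the first `max_words` vectors into a smaller file.

        Returns the number of vectors written.
        """
        with open(source_path, encoding="utf-8") as source:
            header = source.readline().split()
            # Assumption: the first line is "<word count> <dimension>".
            if len(header) != 2 or not all(part.isdigit() for part in header):
                raise ValueError("expected a '<count> <dimension>' header line")
            dimension = header[1]
            # Held in memory so the new header can state the exact count;
            # fine for the small subsets used while prototyping.
            lines = list(islice(source, max_words))

        with open(target_path, "w", encoding="utf-8", newline="\n") as target:
            target.write(f"{len(lines)} {dimension}\n")
            target.writelines(lines)
        return len(lines)
    ```

    For example, truncate_vectors("full.vec", "small.vec", 50000) would write the 50,000 most frequent words to small.vec; both file names are placeholders.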

    Format

    The word vectors below are in the fastText text format, where each line contains a word followed by its embedding. Values are separated by spaces. Words are ordered by frequency, in descending order.
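
    To make the layout concrete, here is a minimal Python sketch of a reader for this text format. It is not how Oscova's own loader is implemented; the function name read_word_vectors and its limit parameter are hypothetical. fastText text output normally begins with a "<word count> <dimension>" header line, so the reader skips such a line when it is present.

    ```python
    def read_word_vectors(path, limit=None):
        """Yield (word, vector) pairs from a text-format word vector file.

        `limit` optionally stops after that many vectors; since words are
        ordered by descending frequency, this keeps the most common ones.
        """
        count = 0
        with open(path, encoding="utf-8") as handle:
            for line_number, line in enumerate(handle):
                if limit is not None and count >= limit:
                    break
                # Split on single spaces only, after dropping the line ending
                # and any trailing space.
                parts = line.rstrip().split(" ")
                # Assumption: an all-numeric two-field first line is the
                # "<word count> <dimension>" header, not a vector.
                if line_number == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                    continue
                if len(parts) < 2:
                    continue  # skip blank or malformed lines
                word = parts[0]
                vector = [float(value) for value in parts[1:]]
                yield word, vector
                count += 1
    ```

    Calling dict(read_word_vectors("vectors.vec", limit=1000)) would build a lookup of the 1,000 most frequent words; the file name is a placeholder.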

    Models

    The models can be downloaded from:

    Abkhazian: Download Acehnese: Download Adyghe: Download
    Afar: Download Afrikaans: Download Akan: Download
    Albanian: Download Alemannic: Download Amharic: Download
    Anglo_Saxon: Download Arabic: Download Aragonese: Download
    Aramaic: Download Armenian: Download Aromanian: Download
    Assamese: Download Asturian: Download Avar: Download
    Aymara: Download Azerbaijani: Download Bambara: Download
    Banjar: Download Banyumasan: Download Bashkir: Download
    Basque: Download Bavarian: Download Belarusian: Download
    Bengali: Download Bihari: Download Bishnupriya Manipuri: Download
    Bislama: Download Bosnian: Download Breton: Download
    Buginese: Download Bulgarian: Download Burmese: Download
    Buryat: Download Cantonese: Download Catalan: Download
    Cebuano: Download Central Bicolano: Download Chamorro: Download
    Chavacano: Download Chechen: Download Cherokee: Download
    Cheyenne: Download Chichewa: Download Chinese: Download
    Choctaw: Download Chuvash: Download Classical Chinese: Download
    Cornish: Download Corsican: Download Cree: Download
    Crimean Tatar: Download Croatian: Download Czech: Download
    Danish: Download Divehi: Download Dutch: Download
    Dutch Low Saxon: Download Dzongkha: Download Eastern Punjabi: Download
    Egyptian Arabic: Download Emilian_Romagnol: Download English: Download
    Erzya: Download Esperanto: Download Estonian: Download
    Ewe: Download Extremaduran: Download Faroese: Download
    Fiji Hindi: Download Fijian: Download Finnish: Download
    Franco_Provençal: Download French: Download Friulian: Download
    Fula: Download Gagauz: Download Galician: Download
    Gan: Download Georgian: Download German: Download
    Gilaki: Download Goan Konkani: Download Gothic: Download
    Greek: Download Greenlandic: Download Guarani: Download
    Gujarati: Download Haitian: Download Hakka: Download
    Hausa: Download Hawaiian: Download Hebrew: Download
    Herero: Download Hill Mari: Download Hindi: Download
    Hiri Motu: Download Hungarian: Download Icelandic: Download
    Ido: Download Igbo: Download Ilokano: Download
    Indonesian: Download Interlingua: Download Interlingue: Download
    Inuktitut: Download Inupiak: Download Irish: Download
    Italian: Download Jamaican Patois: Download Japanese: Download
    Javanese: Download Kabardian: Download Kabyle: Download
    Kalmyk: Download Kannada: Download Kanuri: Download
    Kapampangan: Download Karachay_Balkar: Download Karakalpak: Download
    Kashmiri: Download Kashubian: Download Kazakh: Download
    Khmer: Download Kikuyu: Download Kinyarwanda: Download
    Kirghiz: Download Kirundi: Download Komi: Download
    Komi_Permyak: Download Kongo: Download Korean: Download
    Kuanyama: Download Kurdish (Kurmanji): Download Kurdish (Sorani): Download
    Ladino: Download Lak: Download Lao: Download
    Latgalian: Download Latin: Download Latvian: Download
    Lezgian: Download Ligurian: Download Limburgish: Download
    Lingala: Download Lithuanian: Download Livvi_Karelian: Download
    Lojban: Download Lombard: Download Low Saxon: Download
    Lower Sorbian: Download Luganda: Download Luxembourgish: Download
    Macedonian: Download Maithili: Download Malagasy: Download
    Malay: Download Malayalam: Download Maltese: Download
    Manx: Download Maori: Download Marathi: Download
    Marshallese: Download Mazandarani: Download Meadow Mari: Download
    Min Dong: Download Min Nan: Download Minangkabau: Download
    Mingrelian: Download Mirandese: Download Moksha: Download
    Moldovan: Download Mongolian: Download Muscogee: Download
    Nahuatl: Download Nauruan: Download Navajo: Download
    Ndonga: Download Neapolitan: Download Nepali: Download
    Newar: Download Norfolk: Download Norman: Download
    North Frisian: Download Northern Luri: Download Northern Sami: Download
    Northern Sotho: Download Norwegian (Bokmål): Download Norwegian (Nynorsk): Download
    Novial: Download Nuosu: Download Occitan: Download
    Old Church Slavonic: Download Oriya: Download Oromo: Download
    Ossetian: Download Palatinate German: Download Pali: Download
    Pangasinan: Download Papiamentu: Download Pashto: Download
    Pennsylvania German: Download Persian: Download Picard: Download
    Piedmontese: Download Polish: Download Pontic: Download
    Portuguese: Download Quechua: Download Ripuarian: Download
    Romani: Download Romanian: Download Romansh: Download
    Russian: Download Rusyn: Download Sakha: Download
    Samoan: Download Samogitian: Download Sango: Download
    Sanskrit: Download Sardinian: Download Saterland Frisian: Download
    Scots: Download Scottish Gaelic: Download Serbian: Download
    Serbo_Croatian: Download Sesotho: Download Shona: Download
    Sicilian: Download Silesian: Download Simple English: Download
    Sindhi: Download Sinhalese: Download Slovak: Download
    Slovenian: Download Somali: Download Southern Azerbaijani: Download
    Spanish: Download Sranan: Download Sundanese: Download
    Swahili: Download Swati: Download Swedish: Download
    Tagalog: Download Tahitian: Download Tajik: Download
    Tamil: Download Tarantino: Download Tatar: Download
    Telugu: Download Tetum: Download Thai: Download
    Tibetan: Download Tigrinya: Download Tok Pisin: Download
    Tongan: Download Tsonga: Download Tswana: Download
    Tulu: Download Tumbuka: Download Turkish: Download
    Turkmen: Download Tuvan: Download Twi: Download
    Udmurt: Download Ukrainian: Download Upper Sorbian: Download
    Urdu: Download Uyghur: Download Uzbek: Download
    Venda: Download Venetian: Download Vepsian: Download
    Vietnamese: Download Volapük: Download Võro: Download
    Walloon: Download Waray: Download Welsh: Download
    West Flemish: Download West Frisian: Download Western Punjabi: Download
    Wolof: Download Wu: Download Xhosa: Download
    Yiddish: Download Yoruba: Download Zazaki: Download
    Zeelandic: Download Zhuang: Download Zulu: Download

    References

    If you use these word embeddings, please cite the following paper:

    P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

    @article{bojanowski2016enriching,
      title={Enriching Word Vectors with Subword Information},
      author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
      journal={arXiv preprint arXiv:1607.04606},
      year={2016}
    }
    

    License

    The pre-trained word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0.

    Word2Vec and GloVe Models

    Model file                  Number of dimensions  Corpus (size)              Vocabulary size  Author           Architecture
    Google News                 300                   Google News (100B)         3M               Google           word2vec
    Freebase IDs                1000                  Google News (100B)         1.4M             Google           word2vec, skip-gram
    Freebase names              1000                  Google News (100B)         1.4M             Google           word2vec, skip-gram
    Wikipedia+Gigaword 5        50                    Wikipedia+Gigaword 5 (6B)  400,000          GloVe            GloVe
    Wikipedia+Gigaword 5        100                   Wikipedia+Gigaword 5 (6B)  400,000          GloVe            GloVe
    Wikipedia+Gigaword 5        200                   Wikipedia+Gigaword 5 (6B)  400,000          GloVe            GloVe
    Wikipedia+Gigaword 5        300                   Wikipedia+Gigaword 5 (6B)  400,000          GloVe            GloVe
    Common Crawl 42B            300                   Common Crawl (42B)         1.9M             GloVe            GloVe
    Common Crawl 840B           300                   Common Crawl (840B)        2.2M             GloVe            GloVe
    Twitter (2B Tweets)         25                    Twitter (27B)              ?                GloVe            GloVe
    Twitter (2B Tweets)         50                    Twitter (27B)              ?                GloVe            GloVe
    Twitter (2B Tweets)         100                   Twitter (27B)              ?                GloVe            GloVe
    Twitter (2B Tweets)         200                   Twitter (27B)              ?                GloVe            GloVe
    Wikipedia dependency        300                   Wikipedia (?)              174,015          Levy & Goldberg  word2vec modified
    DBPedia vectors (wiki2vec)  1000                  Wikipedia (?)              ?                Idio             word2vec


    Copyright © 2011-2018 Synthetic Intelligence Network