Language model

Short description: Statistical model of language

A language model is a model of the human brain's ability to produce natural language.^[1]^[2] Language models are useful for a variety of tasks, including speech recognition,^[3] machine translation,^[4] natural language generation (generating more human-like text), optical character recognition, route optimization,^[5] handwriting recognition,^[6] grammar induction,^[7] and information retrieval.^[8]^[9]

Large language models (LLMs), currently their most advanced form, are predominantly based on transformers trained on large datasets (frequently using texts scraped from the public internet). They have superseded recurrent neural network-based models, which had previously superseded the purely statistical models, such as the word n-gram language model.

History

Noam Chomsky did pioneering work on language models in the 1950s by developing a theory of formal grammars.^[10]

In 1980, statistical approaches were explored and found to be more useful for many purposes than rule-based formal grammars. Discrete representations like word n-gram language models, with probabilities for discrete combinations of words, made significant advances.

In the 2000s, continuous representations for words, such as word embeddings, began to replace discrete representations.^[11] Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning, and such that common relationships between pairs of words, like plurality or gender, are also captured.
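As a rough illustration of closeness in the vector space, the sketch below compares toy embeddings with cosine similarity and treats the gender relationship as a vector offset. The three-dimensional vectors, the `EMBEDDINGS` table, and the helper names are invented for this example; real embeddings are learned from data and have far more dimensions.

```python
import math

# Toy 3-dimensional embeddings; the values are invented for illustration only.
EMBEDDINGS = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.8, 0.1, 0.6],
    "man":   [0.3, 0.7, 0.1],
    "woman": [0.3, 0.2, 0.6],
    "apple": [-0.5, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is most similar to `vector`."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return max(candidates, key=lambda w: cosine(vector, EMBEDDINGS[w]))

# Apply the "man -> woman" offset to "king" and look up the closest word.
analogy = [k - m + w for k, m, w in zip(EMBEDDINGS["king"],
                                        EMBEDDINGS["man"],
                                        EMBEDDINGS["woman"])]
print(nearest(analogy, exclude={"king", "man", "woman"}))
```

With these hand-picked vectors the offset lands on "queen"; with learned embeddings such analogies hold only approximately.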

Pure statistical models

In 1980, the first significant statistical language model was proposed, and during the decade IBM performed 'Shannon-style' experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.^[12]

Models based on word n-grams

A word n-gram language model is a statistical model of language which calculates the probability of the next word in a sequence from a fixed size window of previous words. If one previous word is considered, it is a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.^[13]
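Written out, the fixed-window assumption can be sketched as follows. The count function $C$ is notation introduced here for illustration, and the second line is the unsmoothed relative-frequency estimate:

$P (w_{m} ∣ w_{1}, \dots, w_{m - 1}) \approx P (w_{m} ∣ w_{m - n + 1}, \dots, w_{m - 1})$

$P (w_{m} ∣ w_{m - n + 1}, \dots, w_{m - 1}) = \frac{C (w_{m - n + 1}, \dots, w_{m})}{C (w_{m - n + 1}, \dots, w_{m - 1})}$

For a bigram model ($n = 2$) the history reduces to the single word $w_{m - 1}$.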

Special tokens $⟨ s ⟩$ and $⟨ / s ⟩$ are introduced to denote the start and end of a sentence. To prevent a zero probability being assigned to unseen words, the probability of each seen word is slightly lowered to make room for the unseen words in a given corpus. To achieve this, various smoothing methods are used, from simple "add-one" smoothing (adding 1 to the count of every n-gram, so that unseen n-grams receive a count of 1, as an uninformative prior) to more sophisticated techniques, such as Good–Turing discounting or back-off models.
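A minimal sketch of a bigram model with add-one smoothing is given below. The toy corpus, the whitespace tokenization, and the function names are assumptions made for this example rather than part of any particular toolkit.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence start and end markers

def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = {EOS}  # words that can be predicted; BOS is never predicted
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = [BOS] + words + [EOS]
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts, vocab

def bigram_prob(prev, word, history_counts, bigram_counts, vocab):
    """Add-one estimate of P(word | prev): every bigram count is raised by 1."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))

toy_corpus = ["the cat sat", "the dog sat"]
history_counts, bigram_counts, vocab = train_bigram(toy_corpus)

# Seen bigram "the cat": (1 + 1) / (2 + 5)
seen = bigram_prob("the", "cat", history_counts, bigram_counts, vocab)
# Unseen bigram "cat the": (0 + 1) / (1 + 5), small but not zero
unseen = bigram_prob("cat", "the", history_counts, bigram_counts, vocab)
print(seen, unseen)
```

Adding the vocabulary size to the denominator keeps the probabilities for each history summing to one, which is how mass is taken from seen bigrams and given to unseen ones.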

Word n-gram models have largely been superseded by recurrent neural network–based models, which in turn have been superseded by Transformer-based models often referred to as large language models.^[14]

Exponential

Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

$P (w_{m} ∣ w_{1}, \dots, w_{m - 1}) = \frac{1}{Z (w_{1}, \dots, w_{m - 1})} \exp (a^{T} f (w_{1}, \dots, w_{m}))$

where $Z (w_{1}, \dots, w_{m - 1})$ is the partition function, $a$ is the parameter vector, and $f (w_{1}, \dots, w_{m})$ is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on $a$ or some form of regularization.
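The equation can be sketched in a few lines for the simplest case of indicator features. The vocabulary, the chosen n-grams, and the weight values below are invented for illustration; in practice the parameter vector $a$ is fitted to data, usually with regularization.

```python
import math

VOCAB = ["cat", "dog", "sat"]  # toy vocabulary (assumption)

# Each feature is an indicator for one n-gram ending at the candidate word.
FEATURE_NGRAMS = [("the", "cat"), ("the", "dog"), ("cat",)]
WEIGHTS = [1.0, 0.5, 0.2]  # the parameter vector a; values invented for illustration

def features(history, word):
    """f(w_1..w_m): 1.0 where the sequence ends with the feature's n-gram, else 0.0."""
    sequence = tuple(history) + (word,)
    return [1.0 if sequence[-len(ng):] == ng else 0.0 for ng in FEATURE_NGRAMS]

def score(history, word):
    """The inner product a^T f(w_1..w_m)."""
    return sum(a * f for a, f in zip(WEIGHTS, features(history, word)))

def maxent_prob(history, word):
    """P(word | history) = exp(score) / Z(history)."""
    z = sum(math.exp(score(history, w)) for w in VOCAB)  # partition function
    return math.exp(score(history, word)) / z

probs = {w: maxent_prob(["the"], w) for w in VOCAB}
print(probs)
```

The partition function is recomputed for every history because it sums over the whole vocabulary, which is the main computational cost of this model family.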

The log-bilinear model is another example of an exponential language model.

Skip-gram model

Neural models

Recurrent neural network

Continuous representations or embeddings of words are produced in recurrent neural network-based language models (known also as continuous space language models).^[15] Such continuous space embeddings help to alleviate the curse of dimensionality, which arises because the number of possible sequences of words grows exponentially with sequence length (a vocabulary of size $V$ admits $V^{n}$ sequences of $n$ words), which in turn causes a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.^[16]

Large language models

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation.^[17]^[18] The largest and most capable LLMs are generative pre-trained transformers (GPTs) and provide the core capabilities of chatbots such as ChatGPT, Gemini and Claude. LLMs can be fine-tuned for specific tasks or guided by prompt engineering.^[19] These models acquire predictive power regarding syntax, semantics, and ontologies^[20] inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on.^[21]

They consist of billions to trillions of parameters and operate as general-purpose sequence models, generating, summarizing, translating, and reasoning over text. LLMs represent a significant new technology in their ability to generalize across tasks with minimal task-specific supervision, enabling capabilities like conversational agents, code generation, knowledge retrieval, and automated reasoning that previously required bespoke systems.^[22]

LLMs evolved from earlier statistical and recurrent neural network approaches to language modeling. The transformer architecture, introduced in 2017, replaced recurrence with self-attention, allowing efficient parallelization, longer context handling, and scalable training on unprecedented data volumes.^[23] This innovation enabled models like GPT, BERT, and their successors, which demonstrated emergent behaviors at scale such as few-shot learning and compositional reasoning.^[24]
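A much simplified sketch of self-attention follows. It assumes the queries, keys, and values are the input vectors themselves, whereas a real transformer layer applies learned projection matrices and uses several attention heads; the function names and toy input are invented for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    largest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    output = []
    for query in x:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        # The output is a weighted average of all value vectors.
        output.append([sum(w * value[j] for w, value in zip(weights, x))
                       for j in range(d)])
    return output

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
attended = self_attention(tokens)
print(attended)
```

Each output row depends only on the inputs and not on other output rows, so all positions can be computed at once, unlike a recurrent network, which must process tokens in order.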

Reinforcement learning, particularly policy gradient algorithms, has been adapted to fine-tune LLMs for desired behaviors beyond raw next-token prediction.^[25] Reinforcement learning from human feedback (RLHF) applies these methods to optimize a policy, the LLM's output distribution, against reward signals derived from human or automated preference judgments.^[26] This has been critical for aligning model outputs with user expectations, improving factuality, reducing harmful responses, and enhancing task performance.

Benchmark evaluations for LLMs have evolved from narrow linguistic assessments toward comprehensive, multi-task evaluations measuring reasoning, factual accuracy, alignment, and safety.^[27]^[28] Hill climbing, iteratively optimizing models against benchmarks, has emerged as a dominant strategy, producing rapid incremental performance gains but raising concerns of overfitting to benchmarks rather than achieving genuine generalization or robust capability improvements.^[29]

Although LLMs sometimes match human performance, it is not clear whether they are plausible cognitive models. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do.^[30]

Evaluation and benchmarks

Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks derived from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves.^[31]

Various data sets have been developed for use in evaluating language processing systems.^[32] These include:

Massive Multitask Language Understanding (MMLU)^[33]
Corpus of Linguistic Acceptability^[34]
GLUE benchmark^[35]
Microsoft Research Paraphrase Corpus^[36]
Multi-Genre Natural Language Inference
Question Natural Language Inference
Quora Question Pairs^[37]
Recognizing Textual Entailment^[38]
Semantic Textual Similarity Benchmark
Stanford Question Answering Dataset (SQuAD)^[39]
Stanford Sentiment Treebank^[40]
Winograd NLI
BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs^[41]

References

↑ Blank, Idan A. (November 2023). "What are large language models supposed to model?". Trends in Cognitive Sciences 27 (11): 987–989. doi:10.1016/j.tics.2023.08.006. PMID 37659920. "LLMs are supposed to model how utterances behave."
↑ Jurafsky, Dan; Martin, James H. (2021). "N-gram Language Models". Speech and Language Processing (3rd ed.). https://web.stanford.edu/~jurafsky/slp3/3.pdf. Retrieved 24 May 2022.
↑ Kuhn, Roland, and Renato De Mori (1990). "A cache-based natural language model for speech recognition". IEEE transactions on pattern analysis and machine intelligence 12.6: 570–583.
↑ Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013). "Semantic parsing as machine translation". Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
↑ Liu, Yang; Wu, Fanyou; Liu, Zhiyuan; Wang, Kai; Wang, Feiyue; Qu, Xiaobo (2023). "Can language models be used for real-world urban-delivery route optimization?". The Innovation 4 (6). doi:10.1016/j.xinn.2023.100520. PMID 37869471. Bibcode: 2023Innov...400520L.
↑ Pham, Vu, et al. (2014). "Dropout improves recurrent neural networks for handwriting recognition". 14th International Conference on Frontiers in Handwriting Recognition. IEEE.
↑ Htut, Phu Mon, Kyunghyun Cho, and Samuel R. Bowman (2018). "Grammar induction with neural language models: An unusual replication". arXiv:1808.10000.
↑ Ponte, Jay M.; Croft, W. Bruce (1998). "A language modeling approach to information retrieval". Proceedings of the 21st ACM SIGIR Conference. Melbourne, Australia: ACM. pp. 275–281. doi:10.1145/290941.291008.
↑ Hiemstra, Djoerd (1998). "A linguistically motivated probabilistic model of information retrieval". Proceedings of the 2nd European conference on Research and Advanced Technology for Digital Libraries. LNCS, Springer. pp. 569–584. doi:10.1007/3-540-49653-X_34.
↑ Chomsky, N. (September 1956). "Three models for the description of language". IRE Transactions on Information Theory 2 (3): 113–124. doi:10.1109/TIT.1956.1056813. ISSN 2168-2712.
↑ "The Nature Of Life, The Nature Of Thinking: Looking Back On Eugene Charniak's Work And Life" (in en). 2022-02-22. https://cs.brown.edu/news/2022/02/22/the-nature-of-life-the-nature-of-thinking-looking-back-on-eugene-charniaks-work-and-life/.
↑ Rosenfeld, Ronald (2000). "Two decades of statistical language modeling: Where do we go from here?". Proceedings of the IEEE 88 (8): 1270–1278. doi:10.1109/5.880083. https://figshare.com/articles/journal_contribution/6611138.
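↑ Jurafsky, Dan; Martin, James H. (2021). "N-gram Language Models". Speech and Language Processing (3rd ed.). https://web.stanford.edu/~jurafsky/slp3/3.pdf. Retrieved 24 May 2022.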
↑ Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (March 1, 2003). "A neural probabilistic language model". The Journal of Machine Learning Research 3: 1137–1155. https://dl.acm.org/doi/10.5555/944919.944966.
↑ Karpathy, Andrej. "The Unreasonable Effectiveness of Recurrent Neural Networks". https://karpathy.github.io/2015/05/21/rnn-effectiveness/.
↑ Bengio, Yoshua (2008). "Neural net language models". Scholarpedia. 3. p. 3881. doi:10.4249/scholarpedia.3881. Bibcode: 2008SchpJ...3.3881B. http://www.scholarpedia.org/article/Neural_net_language_models. Retrieved 28 August 2015.
↑ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Matthew; Bernstein, Michael S.; Bohg, Jeannette et al. (2021). On the Opportunities and Risks of Foundation Models.
↑ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda (2020). "Language Models are Few-Shot Learners". arXiv:2005.14165 [cs.CL].
↑ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav et al. (Dec 2020). Larochelle, H.; Ranzato, M.; Hadsell, R. et al., eds. "Language Models are Few-Shot Learners". Advances in Neural Information Processing Systems (Curran Associates, Inc.) 33: 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. Retrieved 2023-03-14.
↑ Fathallah, Nadeen; Das, Arunav; De Giorgis, Stefano; Poltronieri, Andrea; Haase, Peter; Kovriguina, Liubov (2024-05-26). "NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning". Extended Semantic Web Conference 2024. Hersonissos, Greece. https://2024.eswc-conferences.org/wp-content/uploads/2024/05/77770034.pdf.
↑ Manning, Christopher D. (2022). "Human Language Understanding & Reasoning". Daedalus 151 (2): 127–138. doi:10.1162/daed_a_01905. https://www.amacad.org/publication/human-language-understanding-reasoning. Retrieved 2023-03-09.
↑ Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361 [cs.LG].
↑ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". arXiv:1706.03762 [cs.CL].
↑ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
↑ Christiano, Paul; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". arXiv:1706.03741 [stat.ML].
↑ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex (2022). "Training language models to follow instructions with human feedback". arXiv:2203.02155 [cs.CL].
↑ Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv:1804.07461 [cs.CL].
↑ Hendrycks, Dan; Burns, Collin; Basart, Steven; Zou, Andy; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob (2020). Measuring Massive Multitask Language Understanding.
↑ Recht, Benjamin; Roelofs, Rebecca; Schmidt, Ludwig; Shankar, Vaishaal (2019). "Do ImageNet Classifiers Generalize to ImageNet?". arXiv:1902.10811 [cs.CV].
↑ Hornstein, Norbert; Lasnik, Howard; Patel-Grosz, Pritty; Yang, Charles (2018-01-09) (in en). Syntactic Structures after 60 Years: The Impact of the Chomskyan Revolution in Linguistics. Walter de Gruyter GmbH & Co KG. ISBN 978-1-5015-0692-5. https://books.google.com/books?id=XoxsDwAAQBAJ&dq=adger+%22goldilocks%22&pg=PA153. Retrieved 11 December 2021.
↑ Karlgren, Jussi; Schutze, Hinrich (2015), "Evaluating Learning Language Representations", International Conference of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science, Springer International Publishing, pp. 254–260, doi:10.1007/978-3-319-64206-2_8, ISBN 978-3-319-64205-5
↑ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018-10-10). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
↑ Hendrycks, Dan (2023-03-14), Measuring Massive Multitask Language Understanding, https://github.com/hendrycks/test, retrieved 2023-03-15
↑ "The Corpus of Linguistic Acceptability (CoLA)". https://nyu-mll.github.io/CoLA/.
↑ "GLUE Benchmark" (in en). https://gluebenchmark.com/.
↑ "Microsoft Research Paraphrase Corpus" (in en-us). https://www.microsoft.com/en-us/download/details.aspx?id=52398.
↑ Aghaebrahimian, Ahmad (2017), "Quora Question Answer Dataset", Text, Speech, and Dialogue, Lecture Notes in Computer Science, 10415, Springer International Publishing, pp. 66–73, doi:10.1007/978-3-319-64206-2_8, ISBN 978-3-319-64205-5
↑ Sammons, Mark; Vydiswaran, V. G. Vinod; Roth, Dan. "Recognizing Textual Entailment". http://l2r.cs.uiuc.edu/~danr/Teaching/CS546-12/TeChapter.pdf.
↑ "The Stanford Question Answering Dataset". https://rajpurkar.github.io/SQuAD-explorer/.
↑ "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". https://nlp.stanford.edu/sentiment/treebank.html.
↑ "llama/MODEL_CARD.md at main · meta-llama/llama" (in en). https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md.