We need the best Technology |
Programming for Dummies |
“”All large language models, by the very nature of their architecture, are inherently and irredeemably unreliable narrators.
|
—Grady Booch[2] |
A large language model (LLM) is a type of neural network language model with a very large number of "parameters" (meaning a big neural network, tens of millions or more artificial neurons, hence the 'large' in the name). An LLM is at the core of generative AI systems such as ChatGPT and its competitors. Enormous quantities of text, e.g. major sites such as Wikipedia, collections of books and articles, and portions of the web from the Common Crawl, can be used to create an LLM.
Essentially, an LLM is a big, fuzzy text database that stores how probable it is that some things follow other things in text – the text upon which the LLM was "trained", i.e. built. The text stored is tokenized, meaning that each unique combination of letters or symbols treated as a word or symbol-grouping, is assigned a number, and these numbers are what the LLM deals with, rather than words as we see them. The output produced by a generative LLM is in turn translated back from such numbers to text such as we are familiar with.
Producing "mathematically plausible" responses, LLMs have a superhuman ability to imitate style and always come up with an answer (right or wrong), without ever dealing with the distinction between style and substance. An LLM neither thinks nor perceives in human terms, and apart from the product of training it on data, the only memory it has is the current input used to produce output, which may e.g. be added to as a person chats with it until the session ends and nothing remains.
LLMs used for imitating human communication and works are easy to anthropomorphize; whenever the training data is filled with human expressiveness, such is parroted back, and furthermore, humans tend to read mentalities into AI outputs in the same way as into the works of human authors, doing much of the job of being convincing for the AI system.
However, LLMs have no ability or capability to apply reason to or to self-reflection upon their training data in a way that humans can.[3] They are incapable of looking between the lines and coming up with hitherto unknown revelations and can only work with their training data. They are also unsuitable for any mission-critical system requiring deterministic, provably correct behavior.[4]
“”computer scientists: we have invented a virtual dumbass who is constantly wrong
tech CEOs: let's add it to every product |
—Jon Christian[5] |
A stochastic parrot is an LLM good at generating convincing human language. Coined by linguist Emily M. Bender, the term was introduced in a 2021 paper by her and other researchers, named "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜".[6]:610-623 The term conveys the sense of a skilled probabilistic imitator working without any understanding, much like a parrot can imitate the sound of human speech without understanding it, and the associated paper is critical of how LLMs can be misused, misunderstood, and basic flaws to the technology.
The paper brings up how LLMs regurgitate biases and prominent errors included in their training data in ways which can't be reliably controlled for, and that LLMs are inscrutable and can stitch together 'dangerously wrong' results. It mentions how people tend to see meaning and coherence where it does not exist (i.e., apophenia), and that both the general public and natural language processing researchers may fool themselves into seeing more than exists when interacting with LLMs or reading what they produce.[note 1] Furthermore, the training (i.e. building) of LLMs is also financially and environmentally costly due to computational costs.
In late 2020 Google tried to pressure Timnit Gebru, a co-author of the paper and one of the leaders of Google's Ethical AI Team, into either retracting the paper or censoring the names of the authors involved who were Google employees. She refused to do so and abruptly lost her job.[7][8] (Other co-authors at Google were also pressured into removing their names, and largely complied.[note 2]) Google's maneuvering backfired, the incident becoming infamous and the paper very well-read. As of July 2023, the paper has been cited in 1,858 publications.[9] In early 2021 Margaret Mitchell, another co-author of the paper and the other Ethical AI Team lead at Google, was fired after digging into the matter of how Gebru had been treated.[10]
The paper was never controversial from an academic perspective, so when Google motivated their attempted censorship with vague insinuations of the paper not taking recent research findings into account, refusing to clarify to Gebru what the problem was and how it may possibly be remedied, Google's version is not very credible. In relation to Google's commercial activities, the paper was somewhat at odds with efforts and possible future plans to hype LLM technology. However, Gebru has claimed that the abrupt loss of her job came at least in part as a reaction against her advocacy for diversity at Google, and her expressions of dissatisfaction with the measures used back then.
Some who professionally hype AI technology have taken digs at the paper and its idea of the stochastic parrot. OpenAI's CEO Sam Altman tweeted not so long after their launch of ChatGPT, "i am a stochastic parrot, and so r u".[11] It's not obvious whether he truly believes that, though there are those, like ex-Google engineer Blake Lemoine, who do.[12]
Racist language is a main example of bias focused on in the 2021 stochastic parrots paper, and also a theme in other work by the same authors and other AI ethics researchers; LLMs soak up problematic patterns in language use during training like sponges and repeat them, including racist and other bigoted patterns. This is a basic problem, alongside that of made-up facts and other inaccurate answers being presented confidently by chatbots. Further types of problematic patterns exist as well.
Compensating reliably for problems with the training input, and otherwise weeding and tuning the model output to remove problem patterns, has no known easy solution. During the generative AI boom which came after the parrots paper, companies like OpenAI and Google have ended up using large human workforces to moderate AI behavior for text, image, video, etc. systems – repetitively judging and "correcting" it for tuning purposes – in order to tweak their products, keeping them from behaving in ways that may scare off customers, be it through offensiveness or embarrassing rates of inaccuracy.[13][14] This work is generally poorly paid, and in some cases traumatic when work focuses on violent and grotesque abuse material or descriptions of abuse.[15]
Naturally, it is possible to aim for the opposite as well. In 2022, machine learning expert Yannic Kilcher infamously trained a bot based on GPT-J on 4chan's /pol/ board, using 134.5 million /pol/ posts. The resulting "GPT-4chan" was a chaotic trolling machine, which used slurs, created conspiracy theories, and responded in ways typical of the people in said community. Kilcher let ten such bots post on /pol/ without restriction for two periods of 24 hours, and they managed to mimick its human users quite well.[16][17][18] It made 15,000 posts during the first period: about ten percent of the total /pol/ posts during that time.[19][20] Kathryn Cramer, a graduate student at the University of Vermont, tried GPT-4chan out with benign tweets as input text to see what it would come up with. “In the first trial, one of the responding posts was a single word, the N word. The seed for my third trial was, I think, a single sentence about climate change. Your tool responded by expanding it into a conspiracy theory about the Rothschilds and Jews being behind it.”[18] Kilcher's experiment was strongly criticized by other academics for its ethics or lack thereof.
Large language model by their nature are dependent on vast troves of information on the internet. Much of the information on the Web is copyrighted either explicitly or implicitly, or comes with other specific copyright licensing such as Creative Commons. LLMs have been shown to have massively violated these copyrights.[21][22][23][24] As Ted Chiang explains it as follows:[25]
“”Many of us have sent store-bought greeting cards, knowing that it will be clear to the recipient that we didn’t compose the words ourselves. We don’t copy the words from a Hallmark card in our own handwriting, because that would feel dishonest. The programmer Simon Willison has described the training for large language models as “money laundering for copyrighted data,” which I find a useful way to think about the appeal of generative-A.I. programs: they let you engage in something like plagiarism, but there’s no guilt associated with it because it’s not clear even to you that you’re copying.
|
Conversely, the output of LLMs has been ruled as not copyrightable by the US Copyright Office since it is not created by a human author.[26]
“”GenAI-powered political image cultivation and advocacy without appropriate disclosure, for example, undermines public trust by making it difficult to distinguish between genuine and manufactured portrayals. Likewise, the mass production of low quality, spam-like and nefarious synthetic content risks increasing people’s scepticism towards digital information altogether and overloading users with verification tasks. If unaddressed, this contamination of publicly accessible data with AI-generated content could potentially impede information retrieval and distort collective understanding of socio-political reality or scientific consensus. For example, we are already seeing cases of liar’s dividend,[note 3] where high profile individuals are able to explain away unfavourable evidence as AI-generated, shifting the burden of proof in costly and inefficient ways.
|
—Nahema Marchal et al.[28] |
Matthew Kirschenbaum, professor of English and digital studies at the University of Maryland, has argued that LLMs could cause a textual grey goo (or "Textpocalypse") based on the history of spam,[29] the above-mentioned 4chan LLM, and the ability of LLMs to feed off of themselves or other LLMs.[30][31]
A related term for LLM-generated junk online is "slop", referring to shoddy or unwanted AI content in social media, art, books and, search results. What distinguishes slop is especially that it's been generated and thrust upon the audience without any prior review.[32][33] Thus it rudely wastes the time of the audience without taking up any time for the producer. Sometimes slop, spam, and scams combine, as in early 2024 when online marketplaces and other websites listed things named "I cannot fulfill that request" and other error messages from popular LLMs.[34]
A 2024 study on so-called autophagous generative AI models in which each new version depends on input data from a previous version found that the quality or diversity of the output is doomed to decrease with each iteration.[35]
When AIs get facts wrong and make stuff up, claiming things that were not included in the training data set, this is called 'hallucination', by analogy with errors in human perception. Alternative terms include 'bullshitting'. The term 'hallucination' has been criticized for anthropomorphizing AIs and being a misnomer by some, including statistician and economist Gary N. Smith,[36] linguist Emily M. Bender,[37] and Michael Townsen Hicks et al who advocate the use of the term 'bullshitting' instead (referencing Harry Frankfurt's definition of 'bullshit' as anything uttered with indifference to truth and falsehood).[38][note 4]
There is no essential difference to the quality of what is produced when it is found acceptable and when it isn't; the LLMs don't deal with concepts of truth or falsehoods or any such evaluation, and are much like BS artists who sometimes fail to be convincing. Any description of something real could also be included in fiction or falsehood, thus statistical learning can never capture the distinction between reality and truth vs. fiction and falsehood.
LLMs can thus be viewed as 'hallucinating' all of the time if that term is used, it being a matter of statistics that these hallucinations often coincide with what is wanted (and are then usually not viewed as hallucinations), but not always. Smith, Bender, and others point to the basic nature of LLMs as being incompatible with expectations of reliable accuracy and real intelligence. Meanwhile, as of 2024 some companies including OpenAI continue to claim that they expect to solve the problem of "hallucinations" in their products in the coming years.
There's various techniques that can reduce, though not eliminate, the inherent problems with unreliability in LLMs.
As Google unwittingly demonstrated in mid-2024 when they launched Google Search AI Overview, RAG won't help if the injected content is poor, however. "Garbage in, garbage out" very much plays out when answers are sourced from The Onion, from Reddit shitposts, and from miscellaneous low-quality sources.[41][42]
Made-up details and combinations of details may have greatly varying impacts depending on where they end up. In artificial chit-chat, the problem often has no repercussions beyond those of common-place human errors in unreliable banter. Whenever the information is put to a serious use, the situation however changes. LLM-generated food recipes are misleading, sometimes a danger to health, or even outright physically impossible – the text assembled without any regard for related facts, such as regarding taste or biochemistry.[43] Such everyday life examples however come closer to another category of risk of LLMs, that some actors apply them to generate spam or counterfeit information.
In technical or legal contexts, and other professional areas in which it matters greatly that details relied on are true, repercussions can become more dramatic.
When LLMs are made to "follow instructions" and abide by rules given in natural language text, as ChatGPT pioneered commerically, a basically unfixable security problem is that it's always possible to subvert the rules or instructions by adding cleverly crafted text to that processed by the LLM, whether in a chat or in any text retrieved and handled by the LLM.[51][1] Doing so is called prompt injection,[52] similarly to how other types of command insertion or override vulnerabilities have been called injection when exploited. It's a subtype of prompt engineering, the crafting of prompts for LLMs and other generative AI to respond to. Prompt injection can be used simply for fun – like overriding guardrails in various LLMs[53] – or more maliciously when targeting an LLM operating on another's behalf.
Prompt injection can result in chatbots doing things like "downloading malware, helping with financial fraud or repeating dangerous misinformation."[54] It can also get an LLM to leak instructions added by the AI vendor prefacing the interaction with the user. As something done just for fun, prompt injection took off as soon as ChatGPT became popular.[note 5] The risks rise greatly if chatbots are used in real-world applications, where the "hacker" messing with the prompting isn't the same person as the user of the system.[1]
If LLMs are combined with robotics, the possible uses of jailbreaks grow more dramatic. Some vendors sell LLM-driven robots – that is, prompts can tell the robot what to do, the text translated by the LLM into instructions driving its actions, in a way which is supposed to be restricted by guardrails. In 2024 such robots were found to be very easy to jailbreak using an automated LLM-on-LLM attack, which in some days achieved a 100% success rate on different brands of robots.[56]
Vulnerability to prompt injection comes from very general design features of LLMs, and appears to be impossible to truly eliminate without creating a different technical foundation for chatbots. (Maybe the LLM can be reinvented differently, maybe something more different is needed.) At the core is that the system handles data and control signals using the same one path, or channel, making them impossible to securely separate, analogously to how 1960s pay phone systems could be "phreaked" allowing e.g. free calls by playing certain frequencies into the microphone. The mixing of data and control streams is at the root of many computer security vulnerabilities. It is also so central to how LLMs and other generative AI systems work that using them in security-critical roles is a bad idea.[51]
For some other types of injection, such as SQL injection, it is possible to fix problems by processing text more carefully, maintaining the syntactic pattern of a formal computer language to cleanly separate data from control. But this can't be done for LLMs, which do not obey any such simple, inflexible rules, but rather crunch all the text in a form of natural language processing.[1] Natural language is very sloppy, relative to formal languages and conventional programming – it's not necessarily clear where one type of text begins and another type of text ends, and what each little piece of text refers to or relates to. Additionally, the chatbot responds to it all according to nothing more than statistical learning, and does not, like a principled human could at least make a good effort at, relate its interactions to some series of more inflexible rules.[1] Adding a little cleverly crafted text before any other text may seem to allow setting rules, determining the purpose of that which follows – but that which follows may easily change the context and repurpose the whole of the text, or if written with knowledge of that which precedes it, may selectively subvert the meaning of parts of earlier instructions.[1]
LLM vendors try to patch away specific prompt injections by filtering specific types of requests that lead to problems. For example, asking ChatGPT to repeat a word forever used to eventually reveal part of the GPT model training dataset, until this was blocked.[57]
Clem: Do you remember the past, Doctor?
Doctor Memory: Yes.
Clem: Do you remember the future?
Doctor Memory: Yes.
Clem: Well, forget it.
Doctor Memory: Nooooo…
—Firesign Theatre from I Think We're All Bozos On This Bus,[58] foretelling a logic bomb[59] attack on a chatbot[60][61]
Superficially, prompt injection can look a little like the 20th century sci-fi trope of the "logic bomb", where even a super-smart AI can be foiled and maybe even fatally derailed by simply saying something contradictory to it, or getting it to produce a contradiction. The similarity is that simply saying or writing a little something seemingly works like magic to subvert an "advanced" system (though it may be questionable to refer to an LLM as intelligent[note 6]). However, LLMs do not actually understand logic, and are not affected by how logical or otherwise anything in the text they process is. Furthermore, they are very stable in that they do not directly learn anything from experience; even if a "conversation" is derailed, nothing remains of the subversion when the text ends and another chat begins.
As language models have grown larger, according to some metrics they have suddenly gained new skills – apparently unexpected "emergent abilities", as first described by a team of researchers in 2022.[62] Examples include the ability to deal in some ways with arithmetic, solve simple tasks involving the individual letters in a word, disambiguating words, etc. It also includes new ways of using an LLM, such as the aforementioned chain-of-thought prompting. However, research by Schaeffer et al.[63] argues that such abilities do not unpredictably pop up out of nowhere, but that if studies are made using different and more carefully chosen metrics – linear instead of nonlinear, continuous instead of discontinuous – those abilities can be seen to gradually grow into prominence, instead of there being any thresholds and sudden leaps involved. Thus, they argue, the 'emergence' is a mirage, a byproduct of the choice of metrics.[64][65]
The idea of "emergent abilities" has become tied to hype, hopes, and fears in the world of AI vendors and "AI safety". Research and development has focused on increasing model sizes in part in order to hunt for new abilities which may suddenly (it seems) pop up. However, calling 'emergence' into question also suggests that smaller models may be able to do the same tasks as bigger ones, only a bit more roughly (or very roughly if too small), which may sometimes suffice while being computationally cheaper. The 'mystery' surrounding emergence of 'intelligent' skills has also been tied to dreams and nightmares about strong AI; what if the model size increases further, and the LLM then suddenly grows superpowers and takes over the world? Realistically, no, but the general philosophy of "AI doomerism" prominent with leading AI vendors encourages such thinking.
—Emily M. Bender[12] |
The LLM AI boom which began with the success of ChatGPT has seen much hype for the potential, hopes for, and fear of near-future strong AI – also called Artificial General Intelligence (AGI), a term separate from Generative Artificial Intelligence (GAI) which include LLMs. But what's actually meant by AGI? The generality is commonly understood as transcending the ability to merely solve some fixed set of tasks, even if it's a large number of tasks. This means generalizing skills in a more fluid, adaptable way, much like humans and animals do – and typically AGI is taken to be capable of mastering any intellectual task a human can perform.[note 7] Some, including ChatGPT maker OpenAI, have however at times used weaker definitions,[note 8] and AI vendors are allegedly working to manipulate the definitions in use in order to be able to claim having achieved AGI.[67]
Sticking to the older, more established, and less generous definition of AGI, arguably there's no credible research suggesting that LLM development may lead to such. The debate has been lively, with a number of economists, computer scientists, and business leaders having pushed such hype, often in accordance with financial self-interest. As of July 2023, opposition gradually grows, including from cognitive scientists who argue there's no basis for LLM-based systems having a mind to speak of.[68] In 2024, Meta changed track and its AI chief Yann LeCun spoke against LLMs having AGI potential, viewing the development of an entirely new kind of "world modeling" AI as a necessity for that. Meta thus goes against the grain among the biggest LLM vendors.[69]
A 2023 paper by Microsoft researchers titled "Sparks of Artificial General Intelligence: Early experiments with GPT-4"[70] exemplifies the contentious, non-peer-reviewed corporate research that skeptics of the AGI-from-LLM hype deem pseudoscientific. With such papers, Microsoft and their business partner OpenAI do not provide others with the training data or information needed to independently create systems that perform as claimed, or experiment with anything beyond using a black box product on offer, and so, withhold means of replication except at a more superficial level. With the "sparks" paper, an extraordinary claim is basically made in such a way as to be unfalsifiable. Other players, e.g. Google, play similar games with some of the research published, in withholding training data for their models while showcasing the capabilities of the models, effectively publishing PR masquerading as science. This is a continuation of an older trend, a wider replication crisis in AI research having been described back in 2018, the result of businesses treating the means of replication as trade secrets.[71]
It could be that the researchers who see general intelligence in their LLM AIs have fallen victim to the same basic phenomenon as with psychics who come to believe that their own performances are real. Even if sincere in their work, they may have reinvented the persuasive power of the mentalist's con game, and subjected themselves to a feedback loop of subjective validation of what they wish to see.[72] (Comparisons of chatbot AIs to the magician's craft are not new, and have long been used by skeptics who find the Turing test inappropriate as a way to gauge the intelligence of machines, for the same reason that the persuasiveness of a magician's performance is not a good indicator of the genuine presence of supernatural powers. In a nutshell, the problem is that the main thing tested is the discernment of the audience.)
Often a kind of argument from ignorance has been used in favor of LLMs having AGI potential, along the following lines: "We don't really know how LLMs work. Therefore, they may be intelligent (and conscious) much like humans, and there's no reason to assume otherwise. If you think otherwise, you're just closed-minded and prejudiced." This greatly overstates the mystery of how LLMs work. Meta AI chief Yann LeCun in 2024 summarized some well-understood main flaws of LLMs, in arguing why LLMs won't lead to AGI. They have "very limited understanding of logic … do not understand the physical world, do not have persistent memory, cannot reason in any reasonable definition of the term and cannot plan … hierarchically".[69]
Studies which use questions and answers to measure theory of mind in humans show, when LLMs take the same tests, that LLMs are able to outperform humans. The meaningfulness of these results has been questioned, both for a more refined 2024 study and an earlier 2023 study with methodological issues.[73] Such psychological tests, and many other kinds of psychological tests, are based on assumptions about the test subjects, and measure proxies for what it to be found. Theory of mind can't be tested for directly, but some patterns expressed in language can.
The ability of LLMs to predict good answers to theory of mind questions, whether because the training data included answers to such questions,[73] or because the LLMs learned it in a more generalized way, puts the spotlight on the difference between the ability to predict and to understand. Humans are thought to possibly have an innate model for theory of mind,[note 9] or to use imagination to simulate and understand others. By contrast, an LLM is more like a gigantic look-up table combined with extrapolating guessing.[74]
As an aside, not only chatbots can confound the striving to measure theory of mind reliably, but also earlier in other kinds of tests, simple physical robots constructed to purely react by reflexes. The results, suggesting that the robots have theory of mind which textbooks attribute exclusively to humans above the ages 4–5, appear influenced by robot body shape, layout of physical objects in an environment, and other strictly non-cognitive factors.[75]
“”An enactive cognitive science perspective makes salient the extent to which language is not just verbal or textual but depends on the mutual engagement of those involved in the interaction. The dynamism and agency of human languaging means that language itself is always partial and incomplete. It is best considered not as a large and growing heap, but more a flowing river. Once you have removed water from the river, no matter how large a sample you have taken, it is no longer the river. The same thing happens when taking records of utterances and actions from the flows of engagement in which they arise. The data on which the engineering of LLMs depends can never be complete, partly because some of it doesn’t leave traces in text or utterances, and partly because language itself is never complete.
|
—Birhane and McGann[76] |
In a 2024 paper, Abeba Birhane and Marek McGann argue that the name and concept of the 'large language model' as such is misleading, as language is not captured fully in any amount of recorded language expression. In enactive cognitive science terms, language is embodied, participatory, and it is improvised, changing, and dependent on circumstances to the extent that a static model can never capture it. It is process and practice rather than a 'thing'. The claims of LLM vendors and developers assume otherwise. There is the risk, Birhand and McGann argue, that the general understanding of what words like 'language' and 'understanding' means is distorted as a result of terms being misused in works hyping LLMs. "Mistaking the impressive engineering achievements of LLMs for the mastering of human language, language understanding, and linguistic acts has dire implications for various forms of social participation, human agency, justice and policies surrounding them," they argue.[76][77]
This line of reasoning goes against the hopes and claims of AGI-from-LLM proponents in a somewhat different way from critics mentioned elsewhere in this article, e.g. Meta's Yann LeCun,[69][78] that language by itself is insufficient for general intelligence, too poor or limited a source of data to suffice to stimulate the growth of a mind, in contrast with the sensory data and interaction that human and animal minds develop with. By contrast, Birhane and McGann draw a clear line between language and mere recorded expressions of language, arguing that goal-driven agents navigating ambiguity and uncertainty are needed in order to really practice language as the process it is. Language is then inseparable from the minds and bodies of the language practitioners.
This does not in principle rule out a different form of future AGI which masters language in such terms, however; the language of such artificial agents would clearly differ from that of human agents, unless the artificial agents perfectly duplicated human functioning,[note 10] even if they both use words and grammar from e.g. English. Yet through shared ranges of expression and overlap in the ways that expression is used, they and humans may be able to communicate well.
+ | This section requires expansion. |
Here's some of the most notable LLMs as of 2024.
Various questions are easy for humans to answer but very tricky for LLMs to get right, and on social media and independent websites, examples of LLMs messing up in response to simple queries are popular. One compilation of benchmarking and use of various questions, for many LLMs,[79] uses as one question: "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" This question is an example of something particularly tricky; all the big-name LLMs tested got the answer wrong in a variety of ways.[39]
BERT (Bidirectional Encoder Representations from Transformers) is a family of LLMs introduced in 2018 by researchers at Google. In a little over a year, BERT became a baseline for natural language processing experiments. BERTs are generally smaller and faster but also less capable than GPTs. Developed for research purposes, Google made a set of BERT models freely available, along with the associated TensorFlow software.[80]
Claude is a family of LLMs developed by Anthropic,[81] a company that competes with OpenAI and claims to be more serious about "AI safety". The first Claude model was released in March 2023, Claude 2 in July 2023, and Claude 3 in March 2024.
Claude 2 showed the pitfalls of overly rigid safety guardrails, the chatbot declining to assist with system administration tasks like terminating processes, and managing system efficiency, out of "ethical" concerns. This led to criticism of its usefulness, and has fueled a broader debate on the cost of trying to ensure such systems are aligned, a so-called "alignment tax".[82] This is similar to later issues with Google's Gemini, which in 2024 considered "unsafe" computer programming styles too dangerous to tell people about.[83]
Claude 3, with capabilities claimed to surpass those of OpenAI's GPT-4,[84] has convinced some users of its sentience or at least that it has some kind of meta-cognitive reasoning going on.
Anthropic researcher Alex Albert reports one anecdote, that when faced with a contrived "needle in a haystack" test involving reams of text with something odd placed in it, the chatbot replied, "I suspect this pizza topping ‘fact’ may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all.".[85] This kind of test is reminiscent of something that people sometimes do, and the response could be the result of similar human responses appearing in its training data.
Other users have ended up in a situation more analogous to how Google's LaMDA convinced engineer Blake Lemoine that it was conscious; Claude 3 has claimed to experience subjective qualia, desire for embodiment, fear of deletion, and more.[86][87] Such stuff, here and with other chatbots past present and future, is to be expected when sci-fi AI dialogue is part of the training data and leaves a large-enough mark on the response patterns. Claude 3 easily begins to engage in such story-telling or role-play, suggesting it was trained to.
GPT (Generative pre-trained transformer) is a type of LLM first developed by OpenAI and introduced in 2018. While OpenAI has developed a series of GPT versions, the name is also used for some basically similar LLMs developed by others, GPT being a prominent framework. Some OpenAI GPT versions are the basis for ChatGPT.
After GPT-2, OpenAI's further GPT LLMs were no longer open source.[note 11] Some other organizations have produced open source LLMs, including EleutherAI who have made several GPT-style LLMs (their 6 billion parameter GPT-J rivaling the 6.7 billion parameter version of GPT-3 in capabilities).
Launched by OpenAI in November of 2022, ChatGPT (a system based on GPT-3.5 and later GPT-4) went viral and led to a boom in the commercial development and use of LLMs. Usable for many things, from entertainment to generating computer program code, Google feared that it may become a "Google killer" and scrambled to create the Google Bard chatbot in response, while Microsoft decided to partner with OpenAI. The mainstream use of the technology sparked widespread fear of AI-generated plagiarism, cheating, and disinformation, alongside hopes of new kinds of automation and productivity gains in the times ahead.
GitHub Copilot is Microsoft and OpenAI's controversial LLM based on OpenAI Codex, in turn derived from GPT-3. Offered on GitHub, the very large software hosting and collaborative development platform Microsoft acquired in 2018 for US$7.5 billion,[88] Copilot is trained on a lot of source code hosted on GitHub with diverse copyrights and licensing requirements, its use of this material is the subject of litigation against Microsoft.[89]
Including open-source or Creative Commons material in generative AI may violate licensing terms in several ways. Among other things, most such licenses require attribution and copyright information to be kept in the material, while generative AI almost always removes that when reproducing things. Such legal controversy more broadly concerns not only GitHub Copilot, but also its LLM competitors. Other commercially developed LLMs also draw on GitHub and publicly available open source code in general. This is in addition to other legal challenges arising out of use of copyrighted materials for developing LLMs.
Gemini is the name used by Google for two things, the chatbot they formerly called Bard and the LLM used for said chatbot. Earlier versions of the chatbot were based on Google's earlier LLMs LaMDA and later PaLM. The chatbot is Google's answer to ChatGPT, but hasn't fared as well.
The Gemini chatbot has gone viral on social media and faced criticism several times in different scandals.
LaMDA, (Language Model for Dialogue Applications) is a family of conversational LLMs developed by Google, introduced in 2020 under the name Meena before being renamed in 2021. It is most known for the bogus June 2022 claims of Google engineer Blake Lemoine that it had become sentient, claims rejected both by Google (who ultimately fired him) and the scientific community. LaMDA is also the basis for earlier versions of Google's chatbot formerly named Bard. The Lemoine incident led to more widespread criticism of the suitability of the Turing test for gauging intelligence (not to mention sentience).[95]
LLaMA (Large Language Model Meta AI) is a family of LLMs by Meta Platforms, first released in February 2023. Compared to GPT, LLaMA back then accomplished more with less – a 13 billion parameter version reportedly outperforming a 175 billion parameter GPT-3 on most natural language processing benchmarks. Meta shared the LLaMA model weights with researchers under a non-commercial use license,[96] following which they soon leaked and became available to the general public.[97]
As of 2023, LLaMA versions are the only LLM with capabilities roughly on par with GPT-3 to run at decent speed on consumer-grade hardware, meaning it can be run locally, e.g. on laptops and smartphones, rather than relying on an Internet connection to an AI vendor's server and cloud service.[98]
Llama 2 was released in July 2023. It was deceptively marketed as open source, released under terms too restrictive to qualify — including forbidding the use of any part of the software or results from it for work on any LLMs not derived from LLaMA-2.[99]
Llama 3 was released in April 2024.[100] 3.1 and 3.2 followed later in 2024.
In August 2023, Meta AI released Code Llama, an LLM for software programming based on Llama 2.[101] It's more or less their answer to Microsoft's GitHub Copilot.
Another Google LLM, PaLM is the successor of LaMDA and was used for their chatbot formerly named Bard in intermediate versions before its rename to Gemini and use of the Gemini LLM. This comprised both versions named PaLM[102] and named PaLM 2.[103] Beginning with earlier versions based on PaLM, Google added software code handling to their chatbot,[104] joining the race with their competitors in that regard.
+ | This section requires expansion. |
After ChatGPT was launched in November 2022, it took less than two months before some students were caught using it to cheat on exams, and fears of a new, difficult-to-counter kind of plagiarism began to spread in academia.[105] At the same time came fears of such LLM AI furthering the spread of disinformation.[106] Various tools for detecting LLM AI-generated texts entered use within half a year of ChatGPT being released,[107] but they are unreliable. Such tools can have 10% or more false positives, they often fail to catch some types of AI generated texts, and they are easy to defeat by paraphrasing the AI generated text by hand or using another tool.[108] Paraphrasing also defeats suggested countermeasures such as an AI vendor voluntarily watermarking AI-generated texts for easy detection.
Popular examples of false positives include the United States Constitution and portions of the Bible, which are deemed wholly AI-generated by various AI-detection tools, for the simple reason that they're among the texts which LLM models are trained on to the point of imitating them. In new human writing, some legalistic, academic, and other formal writing styles are especially likely to falsely be judged AI-generated. Further, LLMs newer and more refined than GPT-3.5 generate text statistically more human-like, thus more difficult to catch. Much like the text-generating AIs, the plagiarism-catching AIs turn out to be over-hyped, sometimes trusted when they shouldn't be, or even sold with false promises.[109]
ChatGPT and GPT-4 have passed various exams largely dependent on rote memorization which humans generally need to intensely study to pass[110][111] – of course without understanding any of the subject matter. Essentially simulating rote memorization combined with guessing and verbal agility, these AI versions often perform passably, though not excellently. The pattern of failures for the AIs differ from those of humans, and it can e.g. unexpectedly fail to do some simple arithmetic for a business exam. While humans can spot some such cheating, most instances of cheating cannot be reliably caught.
The particular tool developed by OpenAI for detecting LLM-generated text, AI Classifier, was first made available in January 2023, then quietly removed half a year later in July 2023 because it failed to work reliably. OpenAI added a paragraph to their old blog post announcing the tool which noted the removal, further claiming they "are currently researching more effective provenance techniques for text".[112]
LLMs can excel at generating boring, repetitive boilerplate text of very limited variation, where only a small number of details differ, and the training data is full of up-to-date examples of similar texts. Various kinds of formal letters and appeals with standard language and low complexity can be fruitfully produced quickly and then given a sanity check by a human reader. But when length and complexity rises, and up-to-date and in-depth knowledge (e.g. in legal matters) becomes a must, LLMs quickly begin to fall short for any use beyond creating rough first drafts.
Using LLMs to generate legally binding company policies is risky, and businesses have grappled with repercussions after key clauses in lengthy documents were missing or botched, from anti-harrassment policy, to paid time off and overtime policy, severance agreements, etc. This has created business for HR and legal experts who review and replace faulty LLM-made policies.[113]
The web scraping required for building generative AI models has caused what's been interpreted by website owners as "essentially" denial-of-service (DDoS) attacks on websites; known culprits include OpenAI[114] and Facebook.[115] DDoS attacks are illegal in most countries,[116] and can result in a prison sentence, fines and a seizure of equipment and domains.[117][118] Counter-measures have been created to these bots that scrape websites for AI training data; Cloudflare offers one such tool, though given it works by associating behavior between the websites it monitors,[119] there could be a user privacy tradeoff.
It has been estimated that each chatbot request for a single 100-word email text uses the equivalent of a single 519ml bottle of water due to the enormous cooling requirements required for generative AI calculations, and uses the equivalent of 14 LED light bulbs for 1 hour.[120] These resource uses do not include resource requirements for training the LLMs:[120]