1 Introduction
In this article we examine how LangChain integrates with a range of LLMs, and we compare the platforms that make these models available. LangChain already supports some of the most popular publicly available pre-trained models. Earlier posts covered several of them, including ChatGPT, GPT-4, GPT-3, and GPT4ALL.
The framework offers nearly 30 integrations with well-known AI platforms, including OpenAI, Cohere, Writer, and Replicate, to name a few. Most significantly, it gives you access to the Hugging Face Hub API, which hosts more than 120K models that are simple to integrate into your applications. These organisations offer several ways to use their services.
Paying for API access is the norm. Pricing is typically based on the number of tokens processed, as with OpenAI, or on the GPU time a job requires, as with Hugging Face Inference or Amazon SageMaker. Most of these options are quick and simple to set up. Bear in mind, however, that you do not own the models, even when they have been fine-tuned on your valuable datasets. Unless the models are explicitly open source, such as many of those on Hugging Face, all you get is pay-as-you-go access to the API.
Alternatively, you can host the models locally on your own servers. Only this approach gives you complete control over the network and your dataset. Be aware of the costs it entails: hardware (a high-end GPU for low latency) and maintenance (the expertise to deploy and fine-tune models). Several publicly available models, such as LLaMA-1, are also not licensed for commercial use, although the more recently released LLaMA-2 is available for commercial use.
The best strategy differs for each use case and depends on factors including budget, model capabilities, expertise, and trade secrets. Creating a custom fine-tuned model is simple if you can provide your data to OpenAI’s API. If, on the other hand, the dataset is part of your intellectual property and cannot be shared, consider fine-tuning in-house.
Another consideration is the capabilities of the various models. A model’s ability to understand language is directly influenced by the size of the network and the quality of its dataset. Even so, a bigger model is not always the best solution. The Ada variant of GPT-3 has the lowest latency and is the fastest and most economical model in the collection. It is best suited to simpler jobs, though, such as text processing or classification. By contrast, the most recent GPT-4 version is the largest model and produces excellent results on every task.
However, because of its large number of parameters, it is also the slowest and most expensive option. Choosing a model based on its capabilities is therefore equally essential. Ada may be cheaper for building a conversational application, but it was not designed for that purpose and will give unsatisfactory results. (A separate article compares several well-known LLMs.)
The remainder of this article introduces a number of LangChain integrations to help you make the best decision.
2 Popular LLM models accessible to LangChain via API
2.1 Cohere Command
The Cohere service offers a variety of models, including Command (command) for dialogue-like interactions, Generation (basic) for generative tasks, Summarise (summarize-xlarge) for producing summaries, and more. Free, time-limited use is available for learning and prototyping. In other words, usage is free until you go into production; once you do, some models may cost a little more than the OpenAI APIs, for example $2.5 for generating 1K tokens. However, because Cohere provides models tailored to specific jobs, a more use-case-specific model can give better results on downstream tasks. These models are simple to access through LangChain’s Cohere class, for example Cohere(model="<MODEL_NAME>", cohere_api_key="<API_KEY>").
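The call above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the helper names cohere_kwargs and build_cohere_llm and the COHERE_API_KEY environment variable lookup are illustrative choices, and the import path assumes the LangChain package layout in use when this article was written (newer releases may expose the class elsewhere).

```python
import os


def cohere_kwargs(model_name: str, api_key: str) -> dict:
    """Collect the two arguments the text mentions for LangChain's Cohere class."""
    if not api_key:
        raise ValueError("a Cohere API key is required")
    return {"model": model_name, "cohere_api_key": api_key}


def build_cohere_llm(model_name: str = "command"):
    """Create the LLM object; needs the langchain and cohere packages and a real key."""
    # Imported lazily so this module still loads where langchain is not installed.
    # Assumption: this import path matches the installed LangChain version.
    from langchain.llms import Cohere

    api_key = os.environ.get("COHERE_API_KEY", "")
    return Cohere(**cohere_kwargs(model_name, api_key))
```

With a valid key in the environment, build_cohere_llm() returns an object that can be called with a prompt string or passed to an LLMChain.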
2.2 GPT-3.5
GPT-3.5 is a language model developed by OpenAI. Its turbo version, which OpenAI recommends over earlier variants, offers a cheaper way to generate human-like text through OpenAI’s API endpoints. The model is optimised for chat applications while remaining effective for other generative tasks, and it can process 96 languages. GPT-3.5-turbo is the most affordable option in the OpenAI collection: it costs only $0.002 per 1000 tokens and has a context length of up to 16K tokens. To access this model, pass the gpt-3.5-turbo model name when initialising the ChatOpenAI or OpenAI classes.
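To make the token-based pricing concrete, here is a small estimator. It is a sketch that assumes a single flat rate of $0.002 per 1000 tokens, as quoted above; the function name is hypothetical, and real bills may price prompt and completion tokens differently.

```python
PRICE_PER_1K_TOKENS_USD = 0.002  # gpt-3.5-turbo rate quoted in this article


def estimate_cost_usd(total_tokens: int,
                      price_per_1k: float = PRICE_PER_1K_TOKENS_USD) -> float:
    """Estimate the cost of processing `total_tokens` at a flat per-1K-token rate."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return total_tokens / 1000 * price_per_1k


if __name__ == "__main__":
    # Example: a workload of one million tokens.
    print(f"${estimate_cost_usd(1_000_000):.2f}")
```

The same function works for any token-priced provider by passing a different price_per_1k.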
2.3 GPT-4
The GPT-4 model from OpenAI is a capable multimodal model whose number of parameters and training details have not been disclosed. It is the newest and most powerful model OpenAI has released, and because it is multimodal it can handle both text and image input. Unfortunately, it is not generally available: access requires submitting an early access request through the OpenAI platform. The model comes in two variants, gpt-4 and gpt-4-32k, with context lengths of 8192 and 32768 tokens, respectively.
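The two context lengths matter in practice because the prompt and the generated completion share the same window. The following sketch picks the smallest variant that fits a request; the function name and the idea of choosing the smaller variant first are illustrative assumptions, not part of any OpenAI or LangChain API.

```python
from typing import Optional

# Context lengths quoted above, in tokens.
CONTEXT_WINDOWS = {"gpt-4": 8192, "gpt-4-32k": 32768}


def pick_gpt4_variant(prompt_tokens: int, max_completion_tokens: int) -> Optional[str]:
    """Return the smallest GPT-4 variant whose window fits prompt plus completion."""
    needed = prompt_tokens + max_completion_tokens
    # Try variants from the smallest window to the largest.
    for name, window in sorted(CONTEXT_WINDOWS.items(), key=lambda item: item[1]):
        if needed <= window:
            return name
    return None  # The request does not fit in either variant.
```

A request that returns None has to be shortened or split before it can be sent to either variant.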
2.4 Jurassic-2
The Jurassic-2 language model from AI21 comes in three sizes, Jumbo, Grande, and Large, at different price points. Although the model sizes are not public, the documentation describes the Jumbo version as the most powerful. AI21 characterises the models as general-purpose and excellent at every generative task. The J2 models support seven languages and can be fine-tuned on custom datasets. To access them, obtain an API key from the AI21 platform and use the AI21() class.
2.5 StableLM
StableLM Alpha is a language model created by Stability AI. It is available via Hugging Face Hub (with the id stabilityai/stablelm-tuned-alpha-3b) to host locally, or via the Replicate API at a rate of $0.0002 to $0.0023 per second. Two sizes are currently available: 3 billion and 7 billion parameters. The StableLM Alpha weights are distributed under a CC BY-SA 4.0 licence, which permits commercial use. StableLM uses a context length of 4096 tokens.
2.6 Dolly-v2-12B
Dolly-v2-12B is a language model developed by Databricks. It can be accessed via the Replicate API, in the same price range as described in the previous section, or via Hugging Face Hub to host locally (for example, the smaller variant with the id databricks/dolly-v2-3b). It has 12 billion parameters and is available for commercial use under an open-source licence. Pythia-12B served as the base model for Dolly-v2-12B.
2.7 GPT4ALL
GPT4ALL is based on Meta’s LLaMA-1 model with 7B parameters. It is a language model developed by Nomic AI that can be used through the GPT4ALL and Hugging Face Local Pipelines integrations. The model is distributed under an open-source GPL 3.0 licence. However, it is not free to use for commercial purposes; it is offered to researchers for use in their projects and experiments. The previous lecture covered the capabilities and usage of this model.
3 LLM Platforms that can integrate into LangChain
3.1 Cohere
Cohere is a Canadian startup specialising in NLP models that help businesses improve human-machine interactions. Its Cohere xlarge model, with 52 billion parameters, is accessible via an API. Embedding pricing for the API is $1 per 1000 embeddings. The cohere package, which is required to access the API, is simple to install. With LangChain, developers can easily interact with Cohere models by building prompts with input variables and passing them to the Cohere API to generate responses.
3.2 OpenAI
OpenAI is one of the largest companies specialising in large language models. It was the first to bring the effectiveness of LLMs to mainstream attention by launching its conversational interface, ChatGPT. It also offers a wide range of API endpoints at various price points for different NLP tasks. For easy access, the LangChain library offers several classes, such as those used with ChatGPT and GPT-4 in prior articles.
3.3 Hugging Face Hub
Hugging Face is a company that develops natural language processing (NLP) technologies, such as pre-trained language models, and provides a platform for building and using NLP models. The platform hosts over 120k models and 20k datasets. Its Spaces service lets researchers and developers quickly build a demo that showcases their model’s capabilities. Large-scale models such as StableLM by Stability AI, Dolly by Databricks, and Camel by Writer are hosted on the platform. In LangChain, these models are accessed through the HuggingFaceHub class.
Hugging Face also opens up a wide range of models that are designed with Intel CPUs in mind. These models can be used with the accompanying optimisation library with little to no code modification. The library lets networks benefit from Intel’s advanced architectural designs, greatly improving performance on its CPU and GPU lines. For instance, the reported data show a 3.8x speedup when running the BLOOMZ model (text-to-image) on a 4th generation Intel® Xeon® CPU compared with the previous generation, with no changes to the architecture or weights. When the optimisation library was combined with a 4th generation Intel Xeon CPU, the speedup nearly doubled, to 6.5 times the initial inference speed. Whisper and GPT-J are two more well-known models that benefit from these efficiency gains.
3.4 Amazon SageMakerEndpoint
Amazon SageMaker provides infrastructure that makes it simple to host and train machine learning models. It is a high-performance, low-cost environment for experimenting with and using large-scale models. The LangChain library offers a straightforward interface for querying deployed models, so there is no need to write API code to access them. A model is loaded by passing endpoint_name, the model’s unique name in SageMaker, together with credentials_profile_name, the name of the profile to use for authentication.
3.5 Hugging Face Local Pipelines
Hugging Face Local Pipelines is a powerful tool that lets users run Hugging Face models locally through the HuggingFacePipeline class. The Hugging Face Model Hub hosts over 120,000 models, 20,000 datasets, and 50,000 demo apps (Spaces), all open source and publicly accessible, which makes it simple to collaborate on and build machine learning models.
To access these models, users can either use the local pipeline wrapper or call the hosted inference endpoints through the HuggingFaceHub class. The transformers Python package must be installed first. Once it is installed, the model is loaded by specifying the model_id, the task, and any other model parameters. The model can then be plugged into an LLMChain by creating a PromptTemplate and an LLMChain object and passing the input through the chain.
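The steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the import paths follow the LangChain package layout in use when this article was written, the model id gpt2 and the model_kwargs values are illustrative choices rather than recommendations, and the helper name build_local_chain is hypothetical. The first call downloads the model weights.

```python
# Prompt with one input variable; the wording is only an example.
QA_TEMPLATE = "Question: {question}\n\nAnswer: Let's think step by step."


def build_local_chain(model_id: str = "gpt2"):
    """Load a Hugging Face model locally and wrap it in an LLMChain."""
    # Imported lazily so the module loads even without langchain/transformers.
    # Assumption: these import paths match the installed LangChain version.
    from langchain import LLMChain, PromptTemplate
    from langchain.llms import HuggingFacePipeline

    # model_id and task select the model; model_kwargs carries extra parameters.
    llm = HuggingFacePipeline.from_model_id(
        model_id=model_id,
        task="text-generation",
        model_kwargs={"temperature": 0, "max_length": 64},
    )
    prompt = PromptTemplate(template=QA_TEMPLATE, input_variables=["question"])
    return LLMChain(prompt=prompt, llm=llm)
```

Calling build_local_chain().run("What is LangChain?") would then format the question with the template and pass it through the local model.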
3.6 Azure OpenAI
The Azure platform from Microsoft enables access to OpenAI’s models as well.
3.7 AI21
Through its API, AI21 gives customers access to its powerful Jurassic-2 large language models. The Jurassic-2 model available via the API has 178 billion parameters. The API costs only $0.01 per 1,000 tokens, which is quite affordable. With LangChain, developers can readily interact with the AI21 models by creating prompts that take input variables, a straightforward way to benefit from the models’ language processing capabilities.
3.8 Aleph Alpha
Aleph Alpha offers the Luminous series of large language models. The family includes Luminous-base, Luminous-extended, and Luminous-supreme, along with the Luminous-supreme-control variant, which differ in complexity and capability. Aleph Alpha’s pricing is token-based, with a base price per 1000 input tokens for each model: Luminous-base costs 0.03€, Luminous-extended costs 0.045€, Luminous-supreme costs 0.175€, and Luminous-supreme-control costs 0.21875€.
3.9 Banana
Banana is a company focused on machine learning infrastructure that gives developers the tools they need to build machine learning models. To use Banana models from LangChain, install the Banana package, which includes a Python SDK. Two keys are then needed, BANANA_API_KEY and YOUR_MODEL_KEY, both of which can be obtained from the Banana site. Once the keys are set, YOUR_MODEL_KEY is used to create the model object. The Banana model can then be integrated into an LLMChain by creating a PromptTemplate and an LLMChain object and passing the required input through the chain.
3.10 CerebriumAI
CerebriumAI is a great alternative to AWS SageMaker that provides access to a number of LLMs through its API. Whisper, MT0, FlanT5, GPT-Neo, Roberta, Pygmalion, Tortoise, and GPT4All are some of the pre-trained models readily available. Developers create a CerebriumAI instance by supplying the endpoint URL and other relevant parameters such as the maximum length and temperature.
3.11 DeepInfra
DeepInfra is an API that provides a variety of models, including whisper-large, gpt2, dolly-v2-12b, and distilbert-base-multilingual-cased. It runs on A100 GPUs tuned for inference performance and low latency, and it connects to LangChain via API. DeepInfra’s pricing, at $0.0005/second or $0.03/minute, is significantly more reasonable than Replicate’s. DeepInfra offers a one-hour free trial of serverless GPU computing so that users can try out several models.
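Time-based pricing is easiest to compare by converting every rate to the same unit. The sketch below uses only the per-second rates quoted in this article (DeepInfra’s $0.0005 and the $0.0002 to $0.0023 range given for Replicate in the StableLM section); the function and constant names are hypothetical, and actual charges depend on the hardware each model runs on.

```python
DEEPINFRA_USD_PER_SECOND = 0.0005
REPLICATE_USD_PER_SECOND_RANGE = (0.0002, 0.0023)  # low and high end quoted earlier


def per_minute_rate(rate_per_second: float) -> float:
    """Convert a per-second GPU rate into a per-minute rate."""
    return rate_per_second * 60


def job_cost_usd(runtime_seconds: float, rate_per_second: float) -> float:
    """Cost of a job billed purely on GPU runtime."""
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must be non-negative")
    return runtime_seconds * rate_per_second


if __name__ == "__main__":
    low, high = REPLICATE_USD_PER_SECOND_RANGE
    print("DeepInfra per minute:", per_minute_rate(DEEPINFRA_USD_PER_SECOND))
    print("Replicate per minute:", per_minute_rate(low), "to", per_minute_rate(high))
```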
3.12 ForefrontAI
The ForefrontAI platform lets users fine-tune and use a variety of open-source large language models, including GPT-J, GPT-NeoX, T5, and others. The platform has several pricing tiers, including the $29/month Starter tier, which includes 5 million serverless tokens, 5 fine-tuned models, 1 user, and Discord support. With ForefrontAI, developers have access to a variety of models that can be customised to meet their specific needs.
3.13 GooseAI
GooseAI is a fully managed NLP-as-a-Service platform that provides access to models such as GPT-Neo, Fairseq, and GPT-J. Its pricing depends on usage and on the model size. For the 125M model, the base price for up to 25 tokens per request is $0.000035, plus $0.000001 for each additional token. To use GooseAI with LangChain, install the openai package and set the API key, which can be obtained from GooseAI, as an environment variable. Once you have the API key, you can create a GooseAI instance and define a prompt template for question answering. You can then create the LLMChain and run it with a question.
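The tiered pricing for the 125M model can be written as a short formula. The sketch below assumes that the extra $0.000001 is charged for each token beyond the first 25 in a request; that interpretation, and the function name, are assumptions rather than a statement of GooseAI’s billing rules.

```python
BASE_PRICE_USD = 0.000035         # covers up to 25 tokens per request (125M model)
BASE_TOKENS = 25
EXTRA_TOKEN_PRICE_USD = 0.000001  # assumed to apply per token beyond the first 25


def request_cost_usd(generated_tokens: int) -> float:
    """Cost of one request: the base price plus a charge for tokens beyond 25."""
    if generated_tokens < 0:
        raise ValueError("generated_tokens must be non-negative")
    extra_tokens = max(0, generated_tokens - BASE_TOKENS)
    return BASE_PRICE_USD + extra_tokens * EXTRA_TOKEN_PRICE_USD
```

Under this assumption, short requests all cost the base price, and longer ones grow linearly with the number of extra tokens.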
3.14 Llama-cpp
Llama-cpp, a Python binding for llama.cpp, is integrated into the LangChain framework. This integration gives users access to the models that llama.cpp supports, such as LLaMA, Alpaca, GPT4All, Chinese LLaMA/Alpaca, Vigogne (French), Vicuna, Koala, OpenBuddy (Multilingual), Pygmalion 7B, and Metharme 7B, so they can pick whichever suits their language processing needs. By integrating Llama-cpp into LangChain, users can take advantage of these powerful language models and generate human-like, step-by-step answers to their input questions.
3.15 Manifest
Manifest is an integration tool that makes language processing tasks in LangChain easier and more efficient. It acts as a bridge between local Hugging Face models and LangChain, making it simple to access and use these models within LangChain. To use Manifest within LangChain, follow the instructions, which include installing the manifest-ml package and configuring the connection settings. Once connected, LangChain and Manifest can be used together for a complete language processing workflow.
3.16 Modal
Modal is fully integrated with LangChain, adding powerful cloud capabilities to the processing workflow. Although Modal does not provide any language models (LLMs) of its own, it supplies the infrastructure LangChain needs to make use of serverless cloud computing. By integrating Modal into LangChain, users get on-demand access to cloud resources directly from Python scripts running on their local machines. After installing the Modal client library and generating a new token, users can access the Modal server. In the LangChain example, a Modal LLM is instantiated using the endpoint URL and a PromptTemplate is built to format the input. LangChain then runs the LLMChain with the supplied prompt to carry out a language processing task, such as answering a question.
3.17 NLP Cloud
NLP Cloud integrates with LangChain and offers a comprehensive set of high-performance pre-trained and custom models for a wide range of natural language processing (NLP) tasks. These models are built for production use and are accessed via a REST API. Users can carry out NLP tasks such as answering questions by running an LLMChain with the relevant prompt.
3.18 Petals
Petals integrates with LangChain and lets users run language models with more than 100 billion parameters in a decentralised architecture akin to BitTorrent. The corresponding LangChain notebook explains how to integrate Petals into the LangChain workflow. Petals provides a wide variety of language models, and its integration with LangChain improves natural language understanding and generation. Because it operates in a decentralised way, Petals gives users strong language processing capabilities in a distributed setting.
3.19 PipelineAI
PipelineAI is fully integrated with LangChain and lets users scale their machine learning models in the cloud. It also provides API access to a variety of models, including GPT-J, Stable Diffusion, ESRGAN, DALL-E, GPT-2, and GPT-Neo, each with its own capabilities and parameters. Within the LangChain ecosystem, PipelineAI lets users make use of the scalability and power of the cloud for their machine learning workflows.
3.20 PredictionGuard
PredictionGuard integrates with LangChain and gives users a robust wrapper for using language models. The predictionguard and langchain libraries must be installed before PredictionGuard can be used within the LangChain framework. For more complex workflows, PredictionGuard can also be plugged into LangChain’s LLMChain. It improves the LangChain experience by adding an extra layer of security and control over language model outputs.
3.21 PromptLayer OpenAI
PromptLayer is fully integrated with LangChain and gives users more control over, and better management of, their GPT prompt engineering. It acts as a middleman between users’ code and the OpenAI Python library, so that OpenAI API calls can be recorded, tracked, and examined in the PromptLayer dashboard. Using PromptLayer with OpenAI requires installing the ‘promptlayer’ package. By attaching templates to requests, users can evaluate different templates and models in the PromptLayer dashboard.
3.22 Replicate
Replicate offers a large selection of models for diverse purposes and integrates with LangChain. Vicuna-13b, Bark, Speaker-Transcription, Stablelm-Tuned-Alpha-7b, Kandinsky-2, and Stable-Diffusion are a few of the models that Replicate provides. These models cover a wide range of tasks, including text-to-image generation, speaker transcription, generative audio, language generation, and language modelling. Each model has its own features and characteristics, so users can select the one that best suits their requirements. Replicate’s pricing varies with the computing power needed to run each model. Replicate also simplifies deploying custom machine learning models at scale. By integrating Replicate into LangChain, users can interact with these models effectively.
3.23 Runhouse
Runhouse integrates with LangChain and offers powerful remote compute and data management capabilities across environments and users. It lets you host models on your own GPU hardware or use on-demand GPUs from cloud providers such as AWS, GCP, and Azure. Several models can be used with Runhouse in LangChain, including gpt2 and google/flan-t5-small, and users can specify their preferred hardware configuration. Combining Runhouse with LangChain makes it easy to build sophisticated language model workflows, with efficient model execution and collaboration across environments and users.
3.24 StochasticAI
StochasticAI aims to streamline the workflow of deep learning models within LangChain by giving users an efficient and user-friendly environment for model interaction and deployment. It simplifies managing the lifecycle of deep learning models: its Acceleration Platform handles model uploading, versioning, training, compression, and acceleration, which makes deploying models to production easier. Users can interact with StochasticAI models directly within LangChain. The available models include FLAN-T5, GPT-J, Stable Diffusion 1, and Stable Diffusion 2, which together provide a wide range of capabilities.
3.25 Writer
Writer integrates with LangChain, giving users a powerful platform for generating content in a variety of languages. Thanks to this integration, LangChain users can easily work with a variety of models to meet their language generation needs. Palmyra Small (128m), Palmyra 3B (3B), Palmyra Base (5B), Camel (5B), Palmyra Large (20B), InstructPalmyra (30B), Palmyra-R (30B), Palmyra-E (30B), and Silk Road are some of the models that Writer offers. These models provide varying capabilities for language comprehension, generative pre-training, instruction following, and retrieval-augmented generation.
4 Conclusion
It’s understandable to feel choice overload when deciding which of these foundation models and platforms to integrate. That is why this article has described the many options available, so that you can make an informed selection. Depending on your needs, you can choose to host the model locally or use a pay-as-you-go service. The former gives you total control over how the model is implemented, while the latter may be more practical for those with fewer resources. Whatever your preferences, it’s critical to pick the solution that best fits your requirements and budget.
5 Acknowledgements
I’d like to express my thanks to the wonderful LangChain & Vector Databases in Production Course by Activeloop, which I completed, and to acknowledge the use of some images and other materials from the course in this article.