AI & ML // July 13, 2024 // 8 min read

Unleashing AI Power at Your Fingertips: A Developer’s Guide to Running LLMs Locally

balakumar Senior Software Engineer

Running Large Language Models (LLMs) locally has become practical for everyday development work. With a local model, your prompts and data never leave your machine, there are no recurring subscription costs, and you are free to customize models to your specific needs. Local inference also works offline and avoids network round trips, although actual response speed depends on your hardware and the size of the model. Whether you're looking to boost productivity, protect sensitive data, or simply explore current AI capabilities without breaking the bank, running LLMs locally is an option every developer should consider.

The Platform: Ollama - Your Local LLM Powerhouse

Ollama remains the go-to platform for running LLMs locally, and here's why:

Ease of Use: With just a few commands, you can have a powerful LLM up and running on your machine.
Model Variety: Ollama supports a wide range of cutting-edge models.
Performance: Optimized for local execution, providing impressive speed even on consumer-grade hardware.
Active Community: A vibrant ecosystem of resources, tutorials, and support.

Get started with Ollama by visiting their official website.
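
Once Ollama is installed and running, a quick way to confirm the setup is to ask the local server which models it has. The sketch below assumes Ollama's default address (http://localhost:11434) and its /api/tags endpoint; the function names are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434"


def parse_model_names(tags_payload: dict) -> list:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list:
    """Ask the local Ollama server which models have been pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))
```

Calling list_local_models() with the server running should return the same names that `ollama list` prints; a connection error usually means the server is not started.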

Choosing the Right Model

When it comes to model selection, Ollama offers a diverse range of options to suit various needs:

Llama 3: The latest iteration of the Llama series, offering improved performance and capabilities over Llama 2.
Llama 3 Gradient: A variant of Llama 3 with an impressive 1 million token context window, ideal for processing large documents or conversations.
Gemma 2: Another excellent general-purpose model worth considering.
Codestral: Mistral AI's code model, currently one of the best options for coding-related tasks.
Dolphin: Known for its balanced performance across various tasks.
Mistral: A powerful model that excels in reasoning and language understanding.
Phi-2: Microsoft's compact yet capable model, great for resource-constrained environments.
Orca 2: Microsoft's model known for its strong reasoning capabilities.
Solar: Upstage's model, offering a good balance of performance and efficiency.
Neural Chat: A versatile model suitable for conversational AI applications.

Remember, smaller model sizes (e.g., 7B parameters) generally run faster than larger ones (e.g., 70B parameters), so consider your hardware capabilities and specific needs when choosing. Additionally, some models like Llama 3 Gradient offer extended context windows, which can be crucial for certain applications.
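
To make the size trade-off concrete, here is a back-of-the-envelope estimate of how much memory the weights alone need. It is a rough sketch: it assumes every weight uses the same number of bits and ignores the context cache and runtime overhead, so real model files will differ somewhat. The function name is made up for this example.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes).

    Weights only: the context (KV) cache and runtime buffers need extra memory.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Compare a 7B and a 70B model at 16-bit and 4-bit precision.
for params in (7e9, 70e9):
    for bits in (16, 4):
        size = estimate_weight_memory_gb(params, bits)
        print(f"{params / 1e9:.0f}B parameters at {bits}-bit: about {size:.1f} GB")
```

By this estimate a 7B model at 4 bits fits comfortably on most recent laptops, while a 70B model needs tens of gigabytes even when quantized.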

Browser Plugins: AI at Your Fingertips

Integrate your local LLM into your browsing experience:

BSummarizer (MY PLUGIN!): This powerful plugin summarizes web pages and YouTube videos with just a click. Link to my plugin
WebChatGPT: While primarily designed for ChatGPT, it can be configured to use Ollama's API, allowing you to interact with your local models directly from your browser. Chrome Web Store
Ollama Companion: A browser extension that integrates Ollama into your web browsing experience, allowing you to interact with your local models for various tasks. GitHub Repository
LocalAI Web UI: While not a browser plugin per se, this lightweight web interface can be accessed through your browser to interact with your local LLM powered by Ollama. GitHub Repository
Ollama Web UI: Another web-based interface for Ollama that can be accessed through your browser, offering a user-friendly way to interact with your local models. GitHub Repository

These tools allow you to harness the power of your local LLMs directly within your browser, enhancing your web browsing experience with AI-powered features while maintaining the privacy and control offered by local models.

VS Code: Your AI-Powered Coding Companion

Enhance your VS Code experience with these Ollama-compatible extensions:

ollama.vscode: Official Ollama extension for VS Code. VS Code Marketplace
Continue: AI-powered code completion and generation. VS Code Marketplace

IntelliJ: Turbocharge Your Java Development

For IntelliJ IDEA users, there are several plugins that support integration with Ollama and local LLMs:

BChat (MY PLUGIN!): Interacts with Ollama for chat interface, commit message generation, test code generation, and more. Link to my plugin
AI Assistant: Connects to local LLMs, including those run through Ollama, for code completion and generation. JetBrains Marketplace
LocalAI: Allows you to use your local LLM models directly within IntelliJ IDEA. JetBrains Marketplace
Ollama AI Assistant: Specifically designed to work with Ollama, offering code completion, explanation, and generation features. JetBrains Marketplace
CodeGPT: While primarily known for its OpenAI integration, it can be configured to work with local LLMs through Ollama. JetBrains Marketplace

These plugins offer a range of features from simple code completion to complex code generation and refactoring, all powered by your local LLM through Ollama.

General IDE Plugins: AI for Every Editor

Cross-platform options for AI-assisted coding:

Tabby: Open-source AI coding assistant with local LLM support. Official Website
Codeium: AI coding assistant configurable for local LLMs. Official Website

Mac Applications: Native AI Integration

For Mac users seeking seamless integration:

OllaMac: Native macOS app for interacting with Ollama models. GitHub Repository
BHelper (My App!): Native macOS app that uses artificial intelligence to help you write better and faster! BHelper lives in your Mac's menu bar and allows you to instantly rewrite any selected text using powerful AI models from OpenAI, Google, Anthropic, or even run models locally with Ollama. Link to my app
MacGPT: Configurable for local LLMs, offering system-wide shortcuts. Official Website

DIY Model Fine-tuning with Ollama

Ollama provides a straightforward way to create and use custom models based on existing ones. While it's not traditional fine-tuning in the machine learning sense, it allows you to adapt models to specific use cases.

Creating a Custom Model

Create a Modelfile

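Create a text file named Modelfile (no extension) with the following structure:

FROM llama2

SYSTEM """You are a helpful AI assistant specialized in programming."""

TEMPLATE """[INST] {{.System}} {{.Prompt}} [/INST]"""

MESSAGE user How do I reverse a list in Python?
MESSAGE assistant Use reversed(items) for an iterator, or items[::-1] for a reversed copy.

Customize the Modelfile
- FROM: Specify the base model (e.g., llama2, codellama, mistral)
- SYSTEM: Set a custom system message to guide the model's behavior
- TEMPLATE: Define how prompts should be formatted
- MESSAGE: Seed the conversation with example user and assistant turns (optional). The Modelfile format has no INCLUDE instruction for pulling in external files, so any reference material has to be written into the SYSTEM or MESSAGE text.

Create Your Model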

Run the following command in the terminal:
```
ollama create mymodel -f ./Modelfile
```
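
If you create custom models often, the Modelfile can be generated from a script. The following is a minimal sketch, assuming the `ollama` CLI is on your PATH; build_modelfile and create_model are hypothetical helper names, and only the FROM, SYSTEM, and PARAMETER instructions are emitted.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_modelfile(base_model: str, system_prompt: str,
                    parameters: Optional[dict] = None) -> str:
    """Render Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base_model}", f'SYSTEM """{system_prompt}"""']
    for name, value in (parameters or {}).items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


def create_model(name: str, modelfile_text: str, workdir: Path) -> None:
    """Write the Modelfile to workdir and run `ollama create` on it."""
    path = workdir / "Modelfile"
    path.write_text(modelfile_text, encoding="utf-8")
    # check=True raises CalledProcessError if the CLI reports a failure.
    subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)
```

For example, create_model("mymodel", build_modelfile("llama2", "You are a helpful AI assistant specialized in programming."), Path(".")) would write the file and build the model in one step.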

Example: Creating a Programming Assistant

Create a Modelfile:

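FROM codellama

SYSTEM """You are an AI programming assistant specialized in Python. Provide concise, efficient, and well-commented code examples."""

TEMPLATE """[INST] {{.System}} {{.Prompt}} [/INST]"""

MESSAGE user Write a function that checks whether a number is even.
MESSAGE assistant def is_even(n): return n % 2 == 0

Add more MESSAGE pairs with representative Python questions and answers to show the model the style you expect. The Modelfile format cannot load examples from a separate file, so they have to be written inline.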

Create the model:

ollama create python-assistant -f ./Modelfile

Use your custom model:

ollama run python-assistant "Write a Python function to calculate fibonacci numbers"
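
The same custom model can also be called from code through Ollama's local HTTP API. This is a minimal sketch that assumes the default server address and the /api/generate endpoint with streaming turned off; the helper names are illustrative.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        GENERATE_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["response"]
```

With the server running, generate("python-assistant", "Write a Python function to calculate fibonacci numbers") should return the same kind of answer as the command above.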

Best Practices

Choose the right base model for your use case.
Craft clear instructions in the SYSTEM prompt.
Provide relevant examples as MESSAGE pairs or inside the SYSTEM prompt.
Iterate and refine based on model performance.

Advanced Techniques

Parameter Adjustment:

PARAMETER stop "[INST]"
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
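
These parameters do not have to be baked into a Modelfile. As a sketch, the same settings can be sent per request in the "options" field of Ollama's /api/generate payload; the helper names below are made up for illustration, and the defaults simply mirror the values above.

```python
def build_options(temperature: float = 0.7, num_ctx: int = 4096,
                  stop: tuple = ("[INST]",)) -> dict:
    """Collect sampling and context settings for a single request."""
    return {"temperature": temperature, "num_ctx": num_ctx, "stop": list(stop)}


def build_payload(model: str, prompt: str, options: dict) -> dict:
    """Assemble a non-streaming /api/generate payload with per-request options."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


# Example: a lower temperature for more deterministic code answers.
payload = build_payload("llama2", "Explain list comprehensions.",
                        build_options(temperature=0.2))
```

Per-request options are handy for experimenting; once you settle on values, moving them into the Modelfile keeps every client consistent.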

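Combining Models:

The Modelfile format has no instruction for merging several base models; each Modelfile takes a single FROM. If you merge models with external tooling, you can import the resulting GGUF file by pointing FROM at it (the path below is a placeholder):

FROM ./merged-model.gguf

Custom Tokenization:

There is no tokenizer instruction either. The tokenizer is bundled with the model weights, so a model with a custom vocabulary has to be built outside Ollama and then imported through FROM in the same way.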

Limitations and Considerations

This method doesn't retrain the model's weights but sets up specific instructions and context.
Effectiveness depends on the quality of your system prompt and example messages.
Long system prompts and many example messages take up part of the context window on every request.

By using Ollama's custom model creation, you can tailor existing models to your specific needs without the computational overhead of traditional fine-tuning.

Advanced LLM Integration

Push the boundaries of local LLM usage:

Custom ChatBots: Build interfaces using Gradio (Official Website) or Streamlit (Official Website).
API Wrappers: Create Python or Node.js wrappers for easy integration into your applications.
CI/CD Integration: Incorporate LLMs into your development pipeline for code review, documentation, and test case generation.
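
As one possible shape for such an API wrapper, the sketch below keeps conversation history and talks to Ollama's /api/chat endpoint. The class name and the injectable transport argument are design choices made for this example (the transport makes the wrapper easy to test without a running server); they are not part of Ollama.

```python
import json
import urllib.request


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)


class OllamaChat:
    """Tiny chat wrapper that remembers the conversation so far."""

    def __init__(self, model, system=None,
                 base_url="http://localhost:11434", transport=post_json):
        self.model = model
        self.url = f"{base_url}/api/chat"
        self.transport = transport
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, text: str) -> str:
        """Send one user turn, record the reply, and return its text."""
        self.messages.append({"role": "user", "content": text})
        payload = {"model": self.model, "messages": list(self.messages),
                   "stream": False}
        reply = self.transport(self.url, payload)["message"]
        self.messages.append({"role": reply["role"], "content": reply["content"]})
        return reply["content"]
```

A chatbot UI or a CI step (for example, one that asks for a review of a diff) can then call OllamaChat("llama3").ask(...) without dealing with HTTP details.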

Optimizing Performance

To get the most out of your local LLM setup:

Hardware Acceleration: Utilize GPU acceleration when available for significantly faster inference.
Model Quantization: Use quantized models (e.g., 4-bit or 8-bit) for reduced memory usage and faster inference on less powerful hardware.
Prompt Engineering: Craft efficient prompts to get the most relevant and concise responses, reducing processing time.
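
To check whether a change actually helps, it is useful to measure generation speed. A non-streaming reply from Ollama's /api/generate includes eval_count (tokens generated) and eval_duration (in nanoseconds); the small sketch below turns them into tokens per second. The function name is illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama /api/generate reply."""
    eval_count = response["eval_count"]          # number of generated tokens
    eval_duration_ns = response["eval_duration"]  # generation time in nanoseconds
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count * 1e9 / eval_duration_ns
```

Comparing this number before and after switching to a quantized model or enabling GPU acceleration gives a concrete sense of the gain on your own hardware.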

Ethical Considerations and Best Practices

As we embrace local LLMs, it's crucial to consider:

Data Privacy: While local LLMs offer improved privacy, be mindful of the data you feed into them.
Bias Mitigation: Be aware of potential biases in model outputs and implement checks and balances in your workflows.
Continuous Learning: Stay updated with the latest developments in LLM technology and ethical AI practices.

The Future of Local LLMs

Looking ahead, we can expect:

Improved Efficiency: Future models will likely offer better performance on consumer hardware.
Specialized Models: More domain-specific models optimized for particular tasks or industries.
Enhanced Integration: Deeper integration with development tools and operating systems.

Conclusion: Embracing the Local LLM Revolution

Running LLMs locally is more than a trend; it's a shift in how we interact with AI as developers. It offers stronger privacy, no network round trips, predictable costs, and the ability to customize models to your specific needs. As these models continue to evolve and local hardware becomes more powerful, the possibilities are truly exciting.

So, fire up Ollama, install some plugins, and start exploring the world of local LLMs. The power to transform your development workflow is quite literally at your fingertips, all while maintaining control over your data and resources.

Happy coding, and may your prompts be ever insightful!