Locally Running GenAI and Large Language Models with Ollama
If you are interested in Generative AI but don't want to rely on cloud services, Ollama can run open models entirely locally, giving you a chance to explore GenAI APIs and capabilities.
Getting started
First off, install Ollama, for example, with Homebrew on macOS:
brew install ollama
Alternatively, download Ollama and follow the installation instructions. Once installed, start the Ollama server:
ollama serve
This starts the server that runs GenAI models on your local machine.
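To check the server is up, you can curl the default port (11434, unless you have configured it otherwise); it should respond with a short status message:
curl http://localhost:11434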
To interact with a GenAI model, run the client specifying which model you'd like to use:
ollama run llama3
Replace llama3 with the name of the model of your choice.
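Running a model pulls it first if it is not already on your machine; you can also download it explicitly ahead of time:
ollama pull llama3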
Summarise or rewrite content
You can take input from local files, for example to summarise a file:
ollama run llama3 "summarise this content in a paragraph"< content.md
Or ask for the content to be translated:
ollama run llama3 "translate this content into French" < content.md
Or proofread the content:
ollama run llama3 "proof read this file, "\
"don't reword it, just show me grammatical errors, "\
"punctuation errors and typos, ignore the frontmatter "\
"at the beginning, ignore line breaks" < content.md
Models, quality and performance
Running GenAI locally quickly runs into the challenge of performance and what your hardware is capable of. The llama3:latest model is the 8B version, i.e. the version that has been created with 8 billion parameters.
❯ ollama list
NAME             ID              SIZE      MODIFIED
llama3:latest    a6990ed6be41    4.7 GB    6 hours ago
The number of parameters is an indication of the size and complexity of the model. The 8B version has a file size of less than 5GB, downloads in a matter of minutes and runs at a reasonable speed in a few GB of memory.
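To inspect how a local model is configured, ollama show can print its Modelfile (the exact output depends on your Ollama version):
ollama show llama3 --modelfile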
We can see how fast this is by looking at the response from a call to the generate API:
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "tell me a story of 100 words",
"stream": false
}' | jq '
del(.response,.context) *
{ tokens_per_second : (.eval_count*pow(10;9)/.eval_duration) }'
For me, this returns:
{
"model": "llama3",
"created_at": "2024-05-16T20:50:16.174569Z",
"done": true,
"done_reason": "stop",
"total_duration": 4466911333,
"load_duration": 1098708,
"prompt_eval_duration": 155598000,
"eval_count": 131,
"eval_duration": 4309431000,
"tokens_per_second": 30.398444713466812
}
On a Mac M1 Pro with 16GB, this is a response generation rate of 30 tokens per second, which is OK, albeit not great, from a speed perspective.
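To see how much memory a loaded model is taking up, and whether it is running on the GPU or the CPU, recent versions of Ollama include a ps command:
ollama ps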
The llama3 70B model, on the other hand, is much larger at 39GB:
❯ ollama list
NAME          ID              SIZE     MODIFIED
llama3:70b    be39eb53a197    39 GB    18 minutes ago
This does not have a chance of running on my local hardware: CPU and memory usage ran high when I tried, and the Ollama server did not respond. Perhaps I need an Alienware Aurora R16 equipped with an NVIDIA RTX 4070, or a high-spec Apple Mac Studio, to run that model locally.
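If you do download a large model and decide it is not going to work for you, you can remove it again to free up disk space:
ollama rm llama3:70b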
The LMSYS Chatbot Arena Leaderboard gives an indication of the quality of the different models. At the time of writing, Llama-3-8b is ranked 17, Llama-3-70b is ranked 7, and GPT-4o-2024-05-13 is ranked 1, based on the Arena Elo rating. The Arena Elo rating is a score based on battles between pairs of models. It is worth noting the rate of improvement of the models: Llama-3-8b-Instruct, which runs comfortably on a decent laptop, ranks higher than GPT-3.5-Turbo-0613, which is ranked 29 and was the GPT-3.5 snapshot from June 13th 2023. GPT-4, its successor, was released on March 14th 2023.
Full control for your AI experiments
Ollama provides an accessible and flexible platform for exploring Generative AI locally, without relying on cloud services. With its CLI, you can experiment with AI-generated text and the range of available GenAI models in a matter of minutes. We may not be able to run the larger models locally without specialised hardware, but the smaller models can help with learning and prototyping ideas, and their quality may be OK for some basic tasks.