Running GenAI on your local machine

Published on May 16, 2024

If you are interested in exploring Generative AI without relying on cloud services, Ollama can run open models entirely locally, giving you a chance to explore GenAI APIs and capabilities.

Getting started

First off, install Ollama, for example, with Homebrew on macOS:

brew install ollama

Alternatively, download Ollama and follow the installation instructions. Once installed, start the Ollama server:

ollama serve

This allows your local machine to run GenAI models that you want to use.

To interact with a GenAI model, run the client specifying which model you'd like to use:

ollama run llama3

Replace llama3 with the name of the model of your choice.

Summarise or rewrite content

You can take input from local files, perhaps summarise a file:

ollama run llama3 "summarise this content in a paragraph" < content.md

Or ask for the content to be translated:

ollama run llama3 "translate this content into French" < content.md

Or proofread the content:

ollama run llama3 "proof read this file, "\
  "don't reword it, just show me grammatical errors, "\
  "punctuation errors and typos, ignore the frontmatter "\
  "at the beginning, ignore line breaks" < content.md
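The same pattern works against Ollama's HTTP API rather than the CLI. As a sketch (assuming `ollama serve` is running on the default port 11434, and that `content.md` exists), you can build the JSON payload with jq, embedding the file content in the prompt, and POST it to the generate endpoint:

```shell
# Build a generate-API payload from a local file, then POST it.
# jq -Rs slurps the raw file into a single JSON string.
jq -Rs '{model: "llama3",
         prompt: ("summarise this content in a paragraph:\n\n" + .),
         stream: false}' < content.md |
  curl -s http://localhost:11434/api/generate -d @- |
  jq -r '.response'
```

This is handy when you want to script against the model from another tool rather than typing at the CLI.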

Models, quality and performance

Running GenAI locally quickly runs into the challenge of performance and what your hardware is capable of. The llama3:latest model is the 8B version, i.e. the version that has been created with 8 billion parameters.

❯ ollama list
NAME            ID              SIZE    MODIFIED
llama3:latest   a6990ed6be41    4.7 GB  6 hours ago

The number of parameters is an indication of the size and complexity of the model. The 8B version has a file size of less than 5GB, downloads in a matter of minutes and runs at a reasonable speed in a few GB of memory.
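The file sizes line up roughly with the parameter counts. As a back-of-the-envelope check (assuming the default Ollama tags are 4-bit quantised, at roughly 4.5 bits per parameter once quantisation scales and overhead are included — my estimate, not an Ollama-documented figure):

```shell
# Rough model file size: parameters x bits-per-parameter / 8 bytes.
# The ~4.5 bits/param figure is an assumption for 4-bit quantised
# weights plus overhead.
awk 'BEGIN {
  printf "8B:  %.1f GB\n",  8e9 * 4.5 / 8 / 1e9
  printf "70B: %.1f GB\n", 70e9 * 4.5 / 8 / 1e9
}'
```

That lands in the right ballpark for the 4.7GB and 39GB files Ollama reports for the two llama3 tags.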

We can see how fast this is by looking at the response from a call to the generate API:

curl http://localhost:11434/api/generate -d '{
    "model": "llama3",
    "prompt": "tell me a story of 100 words",
    "stream": false
  }' | jq '
    del(.response,.context) *
    { tokens_per_second : (.eval_count*pow(10;9)/.eval_duration) }'

Which for me returns:

{
  "model": "llama3",
  "created_at": "2024-05-16T20:50:16.174569Z",
  "done": true,
  "done_reason": "stop",
  "total_duration": 4466911333,
  "load_duration": 1098708,
  "prompt_eval_duration": 155598000,
  "eval_count": 131,
  "eval_duration": 4309431000,
  "tokens_per_second": 30.398444713466812
}

On a Mac M1 Pro with 16GB, this is a response generation rate of about 30 tokens per second, which is OK, albeit not great, from a speed perspective.
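The tokens_per_second field is just eval_count divided by eval_duration, which the API reports in nanoseconds. The arithmetic can be checked standalone against the numbers in the response above:

```shell
# Reproduce the tokens-per-second calculation: 131 tokens generated
# over 4,309,431,000 ns of evaluation time.
echo '{"eval_count": 131, "eval_duration": 4309431000}' |
  jq '.eval_count * pow(10; 9) / .eval_duration'
# → 30.398444713466812
```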

The llama3 70B model, on the other hand, is much larger at 39GB:

❯ ollama list
NAME            ID              SIZE    MODIFIED
llama3:70b      be39eb53a197    39 GB   18 minutes ago

This does not have a chance of running on my local hardware - CPU and memory usage ran high when I tried and the ollama server did not respond. Perhaps I need an Alienware Aurora R16 equipped with an NVIDIA RTX 4070, or a high-spec Apple Mac Studio, to run that model locally.

The LMSYS Chatbot Arena Leaderboard gives an indication of the quality of the different models. At the time of writing, Llama-3-8b is ranked 17, Llama-3-70b is ranked 7, and GPT-4o-2024-05-13 is ranked 1, based on the Arena Elo rating, a score derived from head-to-head battles between models. It is worth noting the rate of improvement for the models. Llama-3-8b-Instruct, which runs comfortably on a decent laptop, ranks higher than GPT-3.5-Turbo-0613, which is ranked 29 and was the GPT-3.5 snapshot from June 13th 2023. GPT-4, its successor, was released on March 14th 2023.

Full control for your AI experiments

Ollama provides an accessible and flexible platform for exploring Generative AI locally, without relying on cloud services. With its CLI, you can experiment with AI-generated text and the range of available GenAI models in a matter of minutes. We may not be able to run the larger models locally without specialised hardware, but running against smaller models can help with learning and prototyping ideas, and the quality may be acceptable for some basic tasks.