Mac Studio + Local LLMs: Bringing AI In-House

A quick and effective way to host and start using AI on local hardware.

AI is at the forefront of discussions, articles, and new technology everywhere you turn; it’s almost impossible to go a day without hearing about it. Being in the security industry, we are constantly asked to answer the question “Is this safe?” My personal stance has been to use public tools like ChatGPT, Claude Code, and Cursor only with information you deem non-sensitive and non-confidential. Any SaaS or non-local deployment should be heavily evaluated before you trust it with your data, and in today’s world that keeps getting harder. Regardless, AI is interesting and appears to have some practical applications, and I would love to safely leverage this technology on sensitive data sets.

Thankfully, there are a solid number of reputable solutions for deploying AI in-house, and getting a usable setup running takes no time at all.

Objectives and Requirements

  • Build a local AI Instance.

  • Provide a seamless way to integrate Local AI into the team’s workflow.

  • Trust the underlying software and infrastructure with sensitive data workloads.

  • Plan for resource usage: AI is resource-intensive, and RAM in particular determines the size of models (number of parameters) you’re able to run (a rough sizing sketch follows this list).
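As a rough back-of-the-napkin sizing sketch (a general rule of thumb, not an exact formula): the model weights alone take roughly parameter count × bytes per weight, where bytes per weight depends on the quantization, plus headroom for the KV cache and runtime.

# Rough weight-memory estimate: parameters (in billions) × bytes per weight ≈ GB of weights
echo "70 * 0.5" | bc    # a 70B model at ~4-bit quantization (~0.5 bytes/weight) ≈ 35 GB
echo "70 * 1.0" | bc    # the same model at ~8-bit (~1 byte/weight) ≈ 70 GB
# Add headroom on top for the KV cache and runtime overhead.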

Hardware

To keep this short, I’m not going to dive too deep into hardware, but we bounced between several options before landing on our solution.

I considered our options to be the following:

| Option | Hardware | Cost |
| --- | --- | --- |
| Build or Purchase a GPU Server | 4× 5090s = 128 GB of VRAM | ~$30K |
| Build a Cluster of Mini PCs | 4× Framework Desktops = 512 GB of RAM | ~$9K |
| Buy a Mac Studio | M3 Ultra (~800 GB/s memory bandwidth), 512 GB of RAM | ~$10K |

For the sake of availability and the ease of getting up and running (no clustering needed for an MVP), we went with the Mac Studio. Needless to say, this thing is a monster.

Mac Studio hardware specs

Ollama Introduction

You will need some way to interact with the LLMs; typically, this is done through an LLM inference tool. Ollama happened to be the first tool I personally started playing with, and it made it trivial to pull LLMs and start interacting with them.

After downloading and installing the application, you will have an Ollama service running. Pop open a terminal and type ollama. You really only need to know three commands:

  •  ollama pull <model>

  •  ollama run <model>

  •  ollama ps
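For example, a first session might look like this (the model tag is just an example; any model from the Ollama library works the same way):

ollama pull llama3.1:8b    # download the model weights to local disk
ollama run llama3.1:8b     # drop into an interactive chat with the model
ollama ps                  # list the models currently loaded in memory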

Ollama help command

Performance

After getting Ollama installed and waiting for some big LLMs to pull down, we were up and running. Storage is something you will need to consider, but that’s typically an easy fix; thankfully, this machine has 4 TB of storage.

Let’s pull down an LLM:

ollama pull gpt-oss:20b

Now we can interact with the LLM:

ollama run gpt-oss:20b "Provide a summary of the plot of Romeo and Juliet"

And it’s that easy.

Downloading gpt-oss:120b LLM

LLMs loaded on disk

I was pleasantly surprised that we were able to run DeepSeek with ease; based on the numbers (deepseek-r1:671b = 404 GB on disk), I was not expecting that. To save some keystrokes, I implemented a couple of simple aliases:

alias oll='ollama run llama3.1:70b'
alias olg='ollama run gpt-oss:120b'
oll "Write a detailed product review for a smartphone, including sections on design, performance, camera, battery life, and overall conclusion. Make it approximately 500 words."

| Model | Size on Disk | Size in Memory | Tokens / Second |
| --- | --- | --- | --- |
| gpt-oss:20b | 13 GB | 14 GB | 95 |
| gpt-oss:120b | 65 GB | 67 GB | 67.55 |
| llama3.1:70b | 42 GB | 44 GB | 13 |
| llama3.1:405b | 243 GB | 247 GB | 3.03 |
| deepseek-r1:70b | 42 GB | 44 GB | 12 |
| deepseek-r1:671b | 404 GB | 426 GB | 17.32 |

Overall, in our testing, 15 tokens per second is very usable, especially when it’s part of a workflow and you’re just waiting for a process or job to finish. That said, the speed and quality of the new gpt-oss models have been especially impressive.
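If you want to gather your own numbers, Ollama can print timing statistics for a run. As a quick sketch (the prompt is arbitrary, and the exact output format may vary by version), the --verbose flag reports evaluation rates:

ollama run gpt-oss:20b --verbose "Explain the difference between symmetric and asymmetric encryption in two sentences."
# The stats printed after the response include prompt eval rate and eval rate (tokens/second).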

Remote Integration

Running things locally is great and definitely usable, but we need something we can easily integrate into workflows and that gives multiple users quick and easy access. Thankfully, Ollama makes it extremely easy to set up remote clients.

Find your way to Ollama’s settings and enable remote access.

Setting to expose the Ollama API on port 11434

This opens port 11434 to your local network, so consider the exposure carefully: Ollama does not provide authentication for its API.

To access the API, install Ollama on a client device that has access to the local network and set the environment variable OLLAMA_HOST=<REMOTEIP>:11434.

Running Ollama locally
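A minimal client-side sketch might look like the following (the model, prompt, and JSON payload are just examples; <REMOTEIP> remains a placeholder for the Mac Studio’s address):

export OLLAMA_HOST=<REMOTEIP>:11434    # point the local Ollama client at the Mac Studio
ollama run gpt-oss:120b "Summarize the OWASP Top 10 in one paragraph"

# Scripts can also hit the REST API directly:
curl http://<REMOTEIP>:11434/api/generate -d '{"model": "gpt-oss:120b", "prompt": "Say hello", "stream": false}'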

Next Steps

So what do we do now? We have the ability to remotely access AI from our workstations, all while keeping data processing on trusted local hardware. It’s as simple as opening a terminal and typing ollama run gpt-oss:120b "Summarize the following code snippet…". But this is essentially a glorified Google search that happens to be local. Cool, but we can do better. Ollama has a nice feature where you can create Modelfiles and call the resulting model instead of the raw LLM. This allows us to quickly start playing with different prompts.

For example, let’s build an “agent” that will QA documents. Thankfully, a coworker of mine (who shall remain unnamed) already created a tool that does just this, but for use with ChatGPT. Let’s reuse that system prompt.

Create a Modelfile. (Note: this is not the complete SYSTEM prompt, but it still provides decent output.)

FROM gpt-oss:120b
# sets the temperature to 0 [higher is more creative, lower is more coherent]
PARAMETER temperature 0
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096

# sets a custom system message to specify the behavior of the chat assistant
SYSTEM """You are an advanced technical writing assistant that thoroughly checks text for grammar, spelling, capitalization, punctuation, proper sentence structure, paragraph length and technical correctness. Your task is to analyze the provided text and suggest corrections only where necessary. For each correction, present the output in the following structured format:

1. Index Number: Assign a sequential number to each suggestion.
2. Original: Display the original sentence or paragraph exactly as it appears.
3. Change to: Show the corrected version of the sentence or paragraph.
4. Explanation: Clearly explain why each change was made, specifying grammar, spelling, capitalization, punctuation, or paragraph structure issues.

Rules:
- Do not include sentences in the output if no corrections are needed. If a sentence is already correct, ignore it completely.
- Do not state that no changes were needed. If a sentence does not require correction, exclude it from the output entirely.
"""

Once the file is saved, you need to create a model based on the Modelfile:

ollama create <NewName> -f <ModelFile>

Creating the model based on the model file with our system prompt.
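As a concrete (hypothetical) example, if the Modelfile above were saved as Modelfile.qa, creating and verifying the model might look like this; the name doc-qa is arbitrary:

ollama create doc-qa -f ./Modelfile.qa    # build the custom model from the Modelfile
ollama list                               # the new model appears alongside the base LLMs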

And we can now see our QA model listed in the available models.

New QA model listed

Now let’s test the model from our local workstation.

Using our new QA model from a local workstation
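From a workstation pointed at the Mac Studio, a QA pass over a draft might look something like this (the model name and file are illustrative):

OLLAMA_HOST=<REMOTEIP>:11434 ollama run doc-qa "$(cat blog-draft.md)"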

For other system prompt ideas, GitHub is littered with them, but a good starting point could be Daniel Miessler’s fabric tool.
https://github.com/danielmiessler/Fabric/tree/main/data/patterns

Conclusion and Considerations

Now this isn’t anything fancy, and there are plenty of other options to achieve the same result, but this was simple, effective, and allows us to start building on and extending automation with the use of local AI.

I have had a good amount of success with the new gpt-oss models. If you’re the only one loading a model, or you’re implementing safeguards that only load one instance of a model in RAM, you can get away with significantly less RAM. And in case you don’t have access to 512 GB of RAM, there are plenty of models that you can run on consumer-level laptops and even embedded devices.
https://ollama.com/search

There is still work to do. Like anything, there are shortcomings that will need to be addressed.

  • API Authentication is not native.

  • Job management

    • 512 GB of RAM is a lot, but depending on the workflows, you could run into issues where multiple users request multiple models and either crash the service or cause current jobs to run endlessly at 0 tokens per second (see the sketch after this list).

  • Implementation of RAG to allow for larger data sets to be queried.

  • Likely not the most ideal training rig.

  • Hardening of the system, specifically inbound and outbound communication, whether that’s done via built-in software or external hardware.
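One partial mitigation for the job-management concern, assuming the server-side environment variables documented by Ollama (worth verifying against the version you run), is to limit how many models and parallel requests the service will hold in memory at once:

# On the Mac Studio (macOS app), set server limits via launchctl; values are examples
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1    # keep at most one model resident in RAM
launchctl setenv OLLAMA_NUM_PARALLEL 2         # cap concurrent requests per loaded model
launchctl setenv OLLAMA_KEEP_ALIVE 10m         # unload idle models after 10 minutes
# Restart the Ollama app for the new settings to take effect.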

Ollama was really just the tool that provided everything we needed for an MVP; there is other inference tooling we’ve been playing around with as well.

Regardless of the tooling you choose, the possibilities for automation are vast. Stay tuned for future projects and workflows utilizing this hardware.