Ollama Environment Variables: Configuration Settings and Syntax

Ever accidentally expose your Ollama server to the network with one mis-set env var?
Environment variables are how Ollama controls server binding, model paths, GPU selection, memory rules, and proxies: no code changes, just key-value pairs and a restart.
That makes them powerful, but also a common source of outages and wasted time.
This post is a compact reference: what each Ollama variable does, the exact syntax for Linux/macOS/Windows, and the usual gotchas so you can tweak settings confidently and avoid surprises after a restart.

Complete Reference of Ollama Environment Variables and Their Functions

Ollama leans on environment variables to configure pretty much everything: server behavior, where models live, which network interfaces it listens on, GPU picks, memory rules, proxy routing. You don’t touch code. You just set a key-value pair, restart Ollama, and the new config kicks in.

The docs group settings into a few buckets. Server config (bind address, model directory). GPU config (CUDA, ROCm, Vulkan device selection). Model behavior (context size, attention tricks, cache precision). Advanced debugging stuff (logs, remote registries, experimental engines). And proxy config for corporate networks.

Every variable change needs a full restart. On Linux, that’s systemctl daemon-reload followed by systemctl restart ollama.service after you edit the unit file. On macOS, quit and relaunch the app after running launchctl setenv. On Windows, quit Ollama from the system tray and reopen it after updating variables through the GUI or setx.

Here’s what you’ll configure most often:

Server networking and binding – OLLAMA_HOST, OLLAMA_ORIGINS, default bind is 127.0.0.1, switch to 0.0.0.0 for network access
Model storage paths – OLLAMA_MODELS defaults to ~/.ollama/models on macOS/Linux, C:\Users\<username>\.ollama\models on Windows
Concurrency and queue – OLLAMA_NUM_PARALLEL (defaults to 1–4 dynamically), OLLAMA_MAX_QUEUE (512), OLLAMA_MAX_LOADED_MODELS (3× GPU count or 3 for CPU)
Keep-alive memory policy – OLLAMA_KEEP_ALIVE defaults to “5m”, accepts “10m”, “24h”, 3600 (seconds), -1 (keep loaded indefinitely), 0 (unload immediately after the response)
GPU device selection – CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, GGML_VK_VISIBLE_DEVICES
GPU tuning – OLLAMA_GPU_OVERHEAD, OLLAMA_SCHED_SPREAD, OLLAMA_VULKAN
Model context and caching – OLLAMA_CONTEXT_LENGTH, OLLAMA_FLASH_ATTENTION, OLLAMA_KV_CACHE_TYPE (default f16, options q8_0, q4_0)
Proxy support – HTTPS_PROXY, NO_PROXY (don’t set HTTP_PROXY)
Advanced toggles – OLLAMA_DEBUG, OLLAMA_LLM_LIBRARY, OLLAMA_NOPRUNE, OLLAMA_NOHISTORY, OLLAMA_EDITOR, OLLAMA_REMOTES, OLLAMA_NO_CLOUD, OLLAMA_NEW_ENGINE
Load timeout – OLLAMA_LOAD_TIMEOUT sets how long a model load may stall before Ollama gives up
Multi-user caching – OLLAMA_MULTIUSER_CACHE tunes prompt caching for multi-user scenarios
HSA override – HSA_OVERRIDE_GFX_VERSION for ROCm compute capability emulation
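
The keep-alive formats listed above are easy to mix up, so here is a minimal sketch that converts a value to seconds. The function name keep_alive_seconds is hypothetical and not part of Ollama, and the parsing is deliberately simplified: it handles a single unit suffix (s, m, h), plain seconds, 0, and -1, not compound durations such as 1h30m.

```shell
# Convert an OLLAMA_KEEP_ALIVE-style value to seconds (simplified sketch).
# -1 and 0 are passed through because they are special cases, not durations.
keep_alive_seconds() {
  value=$1
  case "$value" in
    -1|0) echo "$value"; return 0 ;;
  esac
  # Split the value into a number and a multiplier based on its suffix.
  case "$value" in
    *h) number=${value%h}; factor=3600 ;;
    *m) number=${value%m}; factor=60 ;;
    *s) number=${value%s}; factor=1 ;;
    *)  number=$value;     factor=1 ;;
  esac
  # Reject anything that is not a plain non-negative integer.
  case "$number" in
    ''|*[!0-9]*) echo "unsupported keep-alive value: $value" >&2; return 1 ;;
  esac
  echo $(( number * factor ))
}

keep_alive_seconds 5m    # prints 300
keep_alive_seconds 24h   # prints 86400
keep_alive_seconds -1    # prints -1
```

A helper like this is mainly useful for sanity-checking a value in a deploy script before it reaches the server.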

Typical export syntax on Unix:

export OLLAMA_HOST=0.0.0.0
export OLLAMA_KEEP_ALIVE="10m"
export OLLAMA_NUM_PARALLEL=8

After setting variables in your shell or systemd override, restart Ollama to activate the new config.
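
Before restarting, it can help to see which of these variables the current shell would actually pass on. The sketch below is one way to do that; list_ollama_env is a hypothetical helper name, and it only reports the shell’s own environment, not the environment of an already-running systemd service or desktop app.

```shell
# Print every Ollama-related variable visible to the current shell, sorted by name.
# The pattern covers the OLLAMA_* family plus the GPU and proxy variables listed above.
list_ollama_env() {
  env | grep -E '^(OLLAMA_[A-Z_]+|CUDA_VISIBLE_DEVICES|ROCR_VISIBLE_DEVICES|GGML_VK_VISIBLE_DEVICES|HSA_OVERRIDE_GFX_VERSION|HTTPS_PROXY|NO_PROXY)=' | sort
}

export OLLAMA_HOST=0.0.0.0
export OLLAMA_KEEP_ALIVE="10m"
list_ollama_env
```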

Configuring Ollama Environment Variables for Server Networking and API Binding

Out of the box, Ollama binds to 127.0.0.1. That locks API access to the local machine. To open it up across the network (say, for LAN clients or remote tools), set OLLAMA_HOST to 0.0.0.0. This tells the server to listen on all available interfaces. Ollama has no built-in authentication, so pair network exposure with firewall rules or an authenticating reverse proxy: with an open bind, anyone who can reach your host IP can hit the API.

OLLAMA_ORIGINS controls which web origins can make cross-origin requests. The default whitelist covers 127.0.0.1 and 0.0.0.0. If you’re serving a web interface from a different domain or port, add that origin to the allowed list. Missing origins will cause CORS preflight checks to fail, blocking browser-based clients from /api/generate or /api/chat.
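
OLLAMA_ORIGINS takes a comma-separated list, which is awkward to maintain by hand once it grows. A small sketch of building it from separate arguments follows; join_origins is a hypothetical helper, and both origins shown are placeholder values, not defaults.

```shell
# Join all arguments with commas. The function body runs in a subshell,
# so changing IFS here does not leak into the calling shell.
join_origins() (
  IFS=,
  printf '%s\n' "$*"
)

# Placeholder origins for a local dev UI and a hosted front end.
OLLAMA_ORIGINS=$(join_origins http://localhost:3000 https://app.example.com)
export OLLAMA_ORIGINS
echo "$OLLAMA_ORIGINS"   # prints http://localhost:3000,https://app.example.com
```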

Common server-side config steps:

Set OLLAMA_HOST to the bind address you want (0.0.0.0 for network-wide access).
Adjust OLLAMA_ORIGINS to include any extra domains or ports that need browser access.
Verify firewall rules permit traffic on the listening port.
Restart Ollama to apply the new bind and origin settings.
Test connectivity from a remote machine using curl against the new host address.
Check server logs if connections fail. Binding errors and permission issues show up right away at startup.
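
The connectivity test in the steps above could look roughly like this. It assumes Ollama’s default port 11434 and uses the /api/tags endpoint (which lists local models) as a cheap probe. The helper names and the 192.168.1.50 address are hypothetical; substitute your server’s address.

```shell
# Build the base URL for an Ollama server. 11434 is Ollama's default port.
ollama_base_url() {
  host=$1
  port=${2:-11434}
  printf 'http://%s:%s\n' "$host" "$port"
}

# Ask the server for its model list; prints "reachable" or "unreachable".
# --fail turns HTTP errors into a non-zero exit, --max-time bounds the wait.
check_ollama() {
  if curl --silent --fail --max-time 5 "$(ollama_base_url "$@")/api/tags" >/dev/null; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

ollama_base_url 192.168.1.50   # prints http://192.168.1.50:11434
```

Run check_ollama 192.168.1.50 from the remote machine after the restart. An "unreachable" result usually points at the bind address or a firewall rule rather than at Ollama itself.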

Managing Model Storage and File Paths with Ollama Variables

Ollama stores downloaded models in a default directory that varies by OS and install type: ~/.ollama/models on macOS and Linux user installs, /usr/share/ollama/.ollama/models when Ollama runs as the Linux systemd service, and C:\Users\<username>\.ollama\models on Windows. These paths eat disk space in proportion to the number and size of models you pull. Large models can consume tens of gigabytes, so putting model storage on a dedicated volume or external drive is common.

To relocate model storage, set OLLAMA_MODELS to an absolute path pointing to your preferred directory. On Linux, make sure the ollama system user has read and write permissions. Run sudo chown -R ollama:ollama <directory> after creating or moving the folder. Without correct ownership, the server will fail to download or load models and log permission-denied errors.

Default model storage locations across operating systems:

macOS / Linux user installs – ~/.ollama/models
Linux systemd service – /usr/share/ollama/.ollama/models
Windows – C:\Users\<username>\.ollama\models
Custom path – Any directory specified via OLLAMA_MODELS, just verify ownership and permissions
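
As a rough sketch of the relocation step, the helper below creates the target directory, checks that it is an absolute path the current user can write to, and prints the export line. prepare_models_dir is a hypothetical name. The write check applies to whoever runs the script; for the systemd service, the ollama user still needs ownership as described above.

```shell
# Create the directory if needed, confirm the current user can write to it,
# and print the export line to use. Returns non-zero on any problem.
prepare_models_dir() {
  dir=$1
  case "$dir" in
    /*) ;;
    *) echo "OLLAMA_MODELS should be an absolute path: $dir" >&2; return 1 ;;
  esac
  mkdir -p "$dir" || return 1
  if [ ! -w "$dir" ]; then
    echo "no write permission on $dir" >&2
    return 1
  fi
  printf 'export OLLAMA_MODELS=%s\n' "$dir"
}

# Example call (the path is a placeholder):
#   prepare_models_dir /mnt/ssd/ollama-models
```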

Performance‑Related Ollama Environment Variables for Parallelism and Model Loading

Ollama picks concurrency defaults based on available memory and GPU count. OLLAMA_NUM_PARALLEL controls how many requests a single loaded model processes at once. The default is selected automatically between 1 and 4 based on available memory. Raise it to increase throughput when you’ve got spare memory. Lower it to prevent out-of-memory errors under heavy load or when running large models. Each parallel request needs its own share of the model’s KV cache, so doubling parallelism roughly doubles KV cache memory (the model weights are still loaded only once).

OLLAMA_MAX_LOADED_MODELS caps how many distinct models stay in memory at once. The default is three times the number of detected GPUs, or 3 if running CPU only. When a new model would exceed this limit, the request waits until an idle model can be unloaded to make room. Raising the cap lets you switch between models fast without reload delays, at the cost of higher baseline memory consumption. Lowering it frees memory for larger context windows or additional parallel requests within a single model.

OLLAMA_MAX_QUEUE defines the max number of queued requests before the server starts returning HTTP 503 responses. The default of 512 works for most deployments. Increase it if legit traffic spikes cause queue overflows, or reduce it to fail fast under sustained overload rather than building a long backlog.

Variable	Purpose	Default
OLLAMA_NUM_PARALLEL	Parallel requests per model	Dynamic (1–4)
OLLAMA_MAX_LOADED_MODELS	Concurrent loaded models	3× GPUs or 3 (CPU)
OLLAMA_MAX_QUEUE	Queued request limit	512
OLLAMA_KEEP_ALIVE	Model residency time	5m
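
To make the memory cost of parallelism concrete: the context a loaded model has to hold scales with the number of parallel slots. The arithmetic is sketched below; effective_context is a hypothetical helper and the inputs are illustrative values, not recommendations.

```shell
# Total context allocated for one loaded model:
# per-request context length multiplied by the number of parallel slots.
effective_context() {
  context_length=$1
  num_parallel=$2
  echo $(( context_length * num_parallel ))
}

effective_context 2048 4   # prints 8192
effective_context 2048 8   # prints 16384
```

Doubling OLLAMA_NUM_PARALLEL from 4 to 8 doubles the tokens the KV cache must cover, which is why the setting interacts so strongly with OLLAMA_CONTEXT_LENGTH.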

GPU and Hardware Acceleration Environment Variables for Ollama

Ollama detects available GPUs and assigns them automatically, but environment variables let you override device visibility and scheduler behavior. CUDA_VISIBLE_DEVICES restricts which NVIDIA GPUs the process can see. Set it to a comma-separated list of device IDs to isolate Ollama to specific cards. For example, CUDA_VISIBLE_DEVICES=0,2 exposes only devices 0 and 2, hiding device 1 from the runtime. Useful when sharing a multi-GPU server across multiple services.

ROCR_VISIBLE_DEVICES does the same thing for AMD GPUs under ROCm. HSA_OVERRIDE_GFX_VERSION forces the runtime to treat an AMD GPU as a different compute capability, often used to run models on unsupported hardware by emulating a supported GFX version. GGML_VK_VISIBLE_DEVICES applies to Vulkan-accelerated inference, specifying which Vulkan-compatible devices GGML should use. OLLAMA_VULKAN toggles Vulkan acceleration on or off entirely.

GPU-related variables include:

CUDA_VISIBLE_DEVICES – Limit visible NVIDIA GPUs, example: 0,1 to use first two cards
ROCR_VISIBLE_DEVICES – Limit visible AMD ROCm GPUs, syntax mirrors CUDA
HSA_OVERRIDE_GFX_VERSION – Override the AMD GFX version, example: 10.3.0
GGML_VK_VISIBLE_DEVICES – Select Vulkan devices for GGML, comma-separated indices
OLLAMA_VULKAN – Enable or disable Vulkan backend, set to 1 to enable
OLLAMA_GPU_OVERHEAD – Reserve extra VRAM headroom per GPU (value in bytes), helpful when memory estimates run low
OLLAMA_SCHED_SPREAD – Always spread a model across all GPUs, tuning for multi-GPU setups
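
When several services share one machine, it is often safer to scope device visibility to a single process than to export it shell-wide. A minimal sketch follows, with with_gpus as a hypothetical wrapper name. The demonstration uses a harmless sh command so it runs anywhere.

```shell
# Run a command with CUDA_VISIBLE_DEVICES set for that process only.
# The calling shell's environment is left untouched.
with_gpus() {
  ids=$1
  shift
  env CUDA_VISIBLE_DEVICES="$ids" "$@"
}

# The child process sees only devices 0 and 2.
with_gpus 0,2 sh -c 'echo "child sees: $CUDA_VISIBLE_DEVICES"'   # prints child sees: 0,2
```

In practice the wrapped command would be the server itself, for example with_gpus 0,2 ollama serve.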

Configuring Model Behavior with Ollama Context and Attention Variables

OLLAMA_CONTEXT_LENGTH sets the max token window a model can process in a single request. Larger context windows increase memory consumption linearly, because the KV cache grows with every additional token. If you regularly hit context limits or need to process longer documents, raise this value. But monitor VRAM usage to avoid out-of-memory crashes.

OLLAMA_FLASH_ATTENTION enables an optimized attention implementation that cuts memory overhead when working with large contexts. Set it to 1 to activate. The savings grow with context length, so it matters most for long contexts (roughly 8k tokens and up). OLLAMA_KV_CACHE_TYPE controls the precision of the key-value cache, and the quantized types take effect only when Flash Attention is enabled. The default f16 (16-bit float) offers full precision. Switching to q8_0 (8-bit quantized) cuts cache memory roughly in half, and q4_0 (4-bit quantized) cuts it by roughly 75%. q8_0 costs very little accuracy and most workloads tolerate it without noticeable degradation; q4_0 loses more precision, which can show at larger context sizes.

KV Cache Type	Precision	Memory Impact
f16	16-bit float	Baseline (default)
q8_0	8-bit quantized	~50% reduction
q4_0	4-bit quantized	~75% reduction
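
The percentages in the table can be checked with the usual KV cache sizing formula: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The sketch below applies it with GGML’s block sizes (32 elements per block: 64 bytes for f16, 34 for q8_0, 18 for q4_0). kv_cache_mib is a hypothetical helper and the model shape (32 layers, 8 KV heads, head dimension 128) is an illustrative assumption, not any specific model; real usage also includes buffers this estimate ignores.

```shell
# Estimate KV cache size in MiB.
# Arguments: layers, KV heads, head dimension, context length, cache type.
kv_cache_mib() {
  layers=$1; kv_heads=$2; head_dim=$3; context=$4; cache_type=$5
  # Bytes used by one block of 32 cached elements for each type.
  case "$cache_type" in
    f16)  block_bytes=64 ;;
    q8_0) block_bytes=34 ;;
    q4_0) block_bytes=18 ;;
    *) echo "unknown cache type: $cache_type" >&2; return 1 ;;
  esac
  # The factor 2 covers the separate key and value tensors.
  elements=$(( 2 * layers * kv_heads * head_dim * context ))
  echo $(( elements / 32 * block_bytes / 1048576 ))
}

for t in f16 q8_0 q4_0; do
  printf '%s: %s MiB\n' "$t" "$(kv_cache_mib 32 8 128 8192 "$t")"
done
# prints:
# f16: 1024 MiB
# q8_0: 544 MiB
# q4_0: 288 MiB
```

The quantized types land slightly above an exact half and quarter because each block also stores a 16-bit scale.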

Advanced and Debugging‑Focused Ollama Environment Variables

OLLAMA_DEBUG enables verbose logging, printing detailed request traces, model load steps, and internal state transitions to stdout or the system journal. Turn it on when diagnosing crashes, slow responses, or unexpected behavior. The output can be noisy, so enable debug mode only during active troubleshooting.

OLLAMA_LLM_LIBRARY forces Ollama to use a specific LLM library, bypassing autodetection. Rarely needed outside custom builds or experimental patches. OLLAMA_NOPRUNE stops Ollama from pruning unused model blobs at startup, which keeps unreferenced layers on disk at the cost of disk space. OLLAMA_NOHISTORY stops the interactive CLI from saving its readline history, useful in stateless or ephemeral environments where you don’t want prior prompts written to disk.

OLLAMA_EDITOR sets the default text editor for interactive prompts or model editing workflows. OLLAMA_REMOTES lists the hosts allowed to serve remote models (ollama.com by default). OLLAMA_NO_CLOUD disables Ollama’s cloud features, which helps in restricted or air-gapped environments but does not replace network-level egress controls. OLLAMA_NEW_ENGINE toggles the new inference engine before it graduates to stable. Use it to test upcoming features or performance improvements ahead of official releases.

Proxy and Network Routing Variables for Ollama Deployments

Corporate and institutional networks often require traffic to flow through an HTTP or HTTPS proxy. HTTPS_PROXY tells Ollama where to route model download requests. Set it to your proxy’s address and port, for example, HTTPS_PROXY=http://proxy.company.com:8080. Unlike typical proxy configs, Ollama’s docs explicitly warn against setting HTTP_PROXY, because it can interfere with client connections to the API server itself.

NO_PROXY defines a bypass list for addresses that should skip the proxy. Use it to exclude internal hostnames, local IP ranges, or the Ollama server’s own bind address from proxy routing. When running behind a corporate proxy, install the proxy’s root certificate as a system-trusted certificate so Ollama can verify TLS connections to model repositories without certificate errors.

Best practices for proxy configuration:

Set only HTTPS_PROXY, not HTTP_PROXY, to avoid disrupting API client traffic
Add local and loopback addresses to NO_PROXY to prevent routing loops
Install the corporate proxy certificate chain system-wide before first model download
Test model pulls after changing proxy settings to confirm certificate validation succeeds
Review firewall egress rules to ensure the proxy itself can reach model registries
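
Put together, the first two practices might look like this in a shell session. The proxy address is the placeholder used earlier in this section, and the NO_PROXY entries are a minimal assumption; add your internal hostnames as needed. For a systemd service, the same values go into Environment= lines instead of exports.

```shell
# Placeholder proxy address; replace with your own.
export HTTPS_PROXY=http://proxy.company.com:8080
# Keep loopback traffic, including local API calls, off the proxy.
export NO_PROXY=localhost,127.0.0.1
# Make sure HTTP_PROXY is not inherited from the parent environment.
unset HTTP_PROXY

# Show the resulting proxy settings.
env | grep -E '^(HTTPS_PROXY|HTTP_PROXY|NO_PROXY)='
```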

OS‑Specific Instructions for Setting Ollama Environment Variables

macOS

On macOS, Ollama runs as a standard application, not a system service. Environment variables must be set using launchctl, which modifies the per-user launchd environment. After setting a variable, quit the Ollama app completely (from the menu bar or via Force Quit) and relaunch it.

Steps to configure on macOS:

Open Terminal.
Run launchctl setenv OLLAMA_HOST "0.0.0.0" (or whichever variable you need).
Quit Ollama from the menu bar.
Restart Ollama to load the new environment.

For shell session overrides, you can also add export OLLAMA_HOST=0.0.0.0 to ~/.bash_profile or ~/.zshrc and source the file you edited, but this only affects processes launched from that shell, not the GUI application.

Linux

Linux deployments typically run Ollama as a systemd service. To set environment variables persistently, edit the service unit using systemctl edit ollama.service, which opens an override file. Add your variables under the [Service] section using Environment= directives. After saving, reload the systemd daemon and restart the service.

Steps to configure on Linux:

Run sudo systemctl edit ollama.service.
Add lines like Environment="OLLAMA_HOST=0.0.0.0" under [Service].
Save and close the editor.
Run sudo systemctl daemon-reload to apply changes.
Run sudo systemctl restart ollama to restart with new settings.
If changing OLLAMA_MODELS, ensure ownership: sudo chown -R ollama:ollama /new/path.

Verify the variable took effect with systemctl show ollama.service --property=Environment. Running echo $OLLAMA_HOST in your own shell does not help here, because the service does not inherit your shell’s environment.
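
For scripted setups where opening an editor is not an option, the override file can be generated instead. The sketch below only prints the drop-in content; ollama_dropin is a hypothetical helper, and the values are examples.

```shell
# Print a systemd drop-in with one Environment= line per NAME=value argument.
ollama_dropin() {
  echo '[Service]'
  for assignment in "$@"; do
    printf 'Environment="%s"\n' "$assignment"
  done
}

ollama_dropin OLLAMA_HOST=0.0.0.0 OLLAMA_KEEP_ALIVE=10m
# prints:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0"
# Environment="OLLAMA_KEEP_ALIVE=10m"
```

Written to /etc/systemd/system/ollama.service.d/override.conf (the file systemctl edit creates), this has the same effect as the manual steps, and it still needs the daemon-reload and restart that follow them.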

For more details, see the official Environment Variables reference and Setting Up Environment Variables for Ollama.

Windows

On Windows, Ollama inherits both user and system environment variables. The recommended approach is to use the graphical environment variable editor: open Settings (Windows 11) or Control Panel (Windows 10), search for “environment variables,” and click “Edit environment variables for your account.” Add or modify variables in the user section, click OK, then quit Ollama from the system tray and restart it.

Steps to configure on Windows:

Open Settings, search for “environment variables,” and select “Edit environment variables for your account.”
Click “New” and enter the variable name (e.g., OLLAMA_HOST) and value (e.g., 0.0.0.0).
Click OK to save.
Right-click the Ollama tray icon and select Quit.
Relaunch Ollama from the Start menu or desktop shortcut.

Alternatively, use Command Prompt: setx OLLAMA_HOST "0.0.0.0". After running setx, close and reopen Ollama. Remember that setx writes to the registry and affects future sessions, while set only modifies the current CMD window.

Practical Examples of Ollama Environment Variable Configurations

Production Server Configuration
Bind to all interfaces, increase queue depth, set longer keep-alive to reduce reload overhead, and pin models to a dedicated SSD:

export OLLAMA_HOST=0.0.0.0
export OLLAMA_MAX_QUEUE=1024
export OLLAMA_KEEP_ALIVE=30m
export OLLAMA_MODELS=/mnt/ssd/ollama-models

Development Configuration
Bind to localhost, enable debug logging, unload models immediately after each response to free memory for other tools:

export OLLAMA_HOST=127.0.0.1
export OLLAMA_DEBUG=1
export OLLAMA_KEEP_ALIVE=0

Multi-GPU Setup
Expose first two NVIDIA GPUs, spread scheduling across them, and allow higher parallelism:

export CUDA_VISIBLE_DEVICES=0,1
export OLLAMA_SCHED_SPREAD=1
export OLLAMA_NUM_PARALLEL=8
export OLLAMA_MAX_LOADED_MODELS=6

CPU-Only Mode
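Hide NVIDIA GPUs by passing an invalid device ID (use ROCR_VISIBLE_DEVICES=-1 for AMD), turn off the Vulkan backend, reduce parallelism to match CPU cores, and use quantized KV cache to save RAM (cache quantization takes effect only with Flash Attention enabled):

export CUDA_VISIBLE_DEVICES=-1
export OLLAMA_VULKAN=0
export OLLAMA_NUM_PARALLEL=4
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0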

Use production config for always-on inference endpoints, development config during model testing and iteration, multi-GPU config to maximize hardware utilization on workstations, and CPU-only mode when running on cloud instances without GPU access.

For additional setup guidance and configuration examples, see Configuring Your Ollama Server.

Final Words

You now have a compact map of the main Ollama settings: host and CORS rules, model paths, performance and parallelism knobs, GPU and Vulkan flags, context/attention controls, debugging options, proxy tips, OS setup, and example configs.

Remember: most changes need a restart and some paths need ownership tweaks. Use the examples to test safely and tune OLLAMA_KEEP_ALIVE, OLLAMA_NUM_PARALLEL, and CUDA_VISIBLE_DEVICES for your workload.

Treat this as your quick reference for ollama environment variables — copy, test, and ship with more confidence.

FAQ

Q: What are Ollama environment variables and do changes require a restart?

A: Ollama environment variables configure host, GPU, storage, performance, and debugging; most changes require restarting the Ollama service for the new values to take effect.

Q: How do OLLAMA_HOST and OLLAMA_ORIGINS affect API binding and security?

A: OLLAMA_HOST sets the API bind address (default 127.0.0.1; use 0.0.0.0 for network access). OLLAMA_ORIGINS controls allowed web origins; both require a restart after changes.

Q: Where does Ollama store models and how do I change the model directory?

A: OLLAMA_MODELS points to the model folder; defaults are ~/.ollama/models on macOS/Linux and C:\Users\<username>\.ollama\models on Windows. Change the path, fix ownership (chown), then restart Ollama.

Q: How do OLLAMA_NUM_PARALLEL, OLLAMA_MAX_LOADED_MODELS and OLLAMA_MAX_QUEUE affect performance?

A: These variables control concurrency and memory: OLLAMA_NUM_PARALLEL sets parallel requests per model (dynamic 1–4), OLLAMA_MAX_LOADED_MODELS limits models loaded in memory (default 3× GPUs or 3), and OLLAMA_MAX_QUEUE caps queued requests (default 512).

Q: What formats does OLLAMA_KEEP_ALIVE accept and what do common values mean?

A: OLLAMA_KEEP_ALIVE accepts durations like “5m”, “10m”, “24h”, numeric seconds (3600), -1 to keep models loaded indefinitely, and 0 to unload models immediately.

Q: How do I select GPUs and control device visibility for Ollama?

A: Use CUDA_VISIBLE_DEVICES or ROCR_VISIBLE_DEVICES to whitelist GPUs, GGML_VK_VISIBLE_DEVICES for Vulkan, and HSA_OVERRIDE_GFX_VERSION for ROCm compatibility; set these before starting Ollama.

Q: What do OLLAMA_VULKAN and OLLAMA_GPU_OVERHEAD control?

A: OLLAMA_VULKAN enables the Vulkan backend; OLLAMA_GPU_OVERHEAD reserves GPU memory for driver/system overhead to reduce out-of-memory errors. Tweak values based on model size and GPU free memory, then restart.

Q: How do OLLAMA_CONTEXT_LENGTH, OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE change model behavior?

A: OLLAMA_CONTEXT_LENGTH sets the token window size; OLLAMA_FLASH_ATTENTION reduces memory use and can speed inference; OLLAMA_KV_CACHE_TYPE sets cache precision (default f16; q8_0 and q4_0 available).

Q: Which proxy variables should I set for corporate networks, and what pitfalls exist?

A: Set HTTPS_PROXY when behind a corporate proxy, use NO_PROXY for local bypasses, avoid setting HTTP_PROXY unless required, and install proxy certificates system-wide to prevent TLS failures.

Q: Which advanced or debug variables help with troubleshooting or custom engines?

A: OLLAMA_DEBUG enables verbose logging; OLLAMA_LLM_LIBRARY switches backend; OLLAMA_NOPRUNE, OLLAMA_NOHISTORY, and OLLAMA_REMOTES toggle pruning, history, and remote models. Change, restart, and inspect logs.

Q: How do I set Ollama environment variables on macOS, Linux, and Windows?

A: On macOS use launchctl setenv and restart the app. On Linux add Environment= lines to the systemd unit, daemon-reload, then restart and chown paths. On Windows use the GUI or setx and restart Ollama.

Q: Can you give example configurations for production, development, multi-GPU, and CPU-only setups?

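A: Production: OLLAMA_HOST=0.0.0.0 with a larger OLLAMA_MAX_QUEUE, a longer OLLAMA_KEEP_ALIVE, and OLLAMA_MODELS on a dedicated SSD. Development: localhost binding, OLLAMA_DEBUG=1, and OLLAMA_KEEP_ALIVE=0. Multi-GPU: set CUDA_VISIBLE_DEVICES, enable OLLAMA_SCHED_SPREAD, and raise OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS. CPU-only: hide the GPUs from Ollama, lower OLLAMA_NUM_PARALLEL, and use a quantized KV cache.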