AI: Ollama


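Overview

An open-source tool for running and managing large language models (LLMs) locally
"Docker for LLMs" concept: download, run, and serve a model with a single command
Cross-platform (macOS / Linux / Windows), with automatic GPU acceleration (CUDA / Metal / ROCm)
Official site / GitHub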

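Features

Simple install and use: single binary, minimal dependencies
Local execution: data never leaves the machine, which helps with privacy and regulatory compliance
OpenAI-compatible API: reuse existing OpenAI SDK code by changing only base_url
Model library: a curated set of major open models, including Llama, Qwen, Phi, Mistral, DeepSeek-R1, and Gemma
Quantization (GGUF) by default: 4-bit / 8-bit quantized models run on consumer GPUs and CPUs
Modelfile: Dockerfile-like syntax for defining a custom model that combines a system prompt, parameters, and LoRA adapters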
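
To make the quantization point concrete, here is a rough back-of-the-envelope sketch of how bit width affects the memory needed for model weights. The function name and the 8-billion-parameter figure are illustrative assumptions. The estimate covers weights only, ignoring the KV cache and runtime overhead, and real GGUF quantization formats typically use slightly more than their nominal bit width per weight.

```python
def estimate_weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB (hypothetical helper)."""
    total_bytes = n_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / (1024 ** 3)              # bytes -> GiB


# Compare an assumed 8B-parameter model at common precisions.
for bits in (16, 8, 4):
    gib = estimate_weight_memory_gib(8e9, bits)
    print(f"8B parameters at {bits}-bit: ~{gib:.1f} GiB for weights")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is what brings mid-sized models within reach of consumer hardware.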

Installation

macOS / Windows: installer package from the official site

Linux

curl -fsSL https://ollama.com/install.sh | sh

Docker

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

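Basic commands

Command	Description
`ollama pull <model>`	Download a model
`ollama run <model>`	Run interactively (pulls the model automatically if it is missing)
`ollama serve`	Start the API server (default port 11434)
`ollama list`	List local models
`ollama ps`	Show running models (check memory usage)
`ollama rm <model>`	Delete a model
`ollama create <name> -f Modelfile`	Build a custom model from a Modelfile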

Modelfile

Declaratively defines the base model, system prompt, parameters, template, LoRA adapters, and so on

Example

FROM llama3.2

PARAMETER temperature 0.7
PARAMETER num_ctx 8192

SYSTEM """
You are a friendly coding assistant who answers in Korean.
"""

TEMPLATE """
{{ .System }}
User: {{ .Prompt }}
Assistant:"""

Build and run

ollama create my-coder -f Modelfile
ollama run my-coder
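
When several variants of a custom model are needed, the Modelfile text can be generated instead of written by hand. The following is a minimal sketch, assuming only the FROM, PARAMETER, and SYSTEM instructions shown in the example above; `render_modelfile` is a hypothetical helper and not part of Ollama.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Render a minimal Modelfile (hypothetical helper, not an Ollama API).

    Assumes `system` does not itself contain a triple-quote sequence,
    which would end the SYSTEM block early.
    """
    lines = [f"FROM {base}", ""]
    for key, value in params.items():
        lines.append(f"PARAMETER {key} {value}")
    lines += ["", 'SYSTEM """', system, '"""', ""]
    return "\n".join(lines)


text = render_modelfile(
    base="llama3.2",
    system="You are a friendly coding assistant who answers in Korean.",
    params={"temperature": 0.7, "num_ctx": 8192},
)
print(text)
```

Writing the returned string to a file named Modelfile and running the `ollama create` command above would then build the model.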

REST API

Base URL: http://localhost:11434
Main endpoints: /api/generate, /api/chat, /api/embeddings, /api/tags

Chat request example

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": false
}'
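
The same request can be made from Python using only the standard library. This is a minimal sketch, assuming a server on the default port from the text above; `build_chat_request` and `chat` are hypothetical helper names. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default base URL; adjust if the server runs elsewhere


def build_chat_request(model, messages, stream=False):
    """Build a POST request for /api/chat without sending it."""
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, messages):
    """Send a non-streaming chat request and return the reply text."""
    req = build_chat_request(model, messages)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With a server running and the model pulled, `chat("llama3.2", [{"role": "user", "content": "Hello"}])` should return the assistant's reply as a string.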

OpenAI-compatible API

Provides /v1/chat/completions, /v1/embeddings, and /v1/models

The OpenAI Python SDK works as-is

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any value; the SDK requires one, but Ollama ignores it
)

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is Ollama?"}],
)
print(resp.choices[0].message.content)

Python client (ollama-python)

import ollama

messages = [{"role": "user", "content": "Summarize in one line"}]

# Single response
resp = ollama.chat(model="llama3.2", messages=messages)
print(resp["message"]["content"])

# Streaming: each chunk carries the next piece of the reply
for chunk in ollama.chat(model="llama3.2", messages=messages, stream=True):
    print(chunk["message"]["content"], end="", flush=True)

Embeddings

Supports embedding models such as nomic-embed-text, mxbai-embed-large, and bge-m3

The embedding layer of a RAG pipeline can run locally

ollama pull nomic-embed-text

curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Document text to index for search"
}'
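
Once embeddings are available, the retrieval step of a RAG pipeline usually reduces to ranking documents by vector similarity. The sketch below uses cosine similarity, a common choice rather than anything Ollama prescribes. The function names and the toy two-dimensional vectors are illustrative assumptions; real embedding vectors, taken from the `embedding` field of the response, have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for doc_id, score in rank_documents([1.0, 0.0], docs):
    print(doc_id, round(score, 3))
```

For more than a few thousand documents, a vector index or database would normally replace this linear scan.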

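Comparison

Tool	Characteristics
Ollama	Integrated CLI and server, OpenAI-compatible API, Modelfile, easy to use
llama.cpp	Low-level inference engine, GGUF standard, broadest compatibility
LM Studio	GUI-based local LLM runner, friendly to non-developers
vLLM	Server-grade high-performance serving (PagedAttention), production-oriented
Text Generation Inference (TGI)	Hugging Face's production serving stack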

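Use cases

Local RAG / personal knowledge base (integrates with Open WebUI, AnythingLLM, LM Studio)
Code assistants (backend for IDE extensions and tools such as Continue, Cline, and Aider)
Offline and on-premises chatbots
AI agent prototyping (LLM backend for LangChain / LlamaIndex / CrewAI)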
