Local AI Experimental

⚠️

Experimental: the ai API may change before v1.0. Core functionality works today, but the API surface is not yet stable.

VeloxKit bundles llama.cpp for on-device inference. No network required, no API keys.

Requires capability: "ai"

Setup

Place model files in assets/models/:

assets/models/
├── nomic-embed-text-v1.5.Q4_K_M.gguf    ← embedding model (~80MB)
└── llama-3.2-1b-instruct.Q4_K_M.gguf   ← generation model (~900MB)

Then register each model in your config and enable the ai capability:

export default defineConfig({
  capabilities: ['ai', 'db'],
  ai: {
    models: [
      {
        name: 'embedder',
        path: './assets/models/nomic-embed-text-v1.5.Q4_K_M.gguf',
        type: 'embedding',
      },
      {
        name: 'generator',
        path: './assets/models/llama-3.2-1b-instruct.Q4_K_M.gguf',
        type: 'generation',
      },
    ],
  },
})
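
A typo in a model entry tends to surface only when the model is first used, so it can help to check the list up front. The sketch below is a hypothetical helper, not part of VeloxKit: the ModelEntry type mirrors the fields shown in the config above, and the two rules it checks (unique names, .gguf file extension) are assumptions based on how the names are used as lookup keys and on llama.cpp's GGUF model format.

```typescript
// Hypothetical shape mirroring the entries under ai.models above.
interface ModelEntry {
  name: string
  path: string
  type: 'embedding' | 'generation'
}

// Returns a list of human-readable problems; an empty list means
// the entries passed both (assumed) checks.
function findConfigProblems(models: ModelEntry[]): string[] {
  const problems: string[] = []
  const seen = new Set<string>()

  for (const model of models) {
    // Names are used to look models up, so duplicates would be ambiguous.
    if (seen.has(model.name)) {
      problems.push(`Duplicate model name: ${model.name}`)
    }
    seen.add(model.name)

    // llama.cpp loads GGUF files; flag anything with another extension.
    if (!model.path.toLowerCase().endsWith('.gguf')) {
      problems.push(`Not a .gguf file: ${model.path}`)
    }
  }

  return problems
}
```

Running this over the same array you pass to ai.models in a build script or test gives an early warning before the app ships.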

Embeddings

Generate a vector embedding for text:

import { ai } from 'veloxkit'
 
const embedding = await ai.embed('embedder', 'My note content here')
// → Float32Array (768 dimensions for nomic-embed-text-v1.5)
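
Two embeddings can be compared directly without a database. The helper below is a minimal sketch of cosine similarity over two Float32Array values; cosineSimilarity is an illustrative name, not a VeloxKit function.

```typescript
// Cosine similarity between two embeddings of equal length.
// Illustrative helper; not part of the VeloxKit API.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`)
  }

  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }

  // A zero vector has no direction; report 0 rather than NaN.
  if (normA === 0 || normB === 0) return 0

  return dot / Math.sqrt(normA * normB)
}
```

Values close to 1 indicate texts with similar meaning, and values near 0 indicate unrelated texts. For more than a handful of comparisons, a vector index is usually the better tool.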

Semantic search with sqlite-vec

Store embeddings and retrieve similar documents:

import { ai, db } from 'veloxkit'
 
// Setup (run once in initDatabase)
// The vec0 column width must match the embedder's output size
// (768 for nomic-embed-text-v1.5).
db.execute(`
  CREATE VIRTUAL TABLE IF NOT EXISTS note_embeddings
  USING vec0(embedding FLOAT[768])
`)
 
// Index a note
async function indexNote(noteId: number, content: string) {
  const embedding = await ai.embed('embedder', content)
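  // vec0 tables may reject INSERT OR REPLACE, so remove any existing row first
  db.execute('DELETE FROM note_embeddings WHERE rowid = ?', [noteId])
  db.execute(
    'INSERT INTO note_embeddings(rowid, embedding) VALUES (?, ?)',
    [noteId, embedding]
  )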
}
 
// Search
async function semanticSearch(query: string, limit = 5) {
  const queryEmbedding = await ai.embed('embedder', query)
 
  return db.query<{ id: number; title: string; distance: number }>(
    `SELECT notes.id, notes.title, ne.distance
     FROM note_embeddings ne
     JOIN notes ON notes.id = ne.rowid
     WHERE ne.embedding MATCH ?
       AND k = ?
     ORDER BY ne.distance`,
    [queryEmbedding, limit]
  )
}
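
To make the MATCH ... AND k = ? query less opaque: it asks the index for the k stored vectors closest to the query vector, with smaller distances meaning closer matches. The sketch below does the same thing by brute force in memory, assuming Euclidean (L2) distance, which is believed to be the sqlite-vec default. StoredEmbedding, Neighbor, l2Distance, and nearest are illustrative names, not part of VeloxKit or sqlite-vec.

```typescript
interface StoredEmbedding {
  id: number
  embedding: Float32Array
}

interface Neighbor {
  id: number
  distance: number
}

// Euclidean distance between two vectors of equal length.
function l2Distance(a: Float32Array, b: Float32Array): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i]
    sum += d * d
  }
  return Math.sqrt(sum)
}

// Brute-force k-nearest-neighbour search: score every item,
// sort ascending by distance, keep the first k.
function nearest(
  query: Float32Array,
  items: StoredEmbedding[],
  k: number,
): Neighbor[] {
  return items
    .map(item => ({ id: item.id, distance: l2Distance(query, item.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
}
```

This scans every vector on each call, so it is only practical for small collections or for checking query results in tests.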

Text generation

Stream a response from a local LLM:

import { ai } from 'veloxkit'
import { useState } from 'react'
 
function AIAssistant() {
  const [response, setResponse] = useState('')
  const [loading, setLoading] = useState(false)
 
  async function ask(prompt: string) {
    setLoading(true)
    setResponse('')
 
    try {
      const stream = ai.generate('generator', {
        prompt,
        maxTokens: 512,
        temperature: 0.7,
        systemPrompt: 'You are a helpful assistant. Be concise.',
      })
 
      // Append each chunk as it arrives so the text renders incrementally
      for await (const chunk of stream) {
        setResponse(prev => prev + chunk)
      }
    } finally {
      // Clear the loading state even if generation throws
      setLoading(false)
    }
  }
 
  return (
    <View style={{ flex: 1, padding: 16, gap: 12 }}>
      <Pressable
        onPress={() => ask('Summarize my recent notes')}
        style={{ padding: '10px 16px', background: '#00A878', borderRadius: 6 }}
      >
        <Text style={{ color: '#fff' }}>Ask AI</Text>
      </Pressable>
      {loading && <Text style={{ color: '#666677' }}>Generating...</Text>}
      <Text>{response}</Text>
    </View>
  )
}
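
Outside of a component, it is often simpler to gather the whole response into one string. The sketch below assumes only what the example above already relies on, namely that the stream is an async iterable of string chunks. collectStream and fakeStream are illustrative names, not VeloxKit functions; fakeStream stands in for a real model stream so the example runs on its own.

```typescript
// Drains an async iterable of text chunks into a single string.
// onChunk, if provided, receives the accumulated text after each chunk.
async function collectStream(
  stream: AsyncIterable<string>,
  onChunk?: (textSoFar: string) => void,
): Promise<string> {
  let text = ''
  for await (const chunk of stream) {
    text += chunk
    onChunk?.(text)
  }
  return text
}

// Stand-in for a token stream, used only for demonstration.
async function* fakeStream(): AsyncGenerator<string> {
  yield 'Hello'
  yield ', '
  yield 'world'
}
```

Under that assumption, a call such as collectStream(ai.generate('generator', { prompt }), setResponse) would update the UI as chunks arrive and resolve with the complete text.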

Transcription (Whisper)

⚠️

Transcription requires a Whisper GGUF model. Coming in v0.4.

// Coming in v0.4
const transcript = await ai.transcribe('whisper', audioBuffer)

Model loading

Models load lazily on first use. You can preload them:

import { ai } from 'veloxkit'
 
// Preload on app start to avoid first-use latency
await ai.preload('embedder')

Check load status:

const status = ai.modelStatus('embedder')
// → 'unloaded' | 'loading' | 'ready' | 'error'
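
If a screen should stay disabled until a model is usable, the status values above can be polled. The helper below is a minimal sketch: waitUntilReady is a hypothetical function, not part of VeloxKit, and it takes a status getter so it does not depend on the ai module directly. The interval and timeout defaults are arbitrary.

```typescript
type ModelStatus = 'unloaded' | 'loading' | 'ready' | 'error'

// Polls getStatus until it reports 'ready'. Rejects if the status
// becomes 'error' or if the timeout elapses first.
async function waitUntilReady(
  getStatus: () => ModelStatus,
  { intervalMs = 50, timeoutMs = 30_000 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs

  for (;;) {
    const status = getStatus()
    if (status === 'ready') return
    if (status === 'error') throw new Error('Model failed to load')
    if (Date.now() >= deadline) throw new Error('Timed out waiting for model')

    // Sleep before checking again.
    await new Promise<void>(resolve => setTimeout(resolve, intervalMs))
  }
}
```

A call would look like waitUntilReady(() => ai.modelStatus('embedder')). Because models load lazily, something must start the load first (for example ai.preload), otherwise the status stays 'unloaded' until the timeout.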

Performance notes

Embedding a short text: ~5–20ms on Apple M2
Generating 100 tokens: ~800ms on Apple M2 (1B model, Q4)
Models are loaded into memory once and kept loaded until the app exits
GPU acceleration (Metal/Vulkan) is automatic when available
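
The generation figure above can be turned into a rough budget for a given maxTokens. This sketch extrapolates linearly from the quoted 800ms per 100 tokens, which is an assumption: it ignores prompt processing, model load time, and hardware differences. Both function names are illustrative.

```typescript
// Figure quoted above for an Apple M2 with the 1B Q4 model.
const MS_PER_100_TOKENS = 800

// Estimated wall-clock time to generate `tokens` tokens.
function estimateGenerationMs(
  tokens: number,
  msPer100Tokens = MS_PER_100_TOKENS,
): number {
  return (tokens * msPer100Tokens) / 100
}

// Throughput implied by the same figure.
function tokensPerSecond(msPer100Tokens = MS_PER_100_TOKENS): number {
  return (100 * 1000) / msPer100Tokens
}

console.log(estimateGenerationMs(512)) // 4096 (about 4.1 seconds)
console.log(tokensPerSecond())         // 125
```

On that basis, a response that uses the full maxTokens: 512 from the generation example would take roughly four seconds on the quoted hardware, which is one reason to stream output rather than wait for the full text.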
