User Manual
A comprehensive guide to using Voiceworks Toolkit for Japanese language study.
Getting Started
After installing the userscript, browse to the platform. The toolkit activates automatically and you'll notice several enhancements right away:
- A new sidebar menu with Radio Mode and Playlist Mode toggles
- Keyboard shortcuts for playback control
- Infinite scroll replacing pagination
- Translated tags and UI elements (when translation services are enabled)
- Progress tracking checkmarks on work cards
Most features work out of the box. AI-powered features need light setup in Settings: Whisper requires a one-time model download, and semantic search requires a Jina API key.
Settings & Configuration
Access the settings panel from the Settings page. The toolkit adds its own configuration sections alongside the native settings.
General Settings
- Language: Switch between English, Chinese, and Japanese UI localization
- SFW Mode: Hide all images and thumbnails for use in public environments
- Infinite Scroll: Toggle automatic page loading on scroll
AI Settings
- Whisper Model: Choose the transcription model size (tiny, small, medium) and quantization level. Larger models are more accurate but use more memory
- Whisper Language: Set the default transcription language (Japanese, English, Chinese, Korean, etc.)
- Translation Settings: Configure web translation behavior and caching controls
- Jina API Key: Enter your free Jina API key to enable semantic vector search
Playback Settings
- Shuffle: Enable or disable shuffle for Radio and Playlist modes
- Auto-advance: Automatically move to the next work when the current one finishes
Learner Mode
Learner Mode displays dual-language subtitles during playback, designed for immersion-based Japanese study.
How It Works
- Navigate to any voicework with available subtitle files (LRC format) or enable Whisper for live transcription
- The primary line shows Japanese text (kana/kanji)
- The secondary line shows English translation, blurred by default
- Hover over the English line or press B to reveal the translation
Subtitle Sources
Learner Mode uses subtitles from these sources, in priority order:
- LRC files: Pre-existing lyric files bundled with the voicework
- Whisper transcription: Live speech-to-text when enabled (overrides static subtitles)
- Cached transcripts: Previously transcribed content is loaded instantly
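The priority order above can be sketched as a small selection function. This is an illustrative reading of the rules, not the toolkit's actual code; the function and field names are assumptions:

```javascript
// Illustrative priority logic: when Whisper is enabled it overrides static
// subtitles, with a cached transcript used before re-running the model;
// otherwise bundled LRC files are used. All names here are assumptions.
function chooseSubtitleSource({ whisperEnabled, hasCachedTranscript, hasLrc }) {
  if (whisperEnabled) {
    return hasCachedTranscript ? "cached-transcript" : "whisper-live";
  }
  return hasLrc ? "lrc" : null;
}
```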
Chinese Content
When Chinese-language subtitles are detected (CJK text without kana), they are automatically translated to Japanese via remote translation so the primary line always shows Japanese text. This ensures learners studying Japanese always see their target language first.
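The "CJK text without kana" heuristic can be sketched with two Unicode range checks. This is a minimal illustration of the idea, not the toolkit's actual detector:

```javascript
// Heuristic sketch: treat a subtitle line as Chinese when it contains CJK
// ideographs but no Japanese kana. The function name is illustrative.
function looksChinese(text) {
  const hasHan = /[\u4E00-\u9FFF]/.test(text);               // CJK unified ideographs
  const hasKana = /[\u3040-\u309F\u30A0-\u30FF]/.test(text); // hiragana + katakana
  return hasHan && !hasKana;
}
```

Japanese lines almost always mix kana with kanji, so the presence of any hiragana or katakana is a strong signal the line is already Japanese.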
Tip: Use playback speed controls ([ and ] keys) to slow down audio during study. The lead-time setting in the player lets you read ahead before each line is spoken.
Live Transcription (Whisper)
The live transcription feature uses the Whisper speech recognition model to generate subtitles in real-time from any audio being played.
Enabling Transcription
- Click the microphone icon in the player controls
- On first use, the Whisper model will download (~150 MB). This is cached for future use
- Once loaded, transcription begins automatically as audio plays
Model Selection
Choose the appropriate model size in Settings based on your hardware:
- Tiny: Fastest, lowest memory, good for real-time use on modest hardware
- Small: Better accuracy, recommended for systems with WebGPU support
- Medium: Best accuracy, requires a capable GPU and sufficient memory
Caching
Transcripts are automatically cached per-track with a 90-day TTL. When you revisit a previously transcribed voicework, subtitles appear instantly without re-running the model.
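The 90-day freshness check amounts to simple timestamp arithmetic. A minimal sketch, with illustrative field names:

```javascript
// Sketch of the 90-day TTL check on a cached transcript entry.
// The entry shape ({ savedAt: epoch ms }) is an assumption.
const TRANSCRIPT_TTL_MS = 90 * 24 * 60 * 60 * 1000; // 90 days in milliseconds

function isTranscriptFresh(entry, now = Date.now()) {
  return now - entry.savedAt < TRANSCRIPT_TTL_MS;
}
```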
Exporting
Download transcripts in standard subtitle formats:
- LRC: Lyric format, compatible with most music players and study tools
- VTT: WebVTT format, standard for web video subtitles
- SRT: SubRip format, widely supported by media players and Anki
Download buttons appear in the file tree and flat view after transcription completes.
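To give a feel for the LRC output, here is a rough sketch of converting a cue list into LRC lines. The `{ start: seconds, text }` cue shape is an assumption for illustration:

```javascript
// Illustrative cue-to-LRC conversion; not the toolkit's exporter.
function toLrcTimestamp(seconds) {
  const m = String(Math.floor(seconds / 60)).padStart(2, "0");
  const s = (seconds % 60).toFixed(2).padStart(5, "0"); // e.g. "05.00"
  return `[${m}:${s}]`;
}

function cuesToLrc(cues) {
  return cues.map(c => `${toLrcTimestamp(c.start)}${c.text}`).join("\n");
}
```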
Translation System
The toolkit includes two translation mechanisms that work together to provide comprehensive English translation across the interface.
Neural Translation (Web Pipeline)
Translation uses a web pipeline with host rotation, retry/backoff, in-flight deduplication, and shared caching. It translates:
- Player track titles (Japanese/Chinese to English)
- Content tags across all pages
- Work card titles in grid and list views
- Circle and voice actor names
No local translation model download is required. Cached translations are reused across features for faster repeat lookups.
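The rotation and backoff behavior can be sketched with two small helpers. Host names, the base delay, and the doubling factor are illustrative assumptions, not the toolkit's actual values:

```javascript
// Sketch of the arithmetic behind host rotation and retry backoff.
function nextHost(hosts, attempt) {
  return hosts[attempt % hosts.length]; // cycle through available hosts
}

function backoffDelayMs(attempt, baseMs = 250) {
  return baseMs * 2 ** attempt; // exponential backoff: 250, 500, 1000, ...
}
```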
Interface Translation (Static)
A built-in translation map converts static UI elements (buttons, menus, sort options, labels) to English. This works immediately without any model download and covers the complete interface.
Performance
Translation requests are batched in an 8ms coalescing window and deduplicated to prevent redundant work. Single-text requests (like the currently playing track title) are prioritized over batch operations for lower visible latency.
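The coalescing-and-deduplication idea can be sketched as a tiny batcher: calls arriving within the window share one flush, and identical texts share one promise. This is a simplified illustration, with assumed class and method names, not the toolkit's implementation:

```javascript
// Sketch of an 8 ms coalescing window with in-flight deduplication.
class TranslationBatcher {
  constructor(translateBatch, windowMs = 8) {
    this.translateBatch = translateBatch; // (texts: string[]) => Promise<string[]>
    this.windowMs = windowMs;
    this.pending = new Map(); // text -> { promise, resolve, reject }
    this.timer = null;
  }

  translate(text) {
    const existing = this.pending.get(text);
    if (existing) return existing.promise; // duplicate text reuses the in-flight promise
    const entry = {};
    entry.promise = new Promise((resolve, reject) => {
      entry.resolve = resolve;
      entry.reject = reject;
    });
    this.pending.set(text, entry);
    if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.windowMs); // coalescing window
    }
    return entry.promise;
  }

  async flush() {
    this.timer = null;
    const batch = this.pending;
    this.pending = new Map();
    const texts = [...batch.keys()];
    try {
      const results = await this.translateBatch(texts);
      texts.forEach((t, i) => batch.get(t).resolve(results[i]));
    } catch (err) {
      batch.forEach(entry => entry.reject(err));
    }
  }
}
```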
Search & Discovery
Semantic Search
Semantic search lets you find voiceworks by meaning rather than exact keywords. This is particularly useful for finding content by theme or topic when you don't know the exact Japanese title.
- Enter your Jina API key in Settings (free tier available)
- Click the search icon in the header to open the Semantic Search dialog
- Type your query in any language
- Results are ranked by semantic similarity to your query
The vector index builds automatically in the background. You can also trigger manual indexing from the dialog. Embeddings are stored locally in IndexedDB and persist across sessions.
Advanced Search
The Advanced Search panel provides structured filtering:
- Tags: Filter by content tags with AND/OR logic
- Circle: Search by creator/circle name
- Voice Actor: Filter by voice actor
- Date Range: Limit results to a time period
- Rating & Price: Set minimum rating or price filters
Search history is saved for quick re-use of frequent queries.
Tag Filters
Click any tag on a work card or detail page to instantly filter by it. Active filters appear as removable chips and persist across navigation. Multi-tag filtering is supported, so you can click additional tags to narrow results.
Playback Modes
Radio Mode
Radio Mode provides continuous shuffled playback across your library, ideal for extended immersion sessions.
- Toggle Radio Mode from the sidebar menu
- Playback begins automatically with a randomly selected voicework
- When one work finishes, the next is selected randomly
- A short history buffer prevents frequent repetition
- Health-checking and auto-recovery keep the stream running through interruptions
Playback state persists across page refreshes. Radio Mode is mutually exclusive with Playlist Mode.
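The "short history buffer" behavior can be sketched as random selection that excludes the most recently played works. Buffer size and names are assumptions for illustration:

```javascript
// Sketch of shuffle-with-history: pick a random work while excluding the
// last few played, falling back to the full library if it is too small.
function pickNext(workIds, history, bufferSize = 5) {
  const recent = new Set(history.slice(-bufferSize));
  const candidates = workIds.filter(id => !recent.has(id));
  const pool = candidates.length > 0 ? candidates : workIds;
  return pool[Math.floor(Math.random() * pool.length)];
}
```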
Playlist Mode
Playlist Mode enables sequential playback through curated collections.
- Open the Playlist Discovery panel from the sidebar
- Browse or search community-curated playlists
- Activate a playlist to begin sequential playback
- Use the forward/back controls in the player bar to navigate between works
Playlists auto-advance to the next voicework when the current one finishes.
Media Viewer
The Media Viewer provides a lightbox gallery for images and video files bundled with voiceworks.
Supported Formats
- Images: JPG, PNG, GIF, WebP
- Video: MP4, WebM, MOV, AVI, MKV
- Documents: PDF, TXT, SRT
Controls
- Click any media file to open the lightbox
- Use arrow keys or swipe to navigate between files
- Press ESC to close
- Enable slideshow mode for auto-advance
Keyboard Shortcuts
All shortcuts are disabled when focus is in a text input field.
| Key | Action | Context |
|---|---|---|
| Space / K | Play / Pause | Player |
| M | Mute / Unmute | Player |
| F | Fullscreen toggle | Player |
| Left Arrow | Seek back 5 seconds | Player |
| Right Arrow | Seek forward 5 seconds | Player |
| Shift + Left | Seek back 30 seconds | Player |
| Shift + Right | Seek forward 30 seconds | Player |
| Up Arrow | Volume up 5% | Player |
| Down Arrow | Volume down 5% | Player |
| [ | Decrease playback speed | Player |
| ] | Increase playback speed | Player |
| 0–9 | Jump to 0%–90% of track | Player |
| B | Toggle English subtitle blur | Learner Mode |
| J | Toggle Japanese subtitles | Learner Mode |
| ESC | Close lightbox / Exit fullscreen | Media Viewer |
| Left / Right | Previous / Next image | Media Viewer |
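The "disabled in text inputs" rule is the standard editable-target guard run before any shortcut is handled. A minimal sketch (the exact tags checked are an assumption):

```javascript
// Illustrative guard: ignore shortcuts when the key event targets an
// editable element such as an input, textarea, or contenteditable node.
function isTypingTarget(el) {
  if (!el) return false;
  const tag = (el.tagName || "").toUpperCase();
  return tag === "INPUT" || tag === "TEXTAREA" || el.isContentEditable === true;
}
```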
Backup & Restore
Export all your settings and preferences to a JSON file for safekeeping or transfer to another browser.
Export
- Open Settings
- Click Export Settings
- Save the downloaded JSON file
Import
- Open Settings
- Click Import Settings
- Select your previously exported JSON file
- Settings are applied immediately
Note: The export includes preferences, feature toggles, and configuration values. It does not include cached data (transcripts, translations, embeddings) as these are regenerated automatically.
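An export/import round trip might look like the sketch below. The envelope keys (`version`, `exportedAt`, `settings`) are a hypothetical schema for illustration, not the toolkit's exact file format:

```javascript
// Hypothetical export envelope and a defensive import check.
function exportSettings(settings) {
  return JSON.stringify(
    { version: 1, exportedAt: new Date().toISOString(), settings },
    null,
    2
  );
}

function importSettings(json) {
  const data = JSON.parse(json);
  if (typeof data !== "object" || data === null || !data.settings) {
    throw new Error("Not a valid settings export");
  }
  return data.settings;
}
```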
FAQ
Does the toolkit send my data to external servers?
Whisper transcription runs on-device, while translation and semantic embeddings use external APIs. In practice, external calls include Google Translate endpoints for translation and Jina embeddings for semantic search when you provide an API key. Results are cached locally in your browser.
What browsers are supported?
Any modern Chromium-based browser (Chrome, Edge, Brave) or Firefox with Tampermonkey installed. WebGPU support (Chrome 113+, Edge 113+) is recommended for best Whisper performance, but the toolkit falls back to WASM automatically.
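The WebGPU-or-WASM fallback boils down to standard feature detection on `navigator.gpu`. A minimal sketch (the backend labels are illustrative):

```javascript
// Sketch of backend selection: prefer WebGPU when the browser exposes it,
// otherwise fall back to the WASM path.
function pickBackend(nav) {
  return nav && "gpu" in nav ? "webgpu" : "wasm";
}
```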
How much disk space do AI assets use?
The Whisper model is the main local download, roughly 150 MB depending on model size and quantization. Translation does not require a local model download. Cached transcripts, translations, and embeddings also accumulate in browser storage over time.
Can I use custom Whisper models?
Yes. The model selection field in Settings accepts arbitrary Hugging Face model IDs. Type a custom model ID (e.g., from the onnx-community namespace) and the worker will attempt to load it. Suggested models are provided in a dropdown for convenience.
Why are translations slow on first load?
First requests may be slower because the service is warming caches and making fresh network calls. After that, shared translation caching makes repeated text much faster.
How do I report bugs or request features?
Open an issue on the GitHub Issues page. Include your browser version, GPU model (if AI-related), and steps to reproduce.