# Ask Annie — ST Best Practices Session Ingestion

Ingestion pipeline for Axway MFT User Group "Ask Annie" Q&A sessions on Vimeo.

## What it does

1. Downloads audio from a Vimeo URL via yt-dlp
2. Transcribes with Whisper (timestamped segments)
3. Slices transcript into per-chapter chunks using a chapters JSON file
4. Optionally extracts frames from demo-heavy chapters for vision annotation
5. Outputs `chunks.json` ready for ingestion into knowledge-mcp
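Step 3, assigning timestamped Whisper segments to chapters, can be sketched roughly as follows. This is an illustration only, not the actual `ingest.py` logic, and the field names (`start`, `text`, chapter `title`) are assumptions about the transcript and chapter schemas:

```python
def slice_into_chunks(segments, chapters):
    """Assign each transcript segment to the last chapter starting at or before it.

    segments: Whisper-style list of {"start": sec, "end": sec, "text": str}
    chapters: list of {"start": sec, "title": str}  (assumed schema)
    """
    chunks = [{"title": ch["title"], "start": ch["start"], "text": []} for ch in chapters]
    for seg in segments:
        # Pick the chapter with the latest start time not after this segment.
        owner = max(
            (c for c in chunks if c["start"] <= seg["start"]),
            key=lambda c: c["start"],
            default=chunks[0],
        )
        owner["text"].append(seg["text"])
    # Join each chapter's segment texts into one chunk string.
    return [{**c, "text": " ".join(c["text"]).strip()} for c in chunks]

segments = [
    {"start": 0.0, "end": 12.4, "text": "Welcome to Ask Annie."},
    {"start": 12.4, "end": 30.1, "text": "Today we cover transfer sites."},
]
chapters = [{"start": 0, "title": "Intro"}, {"start": 12, "title": "Transfer sites"}]
result = slice_into_chunks(segments, chapters)
```

A segment straddling a chapter boundary lands in the earlier chapter here; the real pipeline may split or duplicate such segments instead.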
## Usage

```bash
python3 ingest.py \
  --url 'https://vimeo.com/1020102626' \
  --chapters chapters/1020102626.json \
  --out ./out \
  --whisper-model medium
```

Add `--frames` to also extract video frames for demo chapters (requires video download).

## Dependencies

```bash
brew install yt-dlp ffmpeg
pip install openai-whisper
```

## Repo structure

```
ingest.py                 # Main pipeline script
chapters/<video_id>.json  # Chapter list per session
out/<video_id>/           # Output (gitignored)
  audio.mp3
  transcript.json
  chunks.json
  frames/
```

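The exact shape of `chunks.json` is defined by `ingest.py`; as a purely illustrative sketch (every field name below is an assumption, not the actual schema), a chunk entry might look like:

```json
[
  {
    "video_id": "1020102626",
    "chapter": "Transfer sites",
    "start": 752.0,
    "end": 1104.5,
    "text": "…chapter transcript…"
  }
]
```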
## Adding a new session
1. Create `chapters/<video_id>.json` with timestamp + title + summary per chapter
2. Run `ingest.py --url <vimeo_url> --chapters chapters/<video_id>.json`
3. Review `out/<video_id>/chunks.json`
4. Ingest chunks into knowledge-mcp notebook `securetransport-md`
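A chapters file needs a timestamp, title, and summary per chapter (step 1 above). One possible shape, where the field names are assumptions and should be matched to whatever schema `ingest.py` actually reads:

```json
[
  {"start": "00:00:00", "title": "Intro", "summary": "Welcome and agenda."},
  {"start": "00:12:32", "title": "Transfer sites", "summary": "Configuring transfer sites."}
]
```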