Insight Meditation Center Talks

This project downloads video transcripts from YouTube and scrapes talk metadata from audiodharma.org. It processes the raw transcripts with a generative AI model to produce cleaned, formatted, and enriched markdown files suitable for a personal knowledge base.

The transcription articles generated by this project are browsable via a generated HTML interface here.

TODO

How It Works

The project has two main data sources that it synthesizes:

  1. audiodharma.org: The script scrapes this site to build a local cache of talk metadata (titles, speakers, series) and speaker information. This is considered the primary source of truth for talk details.
  2. YouTube: The script uses the Insight Meditation Center’s YouTube channel to fetch video metadata and, most importantly, the raw video transcripts.

The core logic, executed by the scrape_and_generate command, links a YouTube video to its corresponding talk on audiodharma.org to create a single, enriched document.
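As an illustration of the linking step, here is a minimal sketch that matches a YouTube video to a cached talk by normalized title. The `Talk` record, the `normalize_title` and `link_video_to_talk` helpers, and the title-containment rule are all assumptions made for this example; the matching logic actually used by download.py may differ.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Talk:
    # Hypothetical shape of a cached audiodharma.org record.
    talk_id: int
    title: str
    speaker: str


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def link_video_to_talk(video_title: str, talks: List[Talk]) -> Optional[Talk]:
    """Return the talk whose normalized title appears in the video title."""
    video_key = normalize_title(video_title)
    candidates = [
        talk for talk in talks
        if normalize_title(talk.title) and normalize_title(talk.title) in video_key
    ]
    # Prefer the longest matching title so short, generic titles do not win.
    return max(candidates, key=lambda talk: len(talk.title), default=None)
```

A plain substring match like this can produce false positives on very short titles, so a real implementation would likely also compare dates or speakers before accepting a link.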

The process runs daily as a GitHub Action that scrapes new content, generates the articles, and pushes the results back to this repository.

Installation

Set up a Python virtual environment and install the required dependencies.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

You must also have the gemini-cli tool installed, configured, and available on your system’s PATH.
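To confirm that a tool is reachable on PATH before running the pipeline, a small check like the following may help. The `find_executable` helper is hypothetical, and the executable name `gemini` in the usage note is an assumption; substitute whatever name your installation provides.

```python
import shutil
from typing import Optional


def find_executable(name: str) -> Optional[str]:
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)
```

For example, `find_executable("gemini")` returns a path when the tool is installed under that name and `None` otherwise.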

Script Usage

The main script, download.py, is organized into three subcommands.
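A command layout like this is commonly built with `argparse` subparsers. The sketch below is an assumption about how such a parser could be wired, not the actual contents of download.py. The subcommand names and the `video-id` form come from this README; the help strings, the `dest` names, and the `video_id` argument name are hypothetical, and no options are shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with three top-level subcommands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="download.py")
    commands = parser.add_subparsers(dest="command", required=True)

    commands.add_parser("scrape_and_generate", help="Run the full pipeline")
    commands.add_parser("audiodharma", help="Update the local metadata cache")

    # The youtube command has its own nested subcommands, such as video-id.
    youtube = commands.add_parser("youtube", help="Process YouTube videos directly")
    youtube_commands = youtube.add_subparsers(dest="youtube_command", required=True)
    video = youtube_commands.add_parser("video-id", help="Process a single video")
    video.add_argument("video_id", help="YouTube video ID")

    return parser
```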

1. scrape_and_generate (Primary Command)

This is the main command for daily use. It automates the entire pipeline: scraping audiodharma.org for the latest data, processing the corresponding YouTube videos, and generating the final HTML interface.

python download.py scrape_and_generate [options]

By default, this command uses the audiodharma.org scrape as the source of truth for which videos to process. It processes only new or updated talks.
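One way to detect new or updated talks is to compare a fingerprint of each scraped record against the fingerprint stored from a previous run. The sketch below assumes talks are plain dictionaries keyed by an ID string; the `fingerprint` and `select_talks_to_process` helpers and the hashing approach are hypothetical and may not reflect how download.py tracks changes.

```python
import hashlib
import json
from typing import Dict, List


def fingerprint(talk: dict) -> str:
    """Hash a talk record deterministically so changes can be detected."""
    payload = json.dumps(talk, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def select_talks_to_process(
    talks: Dict[str, dict], seen: Dict[str, str]
) -> List[str]:
    """Return IDs of talks that are new or whose content changed since last run."""
    return [
        talk_id
        for talk_id, talk in talks.items()
        if seen.get(talk_id) != fingerprint(talk)
    ]
```

Talks whose stored fingerprint matches the current record are skipped, so repeated runs only touch records that are missing from `seen` or have changed.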

Options:

2. audiodharma

This command only scrapes audiodharma.org to update the local data cache.

python download.py audiodharma [options]

Options:

3. youtube

This command processes YouTube videos directly, without touching the audiodharma.org data. It’s useful for processing single videos or specific playlists.

Subcommands:

Example:

# Process a single video by its YouTube ID
python download.py youtube video-id "dQw4w9WgXcQ"

Project Structure