This project downloads video transcripts from YouTube and scrapes talk metadata from audiodharma.org. It processes the raw transcripts using a Generative AI to produce cleaned, formatted, and enriched markdown files suitable for a personal knowledge base.
The transcript articles generated by this project are browsable through a generated HTML interface.
The project has two main data sources that it synthesizes:
- **audiodharma.org**: The script scrapes this site to build a local cache of talk metadata (titles, speakers, series) and speaker information. This is the primary source of truth for talk details.
- **YouTube**: The script downloads the raw video transcripts for each talk.

The core logic, executed by the `scrape_and_generate` command, links a YouTube video to its corresponding talk on audiodharma.org to create a single, enriched document.
This process is fully automated to run daily via a GitHub Action, which scrapes for new content, generates the articles, and pushes the results back to this repository.
Set up a Python virtual environment and install the required dependencies.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
You must also have the gemini-cli tool installed and configured in your system’s PATH.
The main script, download.py, is organized into three subcommands.
### scrape_and_generate (Primary Command)

This is the main command for daily use. It automates the entire pipeline: scraping audiodharma.org for the latest data, processing the corresponding YouTube videos, and generating the final HTML interface.
python download.py scrape_and_generate [options]
By default, this command uses the audiodharma.org scrape as the source of truth for which videos to process. It efficiently scans and only processes new or updated talks.
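The incremental scan can be illustrated with a small sketch. The threshold and helper names here are hypothetical, not taken from the actual script; the point is the early-stop behavior that `--do-not-stop-scan` disables:

```python
def scan_videos(videos, is_up_to_date, process, stop_after=5, stop_scan=True):
    """Process videos in order, stopping after a run of already-up-to-date items.

    `stop_after` is a hypothetical threshold: once that many consecutive
    videos are found up to date, the rest of the list is assumed unchanged.
    """
    consecutive_existing = 0
    processed = []
    for video in videos:
        if is_up_to_date(video):
            consecutive_existing += 1
            if stop_scan and consecutive_existing >= stop_after:
                break  # a run of existing articles: assume the rest are done
            continue
        consecutive_existing = 0  # reset on any new or updated talk
        process(video)
        processed.append(video)
    return processed
```

Passing `stop_scan=False` emulates `--do-not-stop-scan`: every video in the list is checked rather than stopping at the first run of up-to-date articles.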
Options:
- `--source [audiodharma|imc-playlist]`: Determines the source of truth for which videos to process.
  - `audiodharma` (default): Processes videos found in the audiodharma.org scrape. This is the most reliable method.
  - `imc-playlist`: Processes videos directly from the IMC YouTube channel's "uploads" playlist. This can be used to find videos not yet listed on the website.
- `--limit <number>`: Limit the number of videos to process. Default is 0 (no limit).
- `--force-ai-processing`: Force the AI processing step, even if the final markdown file already exists.
- `--force-redownload-transcripts`: Force re-downloading of raw transcripts, even if they are already cached.
- `--do-not-stop-scan`: By default, the script stops when it finds a series of up-to-date, existing articles. Use this flag to continue processing all videos in the list.

### audiodharma

This command only scrapes audiodharma.org to update the local data cache.
python download.py audiodharma [options]
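The page loop with periodic cache saves (controlled by the options listed below) might look roughly like this. The fetch and save helpers are hypothetical stand-ins for the real module functions:

```python
def scrape_pages(fetch_page, save_cache, start_page=1, max_pages=3, save_after_pages=2):
    """Scrape pages sequentially, saving the accumulated cache every N pages."""
    cache = []
    saves = 0
    for offset in range(max_pages):
        page = start_page + offset
        talks = fetch_page(page)
        if not talks:
            break  # ran out of pages
        cache.extend(talks)
        if (offset + 1) % save_after_pages == 0:
            save_cache(cache)  # periodic save so a crash loses little work
            saves += 1
    save_cache(cache)  # final save
    return cache, saves
```

Saving every few pages trades a little I/O for resilience: if a long scrape is interrupted, the cache still holds everything up to the last checkpoint.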
Options:
- `--start_page <number>`: The page number to start scraping from.
- `--max_pages <number>`: The maximum number of pages to scrape.
- `--save-after-pages <number>`: How frequently to save the cache files during the scrape.

### youtube

This command processes YouTube videos directly, without touching the audiodharma.org data. It's useful for processing single videos or specific playlists.
Subcommands:
- `video-id <id>`: Process a single video by its ID.
- `channel-url <url>`: Process videos from a channel URL.
- `playlist-id <id>`: Process videos from a playlist ID.

Example:
# Process a single video by its YouTube ID
python download.py youtube video-id "dQw4w9WgXcQ"
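A subcommand layout like the one above is typically built with `argparse` sub-parsers. This is a minimal, self-contained sketch of the `youtube` branch only; it mirrors the documented interface but is not the actual `download.py` implementation:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Mirror the `youtube` subcommand structure with argparse sub-parsers."""
    parser = argparse.ArgumentParser(prog="download.py")
    sub = parser.add_subparsers(dest="command", required=True)

    youtube = sub.add_parser("youtube", help="Process YouTube videos directly")
    yt_sub = youtube.add_subparsers(dest="youtube_command", required=True)
    yt_sub.add_parser("video-id").add_argument("id")
    yt_sub.add_parser("channel-url").add_argument("url")
    yt_sub.add_parser("playlist-id").add_argument("id")
    return parser

# Parse the example invocation above.
args = build_parser().parse_args(["youtube", "video-id", "dQw4w9WgXcQ"])
```

Nesting sub-parsers this way gives free `--help` output and argument validation for each subcommand.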
- `download.py`: The main script orchestrating all operations.
- `youtube.py`, `audiodharma.py`, `ai.py`, `article.py`, `cache.py`, `filesystem.py`: Modules containing the core logic for interacting with services, processing data, and managing files.
- `generate_html.py`: Script to generate the HTML interface for browsing talks.
- `.github/workflows/scrape-process-publish.yml`: The GitHub Action definition for daily execution.
- `cache/`: Contains all cached data, including scraped website info, YouTube metadata, and raw transcripts.
- `talks/`: Stores the final, AI-cleaned and formatted markdown files, along with the generated HTML interface.