
youtube-parser

yt-parser is a YouTube video parser that extracts transcripts from individual videos, playlists, or entire channels. Given a video, playlist, or channel URL, it generates an English transcript for each video regardless of the video's original language. The transcripts are split into chunks and sent as JSON to both Minio and a webhook so that downstream applications can consume them. Bharat Sahai Yak uses this parsed content to power a chatbot that answers questions based on the extracted material.

Working and architecture

  1. User Input: User submits a URL and URL type to the system.

  2. Generate Video IDs: Video IDs are generated for all videos in the playlist, channel, or for a single video.

  3. Retrieve Transcripts: Video IDs are used to retrieve the YouTube-generated transcripts from the YouTube Data API.

  4. Chunk Transcripts (if needed): Transcripts are chunked into segments of 4 minutes.

  5. Translate Non-English Transcripts: Non-English transcripts are translated to English if available.

  6. Send Transcript Chunks via Webhook: Each transcript chunk is sent via webhook to the designated receiver in a predefined format.

  7. Store Transcript Chunks to Minio: Transcript chunks are saved to Minio in both JSON and CSV formats.

  8. Invoke ASR Way (if English transcript unavailable):

    • Download and Split Video Audio: Video audio is downloaded and split into 4-minute chunks.
    • Process Audio using ASR: Each audio chunk is processed using ASR services.
    • Send ASR Results via Webhook: ASR results are sent via webhook to the receiver.
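
    The chunking in step 4 can be illustrated with a small sketch. This is not the project's actual code: the function name chunk_transcript and the entry shape {"text", "start"} are assumptions made for illustration, and only the 4-minute (240-second) window comes from the steps above.

```python
from typing import Dict, List

CHUNK_SECONDS = 240  # 4 minutes, as described in step 4


def chunk_transcript(entries: List[Dict], chunk_seconds: int = CHUNK_SECONDS) -> List[Dict]:
    """Group transcript entries into fixed-length time windows.

    Each entry is assumed to look like {"text": str, "start": float}
    (a hypothetical shape, not confirmed by the project).
    """
    buckets: Dict[int, List[str]] = {}
    for entry in entries:
        # Window index: 0 for 0-239 s, 1 for 240-479 s, and so on.
        index = int(entry["start"] // chunk_seconds)
        buckets.setdefault(index, []).append(entry["text"])
    return [
        {
            "start": index * chunk_seconds,
            "end": (index + 1) * chunk_seconds,
            "text": " ".join(texts),
        }
        for index, texts in sorted(buckets.items())
    ]


entries = [
    {"text": "Hello", "start": 0.0},
    {"text": "world", "start": 12.5},
    {"text": "Next part", "start": 250.0},
]
chunks = chunk_transcript(entries)
```

    Bucketing by start time keeps each entry whole; an entry that straddles a window boundary stays in the window where it begins.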

    Architecture Diagram:


Setup - Docker Way

To set up yt-parser using Docker, follow these steps:

  1. Install Docker: Ensure Docker is installed on your system. You can download and install it from the official Docker website.

  2. Build Docker Image:

    docker build -t yt-parser:latest .
  3. Run Docker Compose:

    docker-compose up -d
  4. yt-parser should now be running and accessible at http://localhost:8000

Setup - Manual Way

For manual setup, follow these instructions:

  1. Install Dependencies: Ensure you have Python 3.10 installed on your system. Additionally, install ffmpeg:
    sudo apt-get update
    sudo apt-get install ffmpeg

  2. Create and Activate a Virtual Environment:

    python3 -m venv venv 
    source venv/bin/activate
  3. Install Python Dependencies: Install Python dependencies using Poetry:

    poetry install
  4. Set Environment Variables: Create a .env file based on the sample environment file (sample.env) provided in the repository.

    Example:

    cp sample.env .env
  5. Start Redis Server: Start the Redis server. You can use Docker or install Redis locally.

  6. Run yt-parser: Start the yt-parser application:

    uvicorn main:app --host 0.0.0.0 --port 8000
  7. Run Celery Worker: Start the Celery worker:

    celery -A worker worker --loglevel=info
  8. Access yt-parser: yt-parser should now be running and accessible at http://localhost:8000.

API Endpoints

1. /download

GET /download

Retrieves download links for processed transcript files.

Query Parameters

| Name   | Type   | Required | Description  |
|--------|--------|----------|--------------|
| taskId | String | YES      | The task ID. |

Curl Request

curl "http://localhost:8000/download?taskId=<TASK_ID_HERE>"

Response Format

  • Success:
    {
      "error": false,
      "data": {
        "batchId": "<BATCH_ID_HERE>",
        "taskId": "<TASK_ID_HERE>",
        "url": "<URL_HERE>",
        "files": {
          "<TASK_ID_HERE>": [
            {
              "start": "0",
              "end": "50",
              "url": {
                "md": "<URL_MD_HERE>"
              }
            }
          ]
        }
      }
    }
  • Failure:
    {
      "error": true,
      "message": "Error message"
    }
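
As a rough illustration of consuming this endpoint from Python, the sketch below builds the request URL and flattens the "files" mapping shown in the success response. The helper names (build_download_url, list_file_urls, fetch_download_links) are hypothetical, and the base URL assumes the default local setup described earlier.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the setup section


def build_download_url(task_id: str, base_url: str = BASE_URL) -> str:
    """Return the /download URL with the taskId query parameter encoded."""
    query = urllib.parse.urlencode({"taskId": task_id})
    return f"{base_url}/download?{query}"


def list_file_urls(response: dict) -> list:
    """Flatten the "files" mapping into (task_id, start, end, md_url) tuples."""
    if response.get("error"):
        raise RuntimeError(response.get("message", "unknown error"))
    rows = []
    for task_id, files in response["data"]["files"].items():
        for item in files:
            rows.append((task_id, item["start"], item["end"], item["url"]["md"]))
    return rows


def fetch_download_links(task_id: str) -> list:
    """Call the running service; requires yt-parser to be reachable."""
    with urllib.request.urlopen(build_download_url(task_id)) as resp:
        return list_file_urls(json.load(resp))
```

Only fetch_download_links touches the network; the other two helpers can be exercised against a saved response.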

2. /process

POST /process

Initiates transcript processing for the provided YouTube URL.

Request Body

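| Name       | Type   | Required | Description                                                                   |
|------------|--------|----------|-------------------------------------------------------------------------------|
| url        | String | YES      | The YouTube URL (video, playlist, or channel).                                |
| type       | String | YES      | The type of URL (valid types: "video", "playlist", "channel").                |
| webhookUrl | String | NO       | The webhook URL to receive notifications (overrides the environment variable). |
| language   | String | NO       | The language of the video, used for transcription.                            |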

Curl Request

curl -X POST -H "Content-Type: application/json" -d '{"url": "<URL_HERE>", "type": "<TYPE_HERE>"}' http://localhost:8000/process

Response Format

  • Success:
    {
      "error": false,
      "message": "Task initiated successfully",
      "data": {
        "batchId": "batch_id",
        "taskId": ["task_id_1", "task_id_2", ...]
      }
    }
  • Failure:
    {
      "error": true,
      "message": "Error message",
      "data": {}
    }
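
The request body described above can be assembled and posted from Python roughly as follows. This is a sketch under assumptions: build_process_body and start_processing are hypothetical helper names, the client-side check on "type" simply mirrors the valid values listed in the table, and the base URL is the default local address.

```python
import json
import urllib.request
from typing import Optional

VALID_TYPES = {"video", "playlist", "channel"}  # values listed in the table above


def build_process_body(
    url: str,
    url_type: str,
    webhook_url: Optional[str] = None,
    language: Optional[str] = None,
) -> dict:
    """Build the JSON body for POST /process, omitting unset optional fields."""
    if url_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    body = {"url": url, "type": url_type}
    if webhook_url is not None:
        body["webhookUrl"] = webhook_url
    if language is not None:
        body["language"] = language
    return body


def start_processing(body: dict, base_url: str = "http://localhost:8000") -> dict:
    """Send the body to the running service and return the decoded response."""
    request = urllib.request.Request(
        f"{base_url}/process",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

On success, the returned "data" carries the batchId and the list of taskId values needed for the /status and /download endpoints.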

3. /status

GET /status

Retrieves the status of the processing tasks.

Query Parameters

| Name    | Type   | Required | Description   |
|---------|--------|----------|---------------|
| taskId  | String | NO       | The task ID.  |
| batchId | String | NO       | The batch ID. |

Curl Request

# If you have taskId
curl "http://localhost:8000/status?taskId=<TASK_ID_HERE>"

# If you have batchId
curl "http://localhost:8000/status?batchId=<BATCH_ID_HERE>"

# If you have both taskId and batchId (quote the URL so the shell does not interpret "&")
curl "http://localhost:8000/status?taskId=<TASK_ID_HERE>&batchId=<BATCH_ID_HERE>"

Response Format


  • taskId and batchId provided:
    {
      "error": false,
      "message": "Transcription is in progress.",
      "data": {
        "batchId": "<batch_id>",
        "taskId": "<task_id>",
        "taskStatus": {
          "percent": 50,
          "total": 10,
          "stage": ""
        },
        "batchStatus": {
          "percent": 20,
          "total": 5
        }
      }
    }
  • batchId provided:
    {
      "error": false,
      "message": "Transcription is in progress.",
      "data": {
        "batchId": "<batch_id>",
        "batchStatus": {
          "percent": 20,
          "total": 5
        }
      }
    }
  • taskId provided:
    {
      "error": false,
      "message": "Transcription is in progress.",
      "data": {
        "batchId": "<batch_id>",
        "taskId": "<task_id>",
        "taskStatus": {
          "percent": 50,
          "total": 10,
          "stage": ""
        }
      }
    }
  • Failure:
    {
      "error": true,
      "message": "Error message"
    }
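
The three response shapes above differ only in which of taskStatus and batchStatus are present, so a client can handle them with one function. The sketch below is illustrative: build_status_url and summarize_status are hypothetical names, and rejecting a call with neither ID is an assumption of this sketch rather than documented API behavior.

```python
import urllib.parse
from typing import Optional


def build_status_url(
    task_id: Optional[str] = None,
    batch_id: Optional[str] = None,
    base_url: str = "http://localhost:8000",
) -> str:
    """Build the /status URL from whichever IDs are available."""
    params = {}
    if task_id:
        params["taskId"] = task_id
    if batch_id:
        params["batchId"] = batch_id
    if not params:
        # Assumption: a status query without any ID is not meaningful.
        raise ValueError("provide taskId, batchId, or both")
    return f"{base_url}/status?{urllib.parse.urlencode(params)}"


def summarize_status(response: dict) -> str:
    """Turn any of the documented /status responses into a one-line summary."""
    if response.get("error"):
        return f"error: {response.get('message', '')}"
    data = response["data"]
    parts = []
    if "taskStatus" in data:
        parts.append(f"task {data['taskId']}: {data['taskStatus']['percent']}%")
    if "batchStatus" in data:
        parts.append(f"batch {data['batchId']}: {data['batchStatus']['percent']}%")
    return ", ".join(parts)
```

Because urlencode escapes the query string, the resulting URL is also safe to paste into a shell when quoted.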

Webhook Format

The webhook is triggered on the following events:

  • A chunk is created successfully (either through the API or through explicit transcription).
  • An error occurs after the taskId is generated.

Webhook payload format:

  • Success:

    {
      "error": false,
      "data": {
        "batchId": "<batch_id>",
        "taskId": "<task_id>",
        "url": "<video url>",
        "chunks": [
          {
            "content": {
              "heading": "",
              "text": "Hey this is the transcript"
            }
          }
        ],
        "metadata": {
          "start": 0,
          "end": 240
        },
        "taskStatus": {"percent": 100, "stage": "Transcribing", "total": 4},
        "batchStatus": {"percent": 66, "total": 36}
      },
      "message": ""
    }
  • Failure:

    {
      "error": true,
      "data": {
        "batchId": "<batch_id>",
        "taskId": "<task_id>",
        "url": "<video url>",
        "chunks": [],
        "metadata": {},
        "taskStatus": {"percent": 100, "stage": "Transcribing", "total": 4},
        "batchStatus": {"percent": 66, "total": 36}
      },
      "message": "Error: "
    }
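
A receiver for these payloads could look roughly like the sketch below, which uses only the Python standard library. Everything here is an assumption for illustration: the names handle_webhook, WebhookHandler, and serve, the port 9000, and the choice to reply with HTTP 200 are not specified by yt-parser.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_webhook(payload: dict) -> dict:
    """Extract the fields a downstream consumer is likely to need."""
    data = payload.get("data", {})
    metadata = data.get("metadata", {})
    return {
        "ok": not payload.get("error", False),
        "taskId": data.get("taskId"),
        "batchId": data.get("batchId"),
        "message": payload.get("message", ""),
        # Failure payloads carry an empty "chunks" list, so this is [] for them.
        "texts": [chunk["content"]["text"] for chunk in data.get("chunks", [])],
        "start": metadata.get("start"),
        "end": metadata.get("end"),
    }


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the sender.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        print(handle_webhook(payload))
        self.send_response(200)  # assumed acknowledgement; not documented
        self.end_headers()


def serve(port: int = 9000) -> None:
    """Run the receiver until interrupted (hypothetical port)."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

To try it, call serve() in a separate process and pass the receiver's address as webhookUrl in the /process request body.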