youtube-parser

yt-parser is a YouTube video parser that extracts transcripts from individual videos, playlists, or entire channels. Given a video, playlist, or channel URL, it generates an English transcript for each video regardless of the video's original language. The transcripts are split into segments and delivered as JSON to both Minio and a webhook, so downstream applications can consume them directly. Bharat Sahai Yak uses this parsed content to power its chatbot.

Working and architecture

User Input: User submits a URL and URL type to the system.
Generate Video IDs: Video IDs are generated for all videos in the playlist, channel, or for a single video.
Retrieve Transcripts: Video IDs are used to retrieve the YouTube-generated transcripts from the YouTube Data API.
Chunk Transcripts (if needed): Transcripts are chunked into segments of 4 minutes.
Translate Non-English Transcripts: Non-English transcripts are translated to English when a translation is available.
Send Transcript Chunks via Webhook: Each transcript chunk is sent via webhook to the designated receiver in a predefined format.
Store Transcript Chunks to Minio: Transcript chunks are saved to Minio in both JSON and CSV formats.
Invoke ASR (if no English transcript is available):
- Download and Split Video Audio: Video audio is downloaded and split into 4-minute chunks.
- Process Audio using ASR: Each audio chunk is processed using ASR services.
- Send ASR Results via Webhook: ASR results are sent via webhook to the receiver.
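
The chunking step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual implementation: the function name `chunk_transcript` is hypothetical, and each transcript entry is assumed to be a dict with a `text` key and a `start` key (in seconds).

```python
from typing import Dict, List

CHUNK_SECONDS = 240  # 4 minutes, as described in the steps above


def chunk_transcript(entries: List[Dict], chunk_seconds: int = CHUNK_SECONDS) -> List[Dict]:
    """Group transcript entries into fixed-length time windows.

    Assumed entry shape (hypothetical): {"text": str, "start": float seconds}.
    """
    buckets: Dict[int, List[str]] = {}
    for entry in entries:
        # Index of the time window this entry falls into.
        index = int(entry["start"] // chunk_seconds)
        buckets.setdefault(index, []).append(entry["text"].strip())

    chunks = []
    for index in sorted(buckets):
        chunks.append({
            "start": index * chunk_seconds,
            "end": (index + 1) * chunk_seconds,
            "text": " ".join(buckets[index]),
        })
    return chunks
```

With this sketch, entries starting at 0 s and 120.5 s land in the 0-240 chunk, while an entry starting at 250 s lands in the 240-480 chunk. Windows with no entries are simply skipped.
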
Architecture Diagram:

[Image: architecture diagram]

Setup - Docker Way

To set up yt-parser using Docker, follow these steps:

Install Docker: Ensure Docker is installed on your system. You can download and install it from the official Docker website.
Build Docker Image:
```
docker build -t yt-parser:latest .
```
Run Docker Compose:
```
docker-compose up -d
```
yt-parser should now be running and accessible at http://localhost:8000

Setup - Manual Way

For manual setup, follow these instructions:

Install Dependencies: Ensure you have Python 3.10 installed on your system. Additionally, install ffmpeg:
```
sudo apt-get update
sudo apt-get install ffmpeg
```

Create and Activate a Virtual Environment: Create a Python virtual environment and activate it:
```
python3 -m venv venv
source venv/bin/activate
```

Install Python Dependencies: Install Python dependencies using Poetry:
```
poetry install
```
Set Environment Variables: Create a .env file based on the sample environment file (sample.env) provided in the repository.
Example:
```
cp sample.env .env
```
Start Redis Server: Start the Redis server. You can use Docker or install Redis locally.
Run yt-parser: Start the yt-parser application:
```
uvicorn main:app --host 0.0.0.0 --port 8000
```
Run Celery Worker: Start the Celery worker:
```
celery -A worker worker --loglevel=info
```
Access yt-parser: yt-parser should now be running and accessible at http://localhost:8000.

API Endpoints

1. /download

GET

/download Retrieves download links for processed transcript files.

Query Parameters

Name	Type	Required	Description
taskId	String	YES	The task ID.

Curl Request

curl "http://localhost:8000/download?taskId=<TASK_ID_HERE>"

Response Format

Success:

{
  "error": false,
  "data": {
    "batchId": "<BATCH_ID_HERE>",
    "taskId": "<TASK_ID_HERE>",
    "url": "<URL_HERE>",
    "files": {
      "<TASK_ID_HERE>": [
        {
          "start": "0",
          "end": "50",
          "url": {
            "md": "<URL_MD_HERE>"
          }
        }
      ]
    }
  }
}

Failure:

{
  "error": true,
  "message": "Error message"
}
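
A client for this endpoint might look like the sketch below. It is only an illustration: the helper names (`build_download_url`, `extract_file_links`, `fetch_download_links`) are hypothetical, and the parsing assumes the success response has exactly the shape shown above, with the link stored under `url.md`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_download_url(task_id: str, base_url: str = "http://localhost:8000") -> str:
    """Build the /download URL with a safely encoded taskId query parameter."""
    return f"{base_url}/download?{urlencode({'taskId': task_id})}"


def extract_file_links(payload: dict) -> list:
    """Flatten the `files` mapping of a success response into (start, end, url) tuples."""
    if payload.get("error"):
        raise RuntimeError(payload.get("message", "unknown error"))
    links = []
    for segments in payload["data"]["files"].values():
        for segment in segments:
            # The sample response exposes the link under url["md"].
            links.append((segment["start"], segment["end"], segment["url"]["md"]))
    return links


def fetch_download_links(task_id: str) -> list:
    """Call the endpoint; requires a running yt-parser instance."""
    with urlopen(build_download_url(task_id)) as response:
        return extract_file_links(json.load(response))
```

Keeping URL construction and response parsing separate from the network call means both can be checked without a running server.
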

2. /process

POST

/process Initiates transcript processing for the provided YouTube URL.

Request Body

Name	Type	Required	Description
url	String	YES	The YouTube URL (video, playlist, or channel).
type	String	YES	The type of URL (valid types: "video", "playlist", "channel").
webhookUrl	String	NO	The webhook URL to receive notifications (overrides environment variable).
language	String	NO	The language in which the video will be transcribed.

Curl Request

curl -X POST -H "Content-Type: application/json" -d '{"url": "<URL_HERE>", "type": "<TYPE_HERE>"}' http://localhost:8000/process

Response Format

Success:

{
  "error": false,
  "message": "Task initiated successfully",
  "data": {
    "batchId": "batch_id",
    "taskId": ["task_id_1", "task_id_2", ...]
  }
}

Failure:

{
  "error": true,
  "message": "Error message",
  "data": {}
}
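
The same request can be issued from Python. The sketch below is an assumption-laden illustration rather than an official client: `build_process_request` and `submit` are hypothetical names, and the optional fields are only included when a value is supplied.

```python
import json
from typing import Optional
from urllib.request import Request, urlopen

VALID_TYPES = {"video", "playlist", "channel"}  # valid types listed in the table above


def build_process_request(
    url: str,
    url_type: str,
    webhook_url: Optional[str] = None,
    language: Optional[str] = None,
    base_url: str = "http://localhost:8000",
) -> Request:
    """Build a POST /process request with a JSON body."""
    if url_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    body = {"url": url, "type": url_type}
    # Optional fields are sent only when provided.
    if webhook_url:
        body["webhookUrl"] = webhook_url
    if language:
        body["language"] = language
    return Request(
        f"{base_url}/process",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(url: str, url_type: str, **options) -> dict:
    """Send the request; requires a running yt-parser instance."""
    with urlopen(build_process_request(url, url_type, **options)) as response:
        return json.load(response)
```

Validating `type` on the client side catches typos before a task is queued. The accepted format of `language` values is not specified here, so any value passed is forwarded unchanged.
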

3. /status

GET

/status Retrieves the status of the processing tasks.

Query Parameters

Name	Type	Required	Description
taskId	String	NO	The task ID.
batchId	String	NO	The batch ID.

Curl Request

# If you have taskId
curl "http://localhost:8000/status?taskId=<TASK_ID_HERE>"

# If you have batchId
curl "http://localhost:8000/status?batchId=<BATCH_ID_HERE>"

# If you have both taskId and batchId (quote the URL so the shell does not interpret "&")
curl "http://localhost:8000/status?taskId=<TASK_ID_HERE>&batchId=<BATCH_ID_HERE>"

Response Format

taskId and batchId provided:

{
    "error": false,
    "message": "Transcription is in progress.",
    "data": {
        "batchId": "<batch_id>",
        "taskId": "<task_id>",
        "taskStatus": {
            "percent": 50,
            "total": 10,
            "stage": ""
        },
        "batchStatus": {
            "percent": 20,
            "total": 5
        }
    }
}

batchId provided:

{
    "error": false,
    "message": "Transcription is in progress.",
    "data": {
        "batchId": "<batch_id>",
        "batchStatus": {
            "percent": 20,
            "total": 5
        }
    }
}

taskId provided:

{
    "error": false,
    "message": "Transcription is in progress.",
    "data": {
        "batchId": "<batch_id>",
        "taskId": "<task_id>",
        "taskStatus": {
            "percent": 50,
            "total": 10,
            "stage": ""
        }
    }
}

Failure:

{
  "error": true,
  "message": "Error message"
}
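
A polling loop built on this endpoint could look like the sketch below. Several things here are assumptions rather than documented behavior: the helper names are hypothetical, at least one of `taskId` or `batchId` is assumed to be needed, and a `taskStatus.percent` of 100 is assumed to mean the task has finished.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlencode


def build_status_url(
    task_id: Optional[str] = None,
    batch_id: Optional[str] = None,
    base_url: str = "http://localhost:8000",
) -> str:
    """Build the /status URL from whichever identifiers are available."""
    params = {}
    if task_id:
        params["taskId"] = task_id
    if batch_id:
        params["batchId"] = batch_id
    if not params:
        # Assumption: the endpoint needs at least one identifier.
        raise ValueError("provide taskId, batchId, or both")
    return f"{base_url}/status?{urlencode(params)}"


def wait_for_task(fetch: Callable[[], dict], interval: float = 5.0, max_polls: int = 120) -> dict:
    """Poll until taskStatus.percent reaches 100 (assumed to mean finished)."""
    for _ in range(max_polls):
        payload = fetch()
        if payload.get("error"):
            raise RuntimeError(payload.get("message", "unknown error"))
        if payload["data"]["taskStatus"]["percent"] >= 100:
            return payload
        time.sleep(interval)
    raise TimeoutError("task did not finish within the polling budget")
```

`fetch` is any zero-argument callable that returns the decoded JSON response, for example a small function that opens `build_status_url(task_id="...")` with `urllib.request.urlopen` and passes the result to `json.load`.
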

WEBHOOK Format

The webhook is triggered on the following events:

A chunk is created successfully (either through the API or through explicit transcription).
An error occurs after the taskId is generated.

Webhook Response Format:

Success:

{
  "error": false,
  "data": {
    "batchId": "<batch_id>",
    "taskId": "<task_id>",
    "url": "<video url>",
    "chunks": [
      {
        "content": {
          "heading": "",
          "text": "Hey this is the transcript"
        }
      }
    ],
    "metadata": {
      "start": 0,
      "end": 240
    },
    "taskStatus": {"percent": 100, "stage": "Transcribing", "total": 4},
    "batchStatus": {"percent": 66, "total": 36}
  },
  "message": ""
}

Failure:

{
  "error": true,
  "data": {
    "batchId": "<batch_id>",
    "taskId": "<task_id>",
    "url": "<video url>",
    "chunks": [],
    "metadata": {},
    "taskStatus": {"percent": 100, "stage": "Transcribing", "total": 4},
    "batchStatus": {"percent": 66, "total": 36}
  },
  "message": "Error: "
}
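
On the receiving side, a handler for these payloads might be sketched as below. This is a hypothetical receiver, not part of yt-parser: it assumes the webhook is delivered as an HTTP POST with a JSON body, and the names `summarize_event`, `WebhookHandler`, `serve`, and port 9000 are illustrative only.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_event(payload: dict) -> dict:
    """Reduce a webhook payload to the fields a consumer typically needs."""
    data = payload.get("data", {})
    metadata = data.get("metadata") or {}  # empty on failure payloads
    return {
        "ok": not payload.get("error", False),
        "taskId": data.get("taskId"),
        "start": metadata.get("start"),
        "end": metadata.get("end"),
        "text": " ".join(chunk["content"]["text"] for chunk in data.get("chunks", [])),
        "message": payload.get("message", ""),
    }


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: the payload arrives as a JSON POST body.
        length = int(self.headers.get("Content-Length", 0))
        event = summarize_event(json.loads(self.rfile.read(length)))
        print(event)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 9000) -> None:
    """Run the receiver until interrupted."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the receiver. Its address would then be supplied as `webhookUrl` in the /process request or through the environment variable.
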
