PDF Parser

What Does PDF-Parser do?

PDF Parser provides an easy way to extract and chunk the information in a PDF file. It can extract text, tables, and even embedded images from the PDF. These chunks can be used for a variety of purposes, such as RAG for Large Language Models. The major steps in PDF Parser are:

  1. PDF -> MD - This is achieved using a fork of an external tool, Marker, which has been modified to integrate into our document flow
  2. MD parsing using Marked - This is a markdown parser which converts the MD to a temporary JSON, making it easier to chunk and manipulate the data.
  3. We detect the headings in the parsed MD and chunk the data into a CSV and a final JSON, adding metadata to each chunk such as startPage, endPage, pdfName, etc. (see the sketch after this list).
  4. After extracting the text, PDF Parser detects images in the PDF and embeds them into the previously created chunks.
  5. Finally, we summarize the content of the chunks.
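
To make step 3 concrete, here is a minimal sketch of heading-based chunking. This is an illustration of the idea only, not the project's actual implementation; the function name and everything except the startPage, endPage, and pdfName metadata fields are hypothetical.

# Illustrative sketch of heading-based chunking (step 3 above); not the
# project's actual code. Splits markdown at headings and attaches the
# per-chunk metadata described above (startPage, endPage, pdfName).
def chunk_markdown(md_text: str, pdf_name: str, start_page: int, end_page: int) -> list[dict]:
    chunks, heading, lines = [], None, []

    def flush():
        if lines:
            chunks.append({
                "heading": heading,
                "text": "\n".join(lines).strip(),
                "startPage": start_page,   # metadata added to every chunk
                "endPage": end_page,
                "pdfName": pdf_name,
            })

    for line in md_text.splitlines():
        if line.startswith("#"):  # a markdown heading starts a new chunk
            flush()
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks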

Steps involved in PDF-Parsing

  • Each page is extracted from the PDF (depending on the parameter passed for the number of pages to be processed at a time)
  • Text is extracted along with structural metadata (the bbox of each word) using PyMuPDF
  • If no text is extracted, or the extracted text has an unknown encoding, it is sent to Pytesseract and the text is OCRed for the required language (the language should be provided as input; functionality to predict the language is not yet built in and could be done using CLIP)
  • Extracted text is segmented into 'blocks' (lines of the same font) along with the bbox for each block
  • Blocks are sent to the object detection and segment detectors (layout detector and column detector)
  • Each block is classified into heading, text line, or table row/column/heading
  • LaTeX equations are converted into the right .md format
  • The post-processing for converting text to .md format is done for each block
  • Blocks are stitched together for a page, and page numbers are added to each page
  • The pages are stitched together
  • The .md of the entire batch of pages is converted to JSON format with the required structure (content is mapped to headings and subheadings, subheadings to headings, etc.)
  • The JSON is flattened out to a CSV with chunks of content mapped to headings
  • The chunks are post-processed: big chunks are broken down into smaller ones and small chunks are combined, to maintain uniformity of size (see the sketch after this list)
  • The chunks are embedded and stored in a vector DB; summaries are generated if required, and translations are done if required
  • Images are detected across all pages of the PDF and sent to GPT for summarization and to MinIO for storage
  • The MinIO links for the images are inserted back into the chunked CSV along with embeddings of the image summaries
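
The size-uniformity step can be sketched as follows. This is an illustration with a hypothetical threshold (MAX_CHARS); the actual splitting and merging logic in the project may differ.

# Illustrative sketch of chunk-size normalization: oversized chunks are
# split and undersized neighbours merged, so chunks end up roughly uniform.
# MAX_CHARS is a hypothetical threshold, not a value from the project.
MAX_CHARS = 1000

def normalize_chunks(texts: list[str]) -> list[str]:
    # 1. Break oversized chunks into pieces of at most MAX_CHARS characters.
    pieces = [text[i:i + MAX_CHARS]
              for text in texts
              for i in range(0, len(text), MAX_CHARS)]

    # 2. Merge consecutive small pieces while they still fit the budget.
    merged: list[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + 1 + len(piece) <= MAX_CHARS:
            merged[-1] = merged[-1] + "\n" + piece
        else:
            merged.append(piece)
    return merged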

Setting up locally

Clone the repository locally

git clone https://github.com/BharatSahAIyak/pdf-parser/ && cd pdf-parser

Create and activate venv. Ex:

python3.10 -m venv venv
source venv/bin/activate

This project uses Poetry for dependency management.

pip install poetry==1.7.0
poetry install

Initialize the git submodules for this project

git submodule update --init --recursive

Install the requirements for the submodules

cd marker_final && poetry install

For parsing Hindi PDFs, install Tesseract on your system as well

apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-hin

Setup a .env file in the root directory.

cp sample.env .env

Update the created .env file with your OPENAI and BHASHINI keys, along with the MINIO information

TESSDATA_PREFIX is the location of the Tesseract language data (tessdata) on your system

Start the Docker Containers

  • Redis
docker run -d -p 6379:6379 --name my-redis redis
  • MinIO
mkdir -p ~/minio/data

docker run \
-p 9000:9000 \
-p 9001:9001 \
--name minio \
-v ~/minio/data:/data \
-e "MINIO_ROOT_USER=ROOTNAME" \
-e "MINIO_ROOT_PASSWORD=CHANGEME123" \
quay.io/minio/minio server /data --console-address ":9001"

Start the Celery Worker

celery -A worker worker --loglevel=info

Start the Web Server

uvicorn main:app --workers 4 --host 0.0.0.0 --port 8000 --reload

Architecture

API specification

Download

GET
/download/ Generate download links for a particular `taskId`

This endpoint returns the MinIO links for the processed parts associated with a given taskId

Query Parameters

Name   | Type   | Required | Description
-------|--------|----------|------------
taskId | String | YES      | The taskId returned when starting a conversion with /process/
format | String | NO       | Specifies the type of response (md, json, csv, or lexer)

When format is not passed, all three formats (md, csv, and json) are returned

Example request using cURL

When `format` is passed in the query:

curl --location 'localhost:8000/download/?format=json&taskId=<taskId>'

{
  "data": {
    "batchId": "8101a566-2da0-4691-9c2d-184e8b439ecf",
    "taskId": "828c22d3-b103-460e-b8ae-c1ead28f4186",
    "url": "<URL for original file>",
    "files": {
      "828c22d3-b103-460e-b8ae-c1ead28f4186": [
        {
          "start": "0",
          "end": "2",
          "url": {
            "json": "<URL FOR THE JSON FILE>"
          }
        }
      ]
    }
  },
  "error": false
}

When `format` is not passed in the query:

curl --location 'localhost:8000/download/?taskId=<taskId>'

{
  "data": {
    "batchId": "8101a566-2da0-4691-9c2d-184e8b439ecf",
    "taskId": "828c22d3-b103-460e-b8ae-c1ead28f4186",
    "url": "PDF URL",
    "files": {
      "828c22d3-b103-460e-b8ae-c1ead28f4186": [
        {
          "start": "0",
          "end": "2",
          "url": {
            "pdf": "URL FOR THE PART",
            "md": "MD URL",
            "csv": "CSV URL",
            "json": "JSON URL"
          }
        }
      ]
    }
  },
  "error": false
}

Process

POST
/process/ Upload a batch of PDFs and generate chunks

Request Body

Name       | Type    | Required | Description
-----------|---------|----------|------------
language   | String  | YES      | Language of the uploaded PDF (english or hindi)
file       | File    | NO       | The PDF file to be processed
url        | String  | NO       | Link to a PDF file
split      | Integer | NO       | The number of pages to process from the PDF in one iteration
to         | String  | NO       | The final stage of processing (md to stop at MD; the default end stage is JSON)
webhookURL | String  | NO       | The webhook to hit on completion of a PDF part
summarize  | Boolean | NO       | Whether summarization should be done after simple parsing
images     | Boolean | NO       | Whether images should be extracted from the PDF after simple parsing

Only one of file or url can be processed at a time, so exactly one of them must be provided per request

Example Request using cURL

curl --location 'localhost:8000/process/' \
--form 'file=@"/home/shady/example_english.pdf"' \
--form 'language="English"'

curl --location 'localhost:8000/process/' \
--form 'url="https://ncert.nic.in/pdf/NCFSE-2023-August_2023.pdf"' \
--form 'language="English"' \
--form 'split="50"'

Response

Success Response:

{
  "data": {
    "batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
    "taskId": ["322b9125-15e4-4e94-b4bf-13b6f79f8acb"]
  },
  "error": false
}

Failure Response:

{
  "message": "Something went wrong. Please try again after some time.",
  "error": true,
  "data": {
    "errorId": "<ERROR_ID>"
  }
}
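
The same upload can be made from Python. Below is a minimal sketch using the requests library; the file path is a placeholder, and the response fields follow the success response above.

# Minimal sketch of calling /process/ from Python with the requests library.
# The file path is a placeholder; adjust the host and port to your setup.
import requests

with open("example_english.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/process/",
        files={"file": f},
        data={"language": "English", "split": "50"},
    )
resp.raise_for_status()
body = resp.json()
batch_id = body["data"]["batchId"]
task_ids = body["data"]["taskId"]  # a list of taskIds, per the response above
print(batch_id, task_ids)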

Status

GET
/status/ Retrieve the status using batchId or taskId

Query Parameters

Name    | Type   | Required | Description
--------|--------|----------|------------
taskId  | String | NO       | One of the taskIds returned when starting a batch conversion with /process/
batchId | String | NO       | The batchId returned when starting a conversion using /process/

You can pass either one or both of taskId and batchId; at least one is required

Example requests using cURL

With both `taskId` and `batchId` passed:

curl --location 'localhost:8000/status/?taskId=<taskID>&batchId=<batchId>'

{
  "error": false,
  "data": {
    "batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
    "taskId": "322b9125-15e4-4e94-b4bf-13b6f79f8acb",
    "taskStatus": {
      "percent": 100,
      "total": 2,
      "stage": "simple-parsing"
    },
    "batchStatus": {
      "percent": 100,
      "total": "1"
    }
  },
  "message": ""
}

With `batchId` passed:

curl --location 'localhost:8000/status/?batchId=<batchId>'

{
  "error": false,
  "data": {
    "batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
    "batchStatus": {
      "percent": 100,
      "total": "1"
    }
  },
  "message": "SUCCESS"
}

With just `taskId`:

curl --location 'localhost:8000/status/?taskId=<taskId>'

{
  "error": false,
  "data": {
    "taskId": "322b9125-15e4-4e94-b4bf-13b6f79f8acb",
    "batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
    "taskStatus": {
      "percent": 100,
      "total": 2,
      "stage": "simple-parsing"
    }
  },
  "message": "SUCCESS"
}
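
Putting /status/ and /download/ together, a client can poll until a task finishes and then fetch its links. A minimal sketch (the polling interval is arbitrary; field access follows the example responses above):

# Sketch of polling /status/ until a task finishes, then fetching its
# download links via /download/. Field access follows the examples above.
import time
import requests

BASE = "http://localhost:8000"

def wait_and_download(task_id: str, interval: float = 5.0) -> dict:
    while True:
        status = requests.get(f"{BASE}/status/", params={"taskId": task_id}).json()
        if status["data"]["taskStatus"]["percent"] == 100:
            break
        time.sleep(interval)
    # With `format` omitted, all three formats (md, csv, json) are returned.
    return requests.get(f"{BASE}/download/", params={"taskId": task_id}).json()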

Connecting to a personal service

The /process/ endpoint takes an optional parameter, webhookURL, which is hit every time a split of a PDF is completed. The payload for this is:

{
  "error": false,
  "data": {
    "batchId": "", // UUID for the batch of PDFs being processed
    "taskId": "a57313b3-4293-4d9f-98cb-4be366ec23ff", // UUID for the PDF
    "url": "", // MinIO link for the PDF
    "chunks": [ // contains the extracted data from the PDF
      {
        "content": {
          "heading": "",
          "text": ""
        },
        "id": ""
      }
    ],
    "metadata": {
      "end": 2,
      "start": 0,
      "id": "" // ID of the part being processed
    },
    "taskStatus": {
      "percent": 100,
      "total": "", // total number of pages in the PDF being processed
      "stage": "simple-parsing"
    },
    "batchStatus": {
      "percent": 100, // completion percentage for the batch
      "total": "" // total number of PDFs in the batch
    }
  },
  "message": ""
}
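
On the receiving side, a minimal webhook handler might look like the sketch below. FastAPI is an assumption here (any web framework that accepts a JSON POST works); the route name is arbitrary, and the field access follows the payload shape above.

# Minimal FastAPI sketch of a webhook receiver for the payload above.
# The route name is arbitrary; field access follows the documented payload.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/pdf-parser-webhook")
async def receive_part(request: Request):
    payload = await request.json()
    data = payload["data"]
    print(f"part {data['metadata']['id']} of task {data['taskId']} done: "
          f"pages {data['metadata']['start']}-{data['metadata']['end']}, "
          f"{len(data['chunks'])} chunks")
    return {"ok": True}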