PDF Parser
What Does PDF-Parser do?
PDF Parser provides an easy way to extract and chunk the information from a PDF file. It can extract text and tables, and even embed images from the PDF file. This chunked information can be used for a variety of purposes, such as RAG for Large Language Models. The major steps in PDF Parser are:
- PDF -> MD - This is achieved using a fork of an external tool, Marker, which has been modified to integrate into our document flow.
- MD parsing using Marked - This is a markdown parser which converts the MD to a temporary JSON, which helps in easier chunking and manipulation of the data.
- We detect the headings from the parsed MD and chunk the data into CSV and a final JSON, while adding metadata to each chunk like the startPage, endPage, pdfName etc.
- After detecting the text, PDF Parser starts detecting images in the PDF and embeds them into the previously created chunks.
- Finally, we summarize the content of the chunks.
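The heading-based chunking step can be illustrated with a minimal Python sketch. It is not the project's implementation: a regular expression stands in for the Marked parser, the function name `chunk_markdown` is hypothetical, and the page range of the whole batch is attached to every chunk as a simplification. Only the metadata keys `startPage`, `endPage` and `pdfName` come from the description above.

```python
import re

# Matches ATX-style markdown headings such as "# Title" or "### Sub-section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(md_text, pdf_name, start_page, end_page):
    """Split markdown into one chunk per heading and attach metadata.

    Hypothetical helper: the real pipeline parses the markdown with Marked
    and tracks pages per chunk; here the batch page range is reused as-is.
    """
    chunks = []
    current = {"heading": "", "lines": []}

    def flush():
        # Emit the chunk collected so far, skipping completely empty ones.
        text = "\n".join(current["lines"]).strip()
        if current["heading"] or text:
            chunks.append({
                "heading": current["heading"],
                "text": text,
                "startPage": start_page,
                "endPage": end_page,
                "pdfName": pdf_name,
            })

    for line in md_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # a new heading closes the previous chunk
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return chunks
```

Calling `chunk_markdown("# Intro\nHello\n\n## Details\nMore text\n", "a.pdf", 0, 2)` yields two chunks, one per heading, each carrying the same metadata.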
Steps involved in PDF-Parsing
- Each page is extracted from the PDF (depending on the parameters passed for the number of pages to be processed at a time)
- Text is extracted along with the structure metadata (bbox of each word) using PyMuPDF
- If no text is extracted, or the extracted text has an unknown encoding, the page is sent to Pytesseract and OCRed for the required language (the language must be provided at input; language prediction is not yet built in, but could be done using CLIP)
- Extracted text is segmented into 'blocks' (lines of the same font) along with the bbox for each block
- Blocks are sent to object detection and segment detector (Layout detector and column detector )
- Each block is classified as a heading, text line, or table row/column/heading
- LaTeX equations are converted into the right .md format
- The post-processing for converting text to .md format is done for each block
- Blocks are stitched together for a page and Page numbers are added to each page
- The pages are stitched together
- The .md of the entire batch of pages is converted to JSON format with the required structure (content is mapped to headings and subheadings, subheadings to headings, etc.)
- The JSON is flattened out to a CSV with chunks of content mapped to headings
- The chunks are post-processed to break big chunks into smaller ones and to combine small chunks into one, to maintain uniformity of size
- The chunks are embedded and stored in a vector DB; summaries are generated if required, and translations are done if required
- Images are detected for all pages through the pdf and sent to GPT for summarization and to minio for storage
- The minio links for the image are inserted back into the chunked csv along with embeddings of the image summaries
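The chunk post-processing step (splitting big chunks and combining small ones) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: size is measured in characters, the thresholds `min_chars` and `max_chars` are made-up defaults, splitting happens on blank lines, and merging is limited to neighbouring chunks that share a heading.

```python
def normalize_chunks(chunks, min_chars=200, max_chars=1000):
    """Split oversized chunks and merge undersized neighbours.

    Each chunk is a dict with at least "heading" and "text" keys.
    A single paragraph longer than max_chars is left as it is.
    """
    # Pass 1: split chunks that are too large on paragraph boundaries.
    split = []
    for chunk in chunks:
        text = chunk["text"]
        if len(text) <= max_chars:
            split.append(dict(chunk))
            continue
        piece = ""
        for para in text.split("\n\n"):
            candidate = f"{piece}\n\n{para}" if piece else para
            if piece and len(candidate) > max_chars:
                split.append({**chunk, "text": piece})
                piece = para
            else:
                piece = candidate
        if piece:
            split.append({**chunk, "text": piece})

    # Pass 2: merge a small chunk with the next one under the same heading,
    # as long as the combined text still fits within max_chars.
    merged = []
    for chunk in split:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["heading"] == chunk["heading"]
            and len(prev["text"]) < min_chars
            and len(prev["text"]) + len(chunk["text"]) + 2 <= max_chars
        ):
            prev["text"] = prev["text"] + "\n\n" + chunk["text"]
        else:
            merged.append(dict(chunk))
    return merged
```

Measuring size in tokens instead of characters would only change the `len(...)` calls; the two-pass structure stays the same.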
Setting up locally
Clone the repository locally
git clone https://github.com/BharatSahAIyak/pdf-parser/ && cd pdf-parser
Create and activate venv. Ex:
python3.10 -m venv venv
source venv/bin/activate
This project uses poetry.
pip install poetry==1.7.0
poetry install
Initialize the git submodules for this project
git submodule update --init --recursive
Install the requirements for the submodules
cd marker_final && poetry install
For parsing Hindi PDFs, install Tesseract on your system as well
apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-hin
Set up a .env file in the root directory.
cp sample.env .env
Update the created .env file with your OPENAI and BHASHINI keys along with the MINIO information. TESSDATA_PREFIX is the location of the tesseract installation on your system.
Start the Docker Containers
- REDIS
docker run -d -p 6379:6379 --name my-redis redis
- MinIO
mkdir -p ~/minio/data
docker run \
-p 9000:9000 \
-p 9001:9001 \
--name minio \
-v ~/minio/data:/data \
-e "MINIO_ROOT_USER=ROOTNAME" \
-e "MINIO_ROOT_PASSWORD=CHANGEME123" \
quay.io/minio/minio server /data --console-address ":9001"
Start the Celery Worker
celery -A worker worker --loglevel=info
Start the Web Server
uvicorn main:app --workers 4 --host 0.0.0.0 --port 8000 --reload
Architecture
API specification
Download
This endpoint returns the MinIO links for the processed parts associated with a given taskId.
Query Parameters
Name | Type | Required | Description |
---|---|---|---|
taskId | String | YES | The taskId returned when starting a conversion with /process/ |
format | String | NO | Specifies the type of response (md, json, csv, or lexer) |
When format is not passed, all 3 formats (md, csv and json) are returned.
Example request using cURL
When `format` is passed in the query
curl --location 'localhost:8000/download/?format=json&taskId=<taskId>'
{
"data": {
"batchId": "8101a566-2da0-4691-9c2d-184e8b439ecf",
"taskId": "828c22d3-b103-460e-b8ae-c1ead28f4186",
"url": "<URL for original file>",
"files": {
"828c22d3-b103-460e-b8ae-c1ead28f4186": [
{
"start": "0",
"end": "2",
"url": {
"json": "<URL FOR THE JSON FILE>"
}
}
]
}
},
"error": false
}
When `format` is not passed in the query
curl --location 'localhost:8000/download/?taskId=<taskId>'
{
"data": {
"batchId": "8101a566-2da0-4691-9c2d-184e8b439ecf",
"taskId": "828c22d3-b103-460e-b8ae-c1ead28f4186",
"url": "PDF URL",
"files": {
"828c22d3-b103-460e-b8ae-c1ead28f4186": [
{
"start": "0",
"end": "2",
"url": {
"pdf": "URL FOR THE PART",
"md": "MD URL",
"csv": "CSV URL",
"json": "JSON URL"
}
}
]
}
},
"error": false
}
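A client will usually want the per-part links out of this response. The sketch below assumes only the response shape shown in the examples above; the helper name `collect_part_urls` is hypothetical and not part of the service.

```python
def collect_part_urls(response, fmt="json"):
    """Return sorted (start, end, url) tuples for one format.

    `response` is the decoded JSON body of a /download/ call. Parts that
    do not carry a link for `fmt` are skipped.
    """
    if response.get("error"):
        raise ValueError("download request reported an error")
    parts = []
    for task_parts in response["data"]["files"].values():
        for part in task_parts:
            url = part["url"].get(fmt)
            if url is not None:
                # "start" and "end" arrive as strings in the examples above.
                parts.append((int(part["start"]), int(part["end"]), url))
    return sorted(parts)


sample = {
    "data": {
        "batchId": "8101a566-2da0-4691-9c2d-184e8b439ecf",
        "taskId": "828c22d3-b103-460e-b8ae-c1ead28f4186",
        "url": "PDF URL",
        "files": {
            "828c22d3-b103-460e-b8ae-c1ead28f4186": [
                {"start": "0", "end": "2", "url": {"md": "MD URL", "json": "JSON URL"}}
            ]
        },
    },
    "error": False,
}
print(collect_part_urls(sample, fmt="md"))
```

Sorting by page range keeps the parts in document order when a PDF was processed in several splits.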
Process
Request Body
Name | Type | Required | Description |
---|---|---|---|
language | String | YES | Language of the uploaded PDF (english or hindi) |
file | File | YES | The PDF file to be processed |
url | String | YES | Link to a PDF file |
split | Integer | NO | The number of pages to process from the PDF in one iteration |
to | String | NO | The final stage of processing (md to stop at markdown; the default end stage is JSON) |
webhookURL | String | NO | The webhook to hit on completion of a PDF part |
summarize | Boolean | NO | Controls whether summarization should be done after simple parsing |
images | Boolean | NO | Controls whether images should be extracted from the PDF after simple parsing |
Pass either file or url; only one of them can be processed at a time.
Example Request using cURL
curl --location 'localhost:8000/process/' \
--form 'file=@"/home/shady/example_english.pdf"' \
--form 'language="English"'
curl --location 'localhost:8000/process/' \
--form 'url="https://ncert.nic.in/pdf/NCFSE-2023-August_2023.pdf"' \
--form 'language="English"' \
--form 'split="50"'
Response
Success Response
{
"data": {
"batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
"taskId": ["322b9125-15e4-4e94-b4bf-13b6f79f8acb"]
},
"error": false
}
Failure Response
{
"message": "Something went wrong. Please try again after some time.",
"error": true,
"data": {
"errorId": "<ERROR_ID>"
}
}
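The same request can be made from Python. The sketch below is a hedged example rather than an official client: the function names are hypothetical, the third-party `requests` package is assumed to be installed, and the base URL assumes the local setup described earlier. It enforces the rule that only one of file or url is sent, and posts everything as multipart form data to mirror the `--form` flags in the cURL examples.

```python
import os


def build_process_fields(language, url=None, file_path=None, split=None, webhook_url=None):
    """Build the text form fields for /process/ and validate the input."""
    if (url is None) == (file_path is None):
        raise ValueError("pass exactly one of url or file_path")
    fields = {"language": language}
    if url is not None:
        fields["url"] = url
    if split is not None:
        fields["split"] = str(split)
    if webhook_url is not None:
        fields["webhookURL"] = webhook_url
    return fields


def submit(fields, file_path=None, base_url="http://localhost:8000"):
    """POST the fields (and optionally a PDF) as multipart form data."""
    import requests  # third-party dependency, assumed to be installed

    # A (None, value) tuple makes requests send a plain multipart field.
    parts = {name: (None, value) for name, value in fields.items()}
    endpoint = f"{base_url}/process/"
    if file_path is None:
        response = requests.post(endpoint, files=parts, timeout=60)
    else:
        with open(file_path, "rb") as handle:
            parts["file"] = (os.path.basename(file_path), handle, "application/pdf")
            response = requests.post(endpoint, files=parts, timeout=60)
    response.raise_for_status()
    return response.json()
```

For example, `submit(build_process_fields("English", url="https://ncert.nic.in/pdf/NCFSE-2023-August_2023.pdf", split=50))` corresponds to the second cURL request above.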
Status
Query Parameters
Name | Type | Required | Description |
---|---|---|---|
taskId | String | YES | One of the taskIds returned when starting a batch conversion with /process/ |
batchId | String | YES | The batchId returned when starting a conversion using /process/ |
You can pass either one or both of taskId and batchId.
Example requests using cURL
With both `taskId` and `batchId` passed
curl --location 'localhost:8000/status/?taskId=<taskID>&batchId=<batchId>'
{
"error": false,
"data": {
"batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
"taskId": "322b9125-15e4-4e94-b4bf-13b6f79f8acb",
"taskStatus": {
"percent": 100,
"total": 2,
"stage": "simple-parsing"
},
"batchStatus": {
"percent": 100,
"total": "1"
}
},
"message": ""
}
With `batchId` passed
curl --location 'localhost:8000/status/?batchId=<batchId>'
{
"error": false,
"data": {
"batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
"batchStatus": {
"percent": 100,
"total": "1"
}
},
"message": "SUCCESS"
}
With just `taskId`
curl --location 'localhost:8000/status/?taskId=<taskId>'
{
"error": false,
"data": {
"taskId": "322b9125-15e4-4e94-b4bf-13b6f79f8acb",
"batchId": "ffd20e4f-da81-4a1c-8aba-1a774c0ed27a",
"taskStatus": {
"percent": 100,
"total": 2,
"stage": "simple-parsing"
}
},
"message": "SUCCESS"
}
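A caller that does not use the webhook can poll this endpoint instead. The sketch below uses only the standard library and rests on two assumptions that the examples suggest but do not state: a batch is finished once `batchStatus.percent` reaches 100, and the server runs at the local address used earlier. The function names, polling interval and timeout are illustrative.

```python
import json
import time
import urllib.parse
import urllib.request


def http_fetch_status(batch_id, base_url="http://localhost:8000"):
    """GET /status/ for one batch and return the decoded JSON body."""
    query = urllib.parse.urlencode({"batchId": batch_id})
    with urllib.request.urlopen(f"{base_url}/status/?{query}", timeout=30) as resp:
        return json.load(resp)


def wait_for_batch(fetch_status, batch_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll until the batch reports 100 percent, then return that response.

    `fetch_status` is any callable taking a batchId; passing it in keeps
    the loop testable without a running server.
    """
    deadline = time.monotonic() + timeout
    while True:
        response = fetch_status(batch_id)
        if response.get("error"):
            raise RuntimeError(f"status check failed: {response.get('message')}")
        if response["data"]["batchStatus"]["percent"] >= 100:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"batch {batch_id} did not finish in {timeout} seconds")
        sleep(interval)
```

In practice this would be called as `wait_for_batch(http_fetch_status, "<batchId>")`.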
Connecting to personal service
In the /process/ endpoint, there is an optional parameter, webhookURL, which is hit every time a split of a PDF is completed. The payload for this is:
{
"error": false,
"data": {
"batchId": "", // UUID for the batch of PDFs being processed
"taskId": "a57313b3-4293-4d9f-98cb-4be366ec23ff", // UUID for the PDF
"url": "", // MinIO link for the PDF
"chunks": [ // contains the extracted data from the PDF
{
"content": {
"heading": "",
"text": ""
},
"id": ""
}
],
"metadata": {
"end": 2,
"start": 0,
"id": "" // ID of the part being processed
},
"taskStatus": {
"percent": 100,
"total": "", // total number of pages in the PDF being processed
"stage": "simple-parsing"
},
"batchStatus": {
"percent": "",
"total": "" // total number of PDFs in the batch
}
},
"message": ""
}
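A receiving service could unpack this payload along the following lines. This is a sketch under assumptions, not a reference integration: the payload is assumed to arrive as a JSON body in a POST request (the HTTP method is not stated above), and the names `extract_chunks` and `WebhookHandler` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler


def extract_chunks(payload):
    """Flatten a webhook payload into one row per chunk."""
    if payload.get("error"):
        raise ValueError(payload.get("message") or "webhook reported an error")
    data = payload["data"]
    meta = data["metadata"]
    return [
        {
            "taskId": data["taskId"],
            "partId": meta["id"],
            "start": meta["start"],
            "end": meta["end"],
            "chunkId": chunk["id"],
            "heading": chunk["content"]["heading"],
            "text": chunk["content"]["text"],
        }
        for chunk in data["chunks"]
    ]


class WebhookHandler(BaseHTTPRequestHandler):
    """Minimal receiver; assumes the parser POSTs the payload as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        rows = extract_chunks(payload)
        print(f"received {len(rows)} chunks for task {payload['data']['taskId']}")
        self.send_response(204)  # acknowledge without a response body
        self.end_headers()
```

The handler can be served with `http.server.HTTPServer(("0.0.0.0", 9100), WebhookHandler).serve_forever()`, where the port is an arbitrary choice, and the resulting address passed as webhookURL to /process/.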