Document classification

The Classifier algorithm finds documents in an image and assigns each one a document type. We support a wide range of documents for Eastern Europe and can train our algorithms to process any custom document type from a provided dataset (as we did for invoices, tax forms, and COVID forms).

Algorithm of the API /classify method

The algorithm looks for rectangular shapes on the incoming image that look like documents and cuts them out.
The Classifier assigns a class to each cutout area, such as Passport or Driver’s Licence. A list of currently supported document types is available at the link.
The algorithm evaluates the orientation of the document in space. If necessary, the classifier rotates or mirrors the document.

Types of cutout areas that the classifier does not rotate or mirror:

other - document of unknown type;
not_document - not a document, for example, a photo of a cat;
empty - empty page.
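
The rule above amounts to a membership check on the document type. The following is a minimal sketch, assuming the type strings appear in the response exactly as listed; the names NON_ORIENTED_TYPES and may_be_reoriented are hypothetical and not part of the API.

```python
# Types the classifier leaves untouched (no rotation or mirroring), per the list above.
NON_ORIENTED_TYPES = frozenset({"other", "not_document", "empty"})


def may_be_reoriented(doc_type: str) -> bool:
    """Return True if a cutout of this type may have been rotated or mirrored."""
    return doc_type not in NON_ORIENTED_TYPES
```

A client could use such a check to decide whether the orientation of a returned crop can be relied on.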

API specification

Below is the API specification for the document classification method. For more details on how to compose a classification request, see Connecting and testing.

classify

POST https://latest.handl.io/classify
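
As an illustration of how the query string for this endpoint might be composed, here is a minimal sketch using only the Python standard library. It builds the URL and does not send anything: how the file is attached to the POST body is not described in this section (see Connecting and testing). Two points are assumptions, not documented behavior: booleans are serialized as lowercase true/false, and array parameters such as doc_type are sent as repeated keys. The helper name build_classify_url is hypothetical.

```python
from urllib.parse import urlencode

CLASSIFY_URL = "https://latest.handl.io/classify"


def build_classify_url(params: dict) -> str:
    """Build the /classify URL with the given query parameters.

    A dict is used instead of keyword arguments because `async`
    is a reserved word in Python.
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue  # skip unset parameters so the server default applies
        if isinstance(value, bool):
            # Assumption: booleans are sent as lowercase strings.
            query[name] = "true" if value else "false"
        else:
            query[name] = value
    if not query:
        return CLASSIFY_URL
    # doseq=True turns list values into repeated keys (assumed array encoding).
    return CLASSIFY_URL + "?" + urlencode(query, doseq=True)
```

For example, build_classify_url({"async": False, "dpi": 300}) produces https://latest.handl.io/classify?async=false&dpi=300.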

Query Parameters

| Name | Type | Description |
|---|---|---|
| min_shape | integer | >0, default 256. Minimum size of the image's short side, in pixels. If the image is smaller, low_image_resolution in the response is true; otherwise false. |
| min_filesize | integer | >0, default 10240. Minimum image file size, in bytes. If the file is smaller, low_image_weight in the response is true; otherwise false. |
| max_exposure_score | number | >0, default 0.4. Maximum exposure of the image. If the exposure is higher, image_exposure in the response is overexposed; otherwise normal. |
| min_exposure_score | number | >0, default 0.05. Minimum exposure of the image. If the exposure is lower, image_exposure in the response is underexposed; otherwise normal. |
| max_blur_score | number | >0, default 2. Threshold for the image clarity coefficient. If the coefficient is lower, image_blured in the response is true; otherwise false. |
| doc_type | array | A list of document types to search for in the input file. Use it for narrowly scoped processes, for example when only the main spread of a passport needs to be found in a document stream and no other types need to be processed. By default, all types available in the classifier are selected. |
| priority | integer | >0, default 1. Priority of an asynchronous task in the processing queue. |
| simple_cropper | boolean | false (default): the simplified algorithm for cutting documents out of images is not used. true: the simplified algorithm is used; it is faster, but documents may be cut out less accurately, especially on images with a complex background. |
| async | boolean | true: requests are processed asynchronously. false: requests are processed synchronously. |
| check_fake_experimental | boolean | Deprecated; not used. |
| check_fake | boolean | true: the algorithm searches the file metadata for signs of modification in digital editors and returns the result in a separate field called fake. false: the metadata check is disabled. |
| pdf_raw_images | boolean | true: the decision whether to rasterize PDF files is left to the auto_pdf_raw_images parameter. false: all PDF files are rasterized and the value of auto_pdf_raw_images is ignored. |
| auto_pdf_raw_images | boolean | true: the algorithm decides whether to rasterize PDF files. false: the algorithm never rasterizes PDF files. |
| dpi | integer | >0, default 300. Number of pixels per inch used for PDF rasterization. We recommend 300; higher values usually do not improve quality but increase the image file size. |
| quality | integer | 0-100, default 75. JPEG compression quality used for PDF rasterization. We recommend 75 as a balance between image file size and quality. |
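
The quality thresholds above can be read as simple comparisons. The sketch below shows that reading with the documented default values. It is not the service's implementation: how the exposure and clarity scores are computed is not documented here, so they are taken as inputs; the behavior when a value exactly equals a threshold is an assumption; and mapping the short-side check to low_image_resolution follows the field names in the response. The names QualityThresholds and quality_flags are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QualityThresholds:
    """Documented default values of the quality-related query parameters."""
    min_shape: int = 256
    min_filesize: int = 10240
    max_exposure_score: float = 0.4
    min_exposure_score: float = 0.05
    max_blur_score: float = 2.0


def quality_flags(short_side_px: int, filesize_bytes: int,
                  exposure: float, clarity: float,
                  thresholds: Optional[QualityThresholds] = None) -> dict:
    """Map raw measurements to the quality fields of the response."""
    t = thresholds or QualityThresholds()
    if exposure > t.max_exposure_score:
        image_exposure = "overexposed"
    elif exposure < t.min_exposure_score:
        image_exposure = "underexposed"
    else:
        image_exposure = "normal"
    return {
        "low_image_resolution": short_side_px < t.min_shape,
        "low_image_weight": filesize_bytes < t.min_filesize,
        "image_exposure": image_exposure,
        # A clarity coefficient below the threshold is reported as blurred.
        "image_blured": clarity < t.max_blur_score,
    }
```

Passing a custom QualityThresholds instance corresponds to overriding the matching query parameters in the request.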

The request was successfully processed.

{
  "detail": [ // technical information
    {
      "loc": [
        "string"
      ],
      "msg": "string",
      "type": "string"
    }
  ],
  "items": [
    {
      "document": {
        "type": "bank_card", // document type
        "page": 0, // page number of the input file where the document was found
        "rotation": 0, // one of 8 orientations: 4 rotations in 90-degree steps x 2 mirroring options
        "coords": [ // coordinates of the document image in the input file
          [
            0
          ]
        ]
      },
      "crop": "string", // image of the document in binary format
      "image_exposure": "normal", // document exposure
      "image_blured": false, // document clarity
      "low_image_resolution": true, // document resolution
      "low_image_weight": true // document image weight
    }
  ],
  "task_id": null, // internal task ID
  "code": null, // error code
  "message": null, // error message within the object
  "errno": null, // error number
  "traceback": null, // error message within the limits of the object
  "fake": true, // result of the check when check_fake = true
  "pages_count": 1, // number of pages in the input file
  "docs_count": 1 // number of documents in the input file
}
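
To show how a client might read the quality fields of this response, here is a minimal sketch that works on the already-decoded JSON body (a Python dict). The field names come from the response above; the helper name quality_issues and the wording of the issue labels are hypothetical.

```python
def quality_issues(response: dict) -> list:
    """Return (item index, document type, list of issues) for each item."""
    result = []
    for index, item in enumerate(response.get("items") or []):
        issues = []
        exposure = item.get("image_exposure", "normal")
        if exposure != "normal":
            issues.append(exposure)  # "overexposed" or "underexposed"
        if item.get("image_blured"):
            issues.append("blurred")
        if item.get("low_image_resolution"):
            issues.append("low resolution")
        if item.get("low_image_weight"):
            issues.append("low file size")
        doc_type = (item.get("document") or {}).get("type")
        result.append((index, doc_type, issues))
    return result
```

For the sample response above, this yields one entry for the bank_card item with the issues "low resolution" and "low file size".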

The request failed validation.

{
  "detail": [
    {
      "loc": [
        "string"
      ],
      "msg": "string",
      "type": "string"
    }
  ]
}
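
The detail entries in the body above can be flattened into readable messages. A minimal sketch, assuming loc lists the path to the offending field and that the body has already been decoded into a dict; the helper name format_detail is hypothetical.

```python
def format_detail(body: dict) -> list:
    """Turn each entry of "detail" into a "location: message (type)" string."""
    lines = []
    for entry in body.get("detail") or []:
        location = ".".join(str(part) for part in entry.get("loc", []))
        lines.append(f"{location}: {entry.get('msg', '')} ({entry.get('type', '')})")
    return lines
```

With the placeholder body shown above, the result is a single line: "string: string (string)".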
