Introduction to Amazon Textract

by Kurt Feeley

Amazon Textract is a fully managed, machine learning based document analysis service. It extracts printed text and handwriting from scanned documents, and it can also identify structured data such as tables, forms, and key-value pairs. Amazon Textract’s pretrained features let you get automated document processing up and running quickly. If the pretrained features do not meet your document processing requirements, Amazon Textract allows you to customize them to fit your documents.

Amazon Textract excels at automating what may be manual processes. For example, as part of a loan approval process, Amazon Textract could be used to scan pay stubs and bank statements to provide faster and more efficient loan decisions.

Photo by Christina Radevich on Unsplash

The Solution

In this tutorial, we’ll walk you through building a simple app that uses Amazon Textract to analyze a document stored in Amazon S3, extracts the text, and prints the extracted text to the screen.

Remember, for any example solution from AWS with .NET, we focus on the code that exemplifies the problem we are trying to solve. We don’t include logging, input validation, exception handling, etc., and we sometimes embed the configuration data within classes instead of using environment variables, configuration files, key/value stores and the like. These items should not be skipped for proper solutions.

Prerequisites

To complete this solution, you will need the .NET CLI, which is included in the .NET SDK. You will also need an AWS IAM user with programmatic access and permissions to call Amazon Textract and to read the document from Amazon S3. Finally, you will need to install the AWS CLI and configure your environment with that user’s credentials.

Warning: some AWS services may have fees associated with them.

Our Dev Environment

This tutorial was developed using Ubuntu 24.04.3, .NET 8 SDK and Visual Studio Code 1.108.1. Some commands/constructs may vary across systems.

Developing the .NET Amazon Textract Application

S3 Document Storage

To run through this “step-by-step”, we will first need to upload a document to S3 which contains text. For this article, we are uploading a PDF which contains the text, “This is a test”.

You can use multiple methods to upload a file to S3, including the console, SDK or the AWS CLI.

AWS CLI Example

$ aws s3 cp PATH_TO_YOUR_LOCAL_FILE s3://S3_BUCKET_NAME/

Once the file is uploaded, you will need to record:

S3 Bucket Name
Uploaded File Name

Create the .NET Textract Application Project

With the file uploaded, let’s start to build out the app.

First let’s create the .NET “TextExtract” App using the .NET CLI. As you’ll notice, we are creating a .NET console app for this article.

$ dotnet new console -n TextExtract --use-program-main

Add Nuget Packages

Let’s now add the AWS .NET SDK dependencies.

First, let’s add the AWSSDK.Core package which is the main component of the .NET AWS SDK.

$ dotnet add package AWSSDK.Core

With the AWSSDK.Core package added, let’s now add the package that will allow us to work with Amazon Textract.

$ dotnet add package AWSSDK.Textract

Coding the .NET, Amazon Textract App

Encapsulate Amazon Textract

The first step is to create a class to encapsulate the Amazon Textract functionality. Let’s call it, “TextExtractor”.

public class TextExtractor
{

}

Let’s now build out the TextExtractor class.

First, let’s add our using statements to pull in our dependencies for Amazon Textract.

using Amazon.Textract;
using Amazon.Textract.Model;

Next, we’ll need an instance of the AmazonTextractClient in order to make calls through the SDK to the Amazon Textract API. For simplicity’s sake, we have created a field and instantiated the client in the constructor, but you can easily modify this to use your favorite dependency injection framework.

AmazonTextractClient _client;

public TextExtractor()
{
    _client = new AmazonTextractClient();
}
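
If you do move to dependency injection later, one possible shape is to accept the SDK’s IAmazonTextract interface (which AmazonTextractClient implements) through the constructor. This is only a minimal sketch and not part of the tutorial’s code; the class name TextExtractorWithInjection is hypothetical, chosen so that it does not clash with the TextExtractor class we are building.

```csharp
using System;
using Amazon.Textract;

// Hypothetical variant of TextExtractor that receives its client from the caller.
public class TextExtractorWithInjection
{
    private readonly IAmazonTextract _client;

    // Any IAmazonTextract implementation works here, including a test double.
    public TextExtractorWithInjection(IAmazonTextract client)
    {
        _client = client ?? throw new ArgumentNullException(nameof(client));
    }
}
```

A caller could then write new TextExtractorWithInjection(new AmazonTextractClient()), or register both types with a dependency injection container.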

We’ll also need two methods, “StartExtractionJob” and “Extract”.

Coding the StartExtractionJob method.

The StartExtractionJob method will have two parameters, “s3BucketName” and “s3FileName”, and will return the Amazon Textract Job Id as a string.

public async Task<string> StartExtractionJob(string s3BucketName, string s3FileName) {


}

The first thing we need to do when scanning a document for text using Amazon Textract is to start a job.

Let’s instantiate a StartDocumentTextDetectionRequest object. We’ll use the S3 data that we recorded earlier for the file that we need to scan.

var request = new StartDocumentTextDetectionRequest();

request.DocumentLocation = new DocumentLocation();

request.DocumentLocation.S3Object = new S3Object
{
    Bucket = s3BucketName,
    Name = s3FileName
};

Once we have the object created and our data set for the request, we’ll send the request and wait for the response from the .NET AWS SDK.

StartDocumentTextDetectionResponse startDocumentTextDetectionResponse = 
        await _client.StartDocumentTextDetectionAsync(request);

The last step in the StartExtractionJob method is to capture and return the response’s “JobId” for later use.

string jobId = startDocumentTextDetectionResponse.JobId;

return jobId;

Here’s the complete method.

public async Task<string> StartExtractionJob(string s3BucketName, string s3FileName){

    var request = new StartDocumentTextDetectionRequest();

    request.DocumentLocation = new DocumentLocation();
    request.DocumentLocation.S3Object = new S3Object
    {
        Bucket = s3BucketName,
        Name = s3FileName
    };

    StartDocumentTextDetectionResponse startDocumentTextDetectionResponse = 
            await _client.StartDocumentTextDetectionAsync(request);

    string jobId = startDocumentTextDetectionResponse.JobId;

    return jobId;
}

Coding the Extract method.

Now, with the job started, we need to check in on its progress. We’ll do that in the “Extract” method.

The Extract method takes one parameter, the Amazon Textract Job Id, and returns a comma-delimited string of the words that were found in the scanned document.

public async Task<String> Extract(string jobId){


}

We’ll start this method out like we did in the previous method and create a request. This time we will need to instantiate a GetDocumentTextDetectionRequest object providing the “JobId”.

We’ll also create a variable named, “returnString” to hold the text that has been extracted from the document or an error message.

var request = new GetDocumentTextDetectionRequest()
{
    JobId = jobId
};

string returnString = "";

With the request object complete, it’s time to poll Amazon Textract and check to see if the job is complete. We’ll do this by using a while loop and a boolean loop control variable named, “isJobComplete”. The while loop will stop iterating when isJobComplete has a value of true.

bool isJobComplete = false;

while (!isJobComplete)
{

}

For each iteration of the loop, a request will be sent to Amazon Textract, polling for the “JobStatus”. Once the JobStatus comes back as “SUCCEEDED”, isJobComplete is set to true and the loop ends.

GetDocumentTextDetectionResponse response = 
        await _client.GetDocumentTextDetectionAsync(request);

if (response.JobStatus == "SUCCEEDED")
{
    isJobComplete = true;
}

As for the JobStatus, in addition to “SUCCEEDED” it can also have the values “FAILED”, “IN_PROGRESS” and “PARTIAL_SUCCESS”. For this simple example, we will treat “SUCCEEDED”, “PARTIAL_SUCCESS”, and “FAILED” as final states and keep polling while the job is “IN_PROGRESS”. In a proper solution, every value should be handled explicitly.
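
One way to make that handling explicit is to map each status to the action the polling loop should take. The sketch below is illustrative only: PollAction, JobStatusHandling and Classify are hypothetical names that do not come from the SDK, and it assumes the status has already been converted to a plain string.

```csharp
using System;

// Hypothetical enum describing what the polling loop should do next.
public enum PollAction { ReturnResult, ReturnError, WaitAndRetry }

public static class JobStatusHandling
{
    // Maps a Textract job status string to the next action.
    public static PollAction Classify(string jobStatus) => jobStatus switch
    {
        "SUCCEEDED" => PollAction.ReturnResult,
        "PARTIAL_SUCCESS" => PollAction.ReturnError,
        "FAILED" => PollAction.ReturnError,
        "IN_PROGRESS" => PollAction.WaitAndRetry,
        // Unknown values are surfaced instead of being polled forever.
        _ => throw new ArgumentOutOfRangeException(
            nameof(jobStatus), jobStatus, "Unexpected job status")
    };
}
```

Throwing on an unrecognized value is a design choice for this sketch; it keeps an unexpected status from silently turning into an endless loop.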

When a job succeeds, we parse the response. In this example, we capture all the words that were found, and if no words were found, we return “No words found”.

Lastly, if the JobStatus is “IN_PROGRESS” rather than “SUCCEEDED”, “PARTIAL_SUCCESS”, or “FAILED”, we pause for 5 seconds and then poll again.

if (response.JobStatus == "SUCCEEDED")
{
    Console.WriteLine("Job Complete and Succeeded");

    String words = String.Join(", ",
        response.Blocks.Where(x => x.BlockType == "WORD")
            .Select(x => x.Text));

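    // String.Join returns an empty string (never null) when there are no words.
    returnString = string.IsNullOrEmpty(words) ? "No words found" : words;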
    isJobComplete = true;
}
else if (response.JobStatus == "FAILED")
{
    returnString = "Error: Job Failed";
    isJobComplete = true;
}
else if (response.JobStatus == "PARTIAL_SUCCESS")
{
    returnString = "Error: Job Incomplete";
    isJobComplete = true;
}
else if (response.JobStatus == "IN_PROGRESS")
{
    Console.WriteLine("Job In Progress");
    await Task.Delay(5000);
    isJobComplete = false;
}
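
The loop above keeps polling for as long as the job reports “IN_PROGRESS”. If you would rather cap the number of attempts, a small generic helper is one option. This is a sketch under our own assumptions: the Polling class, the UntilAsync method and its parameters are hypothetical names, not part of the AWS SDK.

```csharp
using System;
using System.Threading.Tasks;

public static class Polling
{
    // Calls 'poll' until 'isDone' accepts the result or the attempts run out.
    public static async Task<T> UntilAsync<T>(
        Func<Task<T>> poll,
        Func<T, bool> isDone,
        TimeSpan delay,
        int maxAttempts)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            T result = await poll();
            if (isDone(result))
            {
                return result;
            }

            // Wait between attempts, but not after the final one.
            if (attempt < maxAttempts)
            {
                await Task.Delay(delay);
            }
        }

        throw new TimeoutException($"Gave up after {maxAttempts} attempts.");
    }
}
```

With Textract, poll could be () => _client.GetDocumentTextDetectionAsync(request) and isDone could be r => r.JobStatus != "IN_PROGRESS", with a 5 second delay to match the loop above.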

The completed “Extract” method is below. Note that, in addition to checking the JobStatus, we wrap everything in a try/catch block to catch the “InvalidJobIdException”, if it arises. This exception is thrown when the JobId passed to the SDK is not recognized by Amazon Textract. We also catch any other exceptions and return a generic error message.

public async Task<String> Extract(string jobId)
{
    string returnString = "";

    var request = new GetDocumentTextDetectionRequest()
    {
        JobId = jobId
    };

    bool isJobComplete = false;

    while (!isJobComplete)
    {
        try
        {
            GetDocumentTextDetectionResponse response = 
                    await _client.GetDocumentTextDetectionAsync(request);

            if (response.JobStatus == "SUCCEEDED")
            {
                Console.WriteLine("Job Complete and Succeeded");

                String words = String.Join(", ",
                    response.Blocks.Where(x => x.BlockType == "WORD")
                        .Select(x => x.Text));

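                // String.Join returns an empty string (never null) when there are no words.
                returnString = string.IsNullOrEmpty(words) ? "No words found" : words;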
                isJobComplete = true;
            }
            else if (response.JobStatus == "FAILED")
            {
                returnString = "Error: Job Failed";
                isJobComplete = true;
            }
            else if (response.JobStatus == "PARTIAL_SUCCESS")
            {
                returnString = "Error: Job Incomplete";
                isJobComplete = true;
            }
            else if (response.JobStatus == "IN_PROGRESS")
            {
                Console.WriteLine("Job In Progress");
                await Task.Delay(5000);
                isJobComplete = false;
            }

        }
        catch (InvalidJobIdException)
        {
            returnString = "Error: Invalid Textract Job Id, " + jobId;
            isJobComplete = true;
        }
        catch (Exception)
        {
            returnString = "Error: Unspecified error type.";
            isJobComplete = true;
        }

    }

    return returnString;
}
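
The Extract method reads a single response, which is enough for our one-line test document. For longer documents, Amazon Textract returns results in pages and sets a NextToken on the response when more are available. The following is a minimal sketch of collecting the words from every page; the TextractPaging class and GetAllWordsAsync method are hypothetical names, and the sketch assumes the job has already finished successfully.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.Textract;
using Amazon.Textract.Model;

public static class TextractPaging
{
    // Collects the text of WORD blocks from every page of a finished job's results.
    public static async Task<List<string>> GetAllWordsAsync(
        IAmazonTextract client, string jobId)
    {
        var words = new List<string>();
        string? nextToken = null;

        do
        {
            var request = new GetDocumentTextDetectionRequest
            {
                JobId = jobId,
                NextToken = nextToken
            };

            GetDocumentTextDetectionResponse response =
                await client.GetDocumentTextDetectionAsync(request);

            // Blocks is treated as possibly null here as a precaution.
            words.AddRange((response.Blocks ?? new List<Block>())
                .Where(b => b.BlockType == "WORD")
                .Select(b => b.Text));

            // An empty token means there are no further pages.
            nextToken = response.NextToken;
        } while (!string.IsNullOrEmpty(nextToken));

        return words;
    }
}
```

The returned list could then be joined with String.Join(", ", words) to produce the same comma-delimited output as the Extract method.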

Implement Text Extraction in Program.cs

With the TextExtractor class complete, let’s now move over to the Program class where we will implement the TextExtractor methods that we just completed.

We’ll want this app to run asynchronously. For that, we’ll need to make one small adjustment.

Let’s open the Program.cs file and change the “Main” function definition to:

static async Task Main(string[] args)

Within that Main function, we’ll create two variables that we’ll use to hold the data needed for the app. These variable values will be passed in through the command line interface as parameters.

The first variable, s3BucketName, will be used to specify the S3 bucket that will contain the file that we’ll process using .NET and Amazon Textract.

The second variable, s3FileName, will be the actual file that will be processed within the aforementioned S3 bucket with .NET and Amazon Textract.

We’ll also create an instance of the TextExtractor class that we just coded. We’ll use this TextExtractor instance to execute the methods that we just developed, StartExtractionJob and Extract.

String s3BucketName = args[0];
String s3FileName = args[1];
TextExtractor textExtractor = new TextExtractor();
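
As noted earlier, this example skips input validation, so running the app without both arguments throws an exception when args[0] or args[1] is read. If you want a guard, something along these lines may be enough. The CommandLine class and TryReadArgs method are hypothetical names used only for this sketch.

```csharp
using System;

public static class CommandLine
{
    // Returns true and sets both outputs when exactly two non-empty arguments are given.
    public static bool TryReadArgs(
        string[] args, out string s3BucketName, out string s3FileName)
    {
        s3BucketName = "";
        s3FileName = "";

        if (args.Length != 2 ||
            string.IsNullOrWhiteSpace(args[0]) ||
            string.IsNullOrWhiteSpace(args[1]))
        {
            return false;
        }

        s3BucketName = args[0];
        s3FileName = args[1];
        return true;
    }
}
```

In Main, a false return value could be handled by printing a short usage message and returning before any call to Amazon Textract is made.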

The first thing we’ll do is create an Amazon Textract Job by calling the “StartExtractionJob” method on the TextExtractor instance and provide the method the values for the S3 bucket name and the name of the file to be scanned for text.

We’ll also capture the returned Amazon Textract Job Id using a variable named, “jobId”.

String jobId = await textExtractor.StartExtractionJob(s3BucketName, s3FileName);

Next, we’ll provide the JobId to the Extract method on the TextExtractor instance and we’ll capture the method’s return value in a variable named, “extractedText”.

And, the last part of this simple example is to output the list of extracted words to the console.

String extractedText = await textExtractor.Extract(jobId);

Console.WriteLine($"Extracted Text: {extractedText}");

Here’s the completed Program.cs file.

namespace TextExtract;

class Program
{
    static async Task Main(string[] args)
    {
        String s3BucketName = args[0];
        String s3FileName = args[1];
        TextExtractor textExtractor = new TextExtractor();

        String jobId = await textExtractor.StartExtractionJob(s3BucketName, s3FileName);

        if (string.IsNullOrEmpty(jobId))
        {
            Console.WriteLine("Failed to start extraction job.");
            return;
        }

        Console.WriteLine($"Started job with ID: {jobId}");

        String extractedText = await textExtractor.Extract(jobId);

        Console.WriteLine($"Extracted Text: {extractedText}");

    }
}

Testing the Amazon Textract .NET App

With the Amazon Textract .NET App coding complete, let’s give it a test.

From within the directory of the TextExtract .NET app, let’s run the following command at the CLI. Note the following values:

S3 Bucket Name: amazon-textract-test-253454567
File Name: textract-test-file234636.pdf

$ dotnet run -- amazon-textract-test-253454567 textract-test-file234636.pdf

Once complete, the CLI should contain something like the following:

Started job with ID: 70d0fac1db9656490e6da0d03e2a965c1a78c07eec5
Job In Progress
Job In Progress
Job Complete and Succeeded
Extracted Text: This, is, a, test

Summary

That concludes this tutorial, in which you learned how to:

Start an Amazon Textract Job.
Check the status of an Amazon Textract Job.
Extract text from a document using Amazon Textract.

Want to know more about the tech in this article? Check out these resources:

.NET CLI, .NET SDK, AWS .NET SDK, Amazon Textract