OCR and Elasticsearch with Ambar


#1

Hello, my name is Ilya, from the Ambar team. I'd be glad to help you with OCR and document search. Our self-hosted version is much more powerful than the cloud one; it can process billions of documents. We can help you with installation and onboarding.


OCR and table recognition
#2

Hi Ilya, nice to meet you!

First of all, let me describe our project in English, because all documents and discussions are actually in Italian… :slight_smile:

Three earthquakes hit central Italy over the last eight months, and now that the emergency is over we want to follow and monitor the rebuilding process in those areas. We are a non-profit organization, onData, with many active projects funded by larger NGOs, public grants, or the crowd. This is the public forum of "Ricostruzione Trasparente" (transparent reconstruction), where the working team coordinates its activities.

One of the actions we are working on focuses on the public acts about rebuilding published by municipalities and other local administrations. Every administration must publish these acts on its own website, so there are hundreds of sources to monitor. We want to scrape all of them and collect data and documents in a single database, i.e. Elasticsearch, to perform analysis and full-text search.

We have no problem with data published on web pages, but the whole act is often published as a PDF file (sometimes containing text, sometimes scanned images). We want to perform full-text search against the PDFs too, of course.

We already have a running instance of ES v5, in single-node mode during development, and your solution seems very interesting to us. I was investigating the Ingest Attachment plugin when I found you… :slight_smile:

We can definitely test Ambar for our purpose, but I have some questions:

  • can we use an external ES cluster?
  • how do you perform OCR?
  • all our documents are in Italian; could there be language-related issues?

Moreover, I know @lorenzo_perone has submitted some documents to the cloud version and ran into some trouble; I hope you can help him…

Thank you for your interest in our project, best!


#3

Hello @jenkin! Thank you for the brief.

can we use an external ES cluster?

Yes, no problem. The setting can be changed in the docker-compose file.

how do you perform OCR?

We use Tesseract internally with some tricky preprocessing. It supports Italian as well.
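Just to illustrate the idea (a rough sketch of the general approach, not our actual pipeline; it uses pytesseract and Pillow, the threshold value is made up, and it assumes the PDF page is already rasterized to an image):

import pytesseract
from PIL import Image

def ocr_italian(image_path, threshold=150):
    # grayscale, then a naive binarization; real preprocessing is trickier
    img = Image.open(image_path).convert("L")
    img = img.point(lambda p: 255 if p > threshold else 0)
    # "ita" selects Tesseract's Italian language pack
    return pytesseract.image_to_string(img, lang="ita")

print(ocr_italian("scanned_page.png"))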

all our documents are in Italian; could there be language-related issues?

Nope; we can provide you with Italian language settings for our ES index (stemming, fuzzy search).
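For example (an illustrative sketch, not our exact settings: ES 5 ships a built-in italian analyzer with stemming and stop words, and the index/type names here are made up):

import requests

# map a "content" field to Elasticsearch's built-in Italian analyzer
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "content": {"type": "text", "analyzer": "italian"}
            }
        }
    }
}
requests.put("http://localhost:9200/test_it", json=mapping).raise_for_status()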

I know @lorenzo_perone has submitted some documents to the cloud version and ran into some trouble; I hope you can help him…

Our cloud was overloaded with user requests; that's why @lorenzo_perone's documents were stuck in the processing queue.

@jenkin, where do you store your documents? Is it FTP, SMB, or something else?

P.S. @lorenzo_perone, if your documents still aren't ready for search, please write to us in the help chat in Ambar Cloud.


#4

Now it works like a charm :slight_smile:


#5

Nice to hear it from you! :slight_smile:


#6

Hi @ambar,

I’ve tested the self-hosted version, here is my story… :slight_smile:

Installation went fine, but I hit some port conflicts on startup: my server already runs haproxy, apache2, and elasticsearch, so I tuned config.json to change the default ports and hosts. Unfortunately the ES port is hard-coded in the docker-compose template, so I changed lines 14 and 66 and added a line to ambar.py to handle a new ES_PORT parameter. Uh, check this line, it's duplicated!
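For anyone with the same conflict, my change boils down to something like this (a hypothetical sketch; the placeholder name and config key are mine, not Ambar's actual code):

import json

with open("config.json") as f:
    es_port = json.load(f)["es"].get("port", "9200")

with open("docker-compose.template.yml") as f:
    template = f.read()

# assumes the template was edited to use an ${ES_PORT} placeholder
with open("docker-compose.yml", "w") as f:
    f.write(template.replace("${ES_PORT}", es_port))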

To serve the frontend I'm using a proxy on the server: services listening on localhost at custom ports are exposed on a simple domain, http://ambar.ricostruzionetrasparente.it/. There is a problem with AJAX calls to the API; maybe I misunderstood the local and external configuration in config.json.

Anyway, now I want to understand how to use my own ES instance instead of the embedded one, with all the needed configuration and mappings. You can see our current mapping here. In the enclosure type there is a content field where we would like to store the text extracted from the PDFs.

Is it possible? Can you help us step by step in this task?

Many thanks!


#7

Hi @jenkin

Oh, looks like you've done a great job setting up Ambar :slight_smile:
And thanks for pointing out the duplicated line in ambar.py, we'll fix it ASAP!

To serve the frontend I'm using a proxy on the server: services listening on localhost at custom ports are exposed on a simple domain, http://ambar.ricostruzionetrasparente.it/. There is a problem with AJAX calls to the API; maybe I misunderstood the local and external configuration in config.json.

It looks like your proxy doesn't properly handle requests to the Ambar API.
GET http://apiambar.ricostruzionetrasparente.it/api should return a JSON with the API config, but in your case it doesn't respond at all. Could you please send me your config.json?
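In the meantime, here's a quick sanity check you can run (plain Python, nothing Ambar-specific):

import requests

# the API root should answer with its JSON config
r = requests.get("http://apiambar.ricostruzionetrasparente.it/api", timeout=10)
print(r.status_code, r.json())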

Anyway, now I want to understand how to use my own ES instance instead of the embedded one, with all the needed configuration and mappings. You can see our current mapping here. In the enclosure type there is a content field where we would like to store the text extracted from the PDFs.

Is it possible? Can you help us step by step in this task?

Sure, we’ll help you with this.
It's possible to use a separate ES instance, but it's quite impossible to make Ambar work with custom types and mappings in ES, since Ambar's core logic is built around them. I think it's better for you to use the Ambar API to integrate it with your current setup. Yesterday we posted about this on our blog; check it out here.

Looking forward to your response. If you have any further questions - please ask!



#8

Thank you very much. You can find my current config.json below (external domains are managed by HAProxy and proxied to localhost services); next week I'll come back to set up this part.

I understand the constraints on custom types, but we are indexing documents (called items) with attachments (called enclosures), so we have a simple tree structure that I planned to represent in ES with parent-child relations (enclosure --|is child of|--> item). Referring to the type names in our current mapping, the enclosure type could be replaced by your custom type, but then the item relation would live only in your meta.source_id (because the _parent field is defined on child documents, not on parent ones).
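To make it concrete, here is a trimmed sketch of the mapping I have in mind (ES 5 syntax; the index name and field lists are simplified):

import requests

mapping = {
    "mappings": {
        "item": {
            "properties": {"title": {"type": "text"}}
        },
        "enclosure": {
            "_parent": {"type": "item"},  # enclosure is a child of item
            "properties": {"content": {"type": "text", "analyzer": "italian"}}
        }
    }
}
requests.put("http://localhost:9200/rt", json=mapping).raise_for_status()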

Our final goal is to search for items using queries on their enclosures too (the ES has_child query). I'd prefer to avoid an explicit join after the query, performed at the application level, i.e. searching for enclosures, collecting the source_id fields, querying for items by source_id, joining the datasets, and rebuilding the item -> enclosure tree.
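With the parent-child mapping above, a single has_child query would return the items directly (a sketch, with a made-up search term):

import requests

# find items whose child enclosures mention a term, in one query
query = {
    "query": {
        "has_child": {
            "type": "enclosure",
            "query": {"match": {"content": "ricostruzione"}}
        }
    }
}
r = requests.post("http://localhost:9200/rt/item/_search", json=query)
print(r.json()["hits"]["total"])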

Of course we could keep the Ambar stack as a standalone service: at index time, enclosure documents are created with our mapping but with empty content; then an asynchronous process sends new PDFs to Ambar and updates the enclosures with the parsed results when they're ready. But the parsed results can be complex, so I'd have to replicate their structure to keep the two indexes, the Ambar one and the enclosure one, in sync. Hmm…

{
    "ocr": {
        "pdfMaxPageCount": 5000,
        "pdfSymbolsPerPageThreshold": 100
    },
    "es": {
        "containerSize": "2g",
        "heapSize": "1g",
        "port": "8200"
    },
    "dataPath": "/opt/ambar/data",
    "api": {
        "crawlerCount": 1,
        "local": {
            "host": "localhost",
            "protocol": "http",
            "port": "8184"
        },
        "showFilePreview": "false",
        "cacheSize": "1g",
        "pipelineCount": 1,
        "auth": "none",
        "defaultLangAnalyzer": "ambar_it",
        "analyticsToken": "cda4b0bb11a1f32aed7564b08c455992",
        "external": {
            "host": "apiambar.ricostruzionetrasparente.it",
            "protocol": "http",
            "port": "80"
        },
        "mode": "ce"
    },
    "dropbox": {
        "clientId": "",
        "redirectUri": ""
    },
    "dockerRepo": "ambar",
    "fe": {
        "local": {
            "host": "localhost",
            "protocol": "http",
            "port": "8181"
        },
        "external": {
            "host": "ambar.ricostruzionetrasparente.it",
            "protocol": "http",
            "port": "80"
        }
    },
    "db": {
        "cacheSizeGb": 2
    },
    "dockerComposeTemplate": "https://static.ambar.cloud/docker-compose.template.yml"
}

#9

I wouldn't even consider changing the Ambar mappings, since it would definitely break the core logic.

I think this asynchronous content parsing with Ambar can be the solution. You can find a basic description of the Ambar API here. If you have any questions about using it - please ask us.

Your config seems OK; I think in this case it's the proxy setup that's causing the problem.


#10

Dear @ambar,

eureka, it runs! :slight_smile: But with some issues… :frowning:

Here is the website: http://archivia.ricostruzionetrasparente.it. At the end of this message you can find the config.json (I added my mod for the custom ES port). In the last release you removed the local settings; now my proxy serves the Ambar frontend on the archivia subdomain and the Ambar API on the api.archivia subdomain. BUT the same services are also served directly by Ambar on ports 8181 and 8184… If I use localhost in both external settings, all browser calls break. You have also added a proxy component to the docker template file; maybe I have to tune those settings?

Anyway, using the GUI and manually uploading files, Ambar works like a charm, but I have found some problems using the API. Here is the most common error: cannot read property 'toLowerCase' of undefined.

Download calls work as expected, but a POST call to the api/files endpoint raises the same error: curl -XPOST localhost:8184/api/files/Test/6a0c2f5a-e03a-3869-bdea-9f860719885d.pdf -H 'content-type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW' -F file=@/path/to/file/6a0c2f5a-e03a-3869-bdea-9f860719885d.pdf

Thank you!

{
    "api": {
        "external": {
            "host": "api.archivia.ricostruzionetrasparente.it",
            "protocol": "http",
            "port": "8184"
        },
        "analyticsToken": "cda4b0bb11a1f32aed7564b08c455992",
        "auth": "basic",
        "mode": "ce",
        "defaultLangAnalyzer": "ambar_it",
        "pipelineCount": 1,
        "crawlerCount": 1,
        "cacheSize": "1g",
        "showFilePreview": "false"
    },
    "db": {
        "cacheSizeGb": 2
    },
    "fe": {
        "external": {
            "host": "archivia.ricostruzionetrasparente.it",
            "protocol": "http",
            "port": "8181"
        }
    },
    "es": {
        "external": {
            "host": "localhost",
            "protocol": "http",
            "port": "8200"
        },
        "heapSize": "1g",
        "containerSize": "2g"
    },
    "ocr": {
        "pdfMaxPageCount": 5000,
        "pdfSymbolsPerPageThreshold": 100
    },
    "dropbox": {
        "clientId": "",
        "redirectUri": ""
    },
    "dockerRepo": "ambar",
    "dockerComposeTemplate": "",
    "dataPath": "/opt/ambar"
}

#11

Hello @jenkin

Nice to hear you’ve got it working!

As for your questions:

Are you running Ambar and the proxy on the same machine? The best practice is to run Ambar in a safe environment and proxy requests to it from a separate "border" server; that would solve your issue.
Btw, you can now run both the API and the frontend on the same port, so you'd only have to proxy one port to Ambar. Just specify the same port and host for both api and fe in your config.json.

Anyway, using the GUI and manually uploading files, Ambar works like a charm, but I have found some problems using the API. Here is the most common error: cannot read property 'toLowerCase' of undefined.

Download calls work as expected, but a POST call to the api/files endpoint raises the same error: curl -XPOST localhost:8184/api/files/Test/6a0c2f5a-e03a-3869-bdea-9f860719885d.pdf -H 'content-type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW' -F file=@/path/to/file/6a0c2f5a-e03a-3869-bdea-9f860719885d.pdf

Looks like you're not sending authentication headers to the API; you should send the ambar-email and ambar-email-token headers with every request. You can get the token by logging into Ambar. The best way to understand how it works is to watch the frontend's calls in your browser (login and other actions); it's pretty simple.
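In Python your upload would look roughly like this (the header names are the ones above; the email, token, and path are placeholders):

import requests

headers = {
    "ambar-email": "you@example.com",
    "ambar-email-token": "<token from your Ambar login>",
}
with open("/path/to/file/6a0c2f5a-e03a-3869-bdea-9f860719885d.pdf", "rb") as f:
    r = requests.post(
        "http://localhost:8184/api/files/Test/6a0c2f5a-e03a-3869-bdea-9f860719885d.pdf",
        headers=headers,
        files={"file": f},  # requests builds the multipart body and boundary
    )
print(r.status_code)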

However, cannot read property 'toLowerCase' of undefined is a bug (the response code should be 401 instead of 500); we'll fix it in the next release on May 11th.


#12

Wow, it works! But now this tutorial is quite misleading… :wink:

Now I'm working on a simple Python wrapper for the Ambar API; I'll publish it on GitHub in the next few days. But I have a few more questions.

  • Can I retrieve a single document via the API, i.e. knowing the file name or the file hash? Or do I have to query ES directly using term or ids queries (with the sha256 checksum of the file)?
  • In ES you arrange documents in different indices, I think one for each source_id (i.e. "Default" and so on); what function do you use to calculate the index name from the source_id?
  • Is it possible to download the pdf or txt using only the filename or hash, or do I really need the download_uri?

Thank you again!


#13

Fixed!

Great; you can use apiproxy.py from the Ambar Crawler to see how some of the methods are implemented.

It depends: do you need the source (binary) or the parsed content and meta? The source is stored in MongoDB's GridFS, and the parsed content is stored in ES.

For now, the only way to retrieve sources is the 'download by secure uri' method; it doesn't require authentication headers, to allow direct opening in the browser. It's not safe to let anyone download sources just by the SHA or the file name; that's why we implemented this 'secure uri' thing.

If you need to retrieve parsed content and meta, the only way to do that now is via the search method.

But! In the next release (next week) we'll add two new methods to the API (both requiring auth headers):

  • download source by sha
  • download json with parsed content and meta by sha or meta id

Nope. In cloud mode (which is only used in our Ambar Cloud, http://app.ambar.cloud) it's one index per user.
In CE or EE mode it's a single index for all users and sources. The index name is calculated from the default user; in your case it should be ambar_file_data_d033e22ae348aeb5660fc2140aec35850c4da997
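The suffix appears to be simply the SHA-1 of the user name; at least, sha1 of "admin" reproduces the value above (a quick check in Python; the derivation is inferred from the value, not from documented behavior):

import hashlib

user = "admin"  # default user
print("ambar_file_data_" + hashlib.sha1(user.encode()).hexdigest())
# -> ambar_file_data_d033e22ae348aeb5660fc2140aec35850c4da997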

See previous answer.

Thank you for your replies, by the way! It's really important for us to have live communication with our users, especially with engineers like you.


#14

Uh, great news! :smiley:

With such a simple method to retrieve the parsed content of single documents from Ambar, I can keep my original database (with parent-child relationships between items and attachments) in sync, using the SHA as a foreign key… :slight_smile:

Right now I have a process that harvests and indexes documents from several sources (RSS feeds with items and enclosures) and downloads the enclosures (usually PDFs). Next I want another process that takes the not-yet-parsed documents, submits them to Ambar, picks up the parsed content, and updates the original documents. All asynchronously, of course.

I’ll keep you up to date, thanks!


#15

We’ve added these methods!

The file upload method (POST /api/files/:sourceId/:filename) now returns a JSON with a metaId value.

The new methods are (all require authentication headers to be passed):

  • GET /api/files/direct/:metaId/source - returns the source file by metaId
  • GET /api/files/direct/:metaId/text - returns the parsed text by metaId
  • GET /api/files/direct/:metaId/meta - returns a JSON object with the metadata by metaId

The methods return 404 in case the file is not found or not processed yet.
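Putting it together, the asynchronous flow discussed above could look roughly like this (a sketch; the base URL, credentials, file names, and polling interval are placeholders):

import time
import requests

BASE = "http://localhost:8184/api"
HEADERS = {"ambar-email": "you@example.com", "ambar-email-token": "<token>"}

def submit(source_id, name, path):
    # upload the file; the response JSON carries the metaId
    with open(path, "rb") as f:
        r = requests.post(f"{BASE}/files/{source_id}/{name}",
                          headers=HEADERS, files={"file": f})
    r.raise_for_status()
    return r.json()["metaId"]

def wait_for_text(meta_id, delay=30):
    # 404 means "not found or not processed yet", so keep polling
    while True:
        r = requests.get(f"{BASE}/files/direct/{meta_id}/text", headers=HEADERS)
        if r.status_code != 404:
            r.raise_for_status()
            return r.text
        time.sleep(delay)

meta_id = submit("Test", "act.pdf", "/path/to/act.pdf")
print(wait_for_text(meta_id)[:200])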

Just run sudo ./ambar.py update to enable this functionality in your Ambar.


#16

Wow! Here is the first draft of the wrapper: https://github.com/RicostruzioneTrasparente/ambar_python_client :slight_smile: