How to Set Up Tabula to Extract PDF Tables Using Docker

Tabula is an open-source tool for extracting data tables from PDF files.

I've tried many cloud-based apps for extracting tables from PDFs, and so far nothing has been as good as Tabula 🔥

You can also save extraction templates so that you can reuse them on similar PDFs.

Here is how to set up Tabula using Docker:

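# Base image that provides the Java 8 runtime for Tabula's jar
FROM openjdk:8

# Tabula release to download
ENV TABULA_VERSION=1.2.1

# Download the release archive, unpack it, and delete the zip
RUN wget -q https://github.com/tabulapdf/tabula/releases/download/v$TABULA_VERSION/tabula-jar-$TABULA_VERSION.zip && \
    unzip tabula-jar-$TABULA_VERSION.zip && \
    rm tabula-jar-$TABULA_VERSION.zip

# Port used by Tabula's web interface
EXPOSE 8080

# Start Tabula with UTF-8 file encoding and a 256 MB to 1 GB heap
CMD ["java", "-Dfile.encoding=utf-8", "-Xms256M", "-Xmx1024M", "-jar", "tabula/tabula.jar"]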

You can now host this image on a cloud provider such as Railway.
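
To try the image locally before deploying it, something like the following sketch should work. The image tag `tabula-local` and the host port are placeholder choices of mine, not part of Tabula, and the script assumes it is run from the directory that contains the Dockerfile and that Tabula accepts connections from outside the container.

```shell
# Hypothetical image tag and host port; change them to taste.
IMAGE_TAG="tabula-local"
HOST_PORT=8080
TABULA_URL="http://localhost:${HOST_PORT}"

# Build and start only when Docker and the Dockerfile are both available.
if command -v docker >/dev/null 2>&1 && [ -f Dockerfile ]; then
  # Build the image from the Dockerfile in the current directory.
  docker build -t "$IMAGE_TAG" .
  # -d runs the container in the background, --rm removes it once it stops,
  # and -p maps the host port to the container's port 8080.
  docker run -d --rm -p "${HOST_PORT}:8080" "$IMAGE_TAG"
  echo "Tabula should be reachable at $TABULA_URL"
else
  echo "Docker or the Dockerfile is missing; nothing was built."
fi
```

If port 8080 is already taken on your machine, change only `HOST_PORT`; the container side of the mapping stays at 8080 to match the `EXPOSE` line.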

Happy data extracting!