Starting point: a scanned PDF of county budgets for the fiscal year, with the data laid out in tables.
Used Amazon Textract to get the data out of the PDF.
You can select the extraction “features” you want based on your data. I chose table extraction.
The workflow: upload the PDF, choose a feature, process it, and download the results. The same extraction is available through the API if you want to use it in production.
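A rough sketch of what the API route could look like in Python with boto3. The function names `tables_from_blocks` and `analyze_page` are made up for these notes. It assumes AWS credentials and a region are already configured, and that each page is sent as a single image: the synchronous `analyze_document` call handles one page at a time, so a multi-page PDF would need to be split into pages first or run through the asynchronous API via S3. I have not run this against the budget PDF.

```python
from collections import defaultdict


def tables_from_blocks(blocks):
    """Rebuild each TABLE block as a list of rows of (text, confidence) cells."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # Textract links blocks through "CHILD" relationships.
        return [
            cid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
        ]

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        rows = defaultdict(dict)
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[w]["Text"]
                for w in child_ids(cell)
                if by_id[w]["BlockType"] == "WORD"
            ]
            rows[cell["RowIndex"]][cell["ColumnIndex"]] = (
                " ".join(words),
                cell["Confidence"],
            )
        # Order cells by row index, then column index.
        tables.append([[rows[r][c] for c in sorted(rows[r])] for r in sorted(rows)])
    return tables


def analyze_page(image_path):
    """Send one page image to Textract and return its tables (needs AWS credentials)."""
    import boto3  # imported here so the parsing helper above works without boto3

    client = boto3.client("textract")
    with open(image_path, "rb") as f:
        response = client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )
    return tables_from_blocks(response["Blocks"])
```

Keeping the parsing separate from the network call means the table logic can be checked against a saved response without paying for another page.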
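Cost is low: $1.50 per 1,000 pages, which works out to about $0.09 for a 60-page PDF. That rate matches Textract's plain text detection tier; table extraction is priced higher, so check the current pricing page.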
The download was about 60 CSV files for a 60-page PDF, roughly one file per page.
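With that many files, one option is to stack them into a single CSV before cleaning. A minimal sketch, assuming the downloaded files sit in one folder and have a number somewhere in their names (the `table-2.csv` style names in the test are hypothetical, as are the function names):

```python
import csv
import re
from pathlib import Path


def page_number(path):
    """Sort key: the last run of digits in the file name (0 if there is none)."""
    digits = re.findall(r"\d+", path.stem)
    return int(digits[-1]) if digits else 0


def combine_csvs(folder, out_path):
    """Append every CSV in `folder` to one file, prefixing rows with the source name."""
    out_path = Path(out_path)
    # Skip the output file itself in case it lives in the same folder.
    files = sorted(
        (p for p in Path(folder).glob("*.csv") if p.resolve() != out_path.resolve()),
        key=page_number,
    )
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in files:
            # utf-8-sig reads files with or without a byte order mark.
            with open(path, newline="", encoding="utf-8-sig") as f:
                for row in csv.reader(f):
                    if any(cell.strip() for cell in row):  # skip blank lines
                        writer.writerow([path.name] + row)
    return len(files)
```

Sorting on the number keeps page 2 ahead of page 10, which a plain alphabetical sort would not. The first column records which file each row came from, so a bad row can be traced back to its page.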
Each CSV contains the extracted table, followed by a second table of the same shape giving a confidence percentage for each cell's accuracy.
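The second table is handy for finding cells worth checking by hand. A sketch of one way to do that, under assumptions about the layout that I have not verified against every file: the value rows come first, then a row whose first cell contains the word "confidence", then the scores in the same row and column order. It also strips a leading apostrophe from cells in case the export adds one. Any title row above the values would shift the pairing and would need removing first. The function names and the 90% threshold are arbitrary choices.

```python
import csv
import io


def split_table_and_confidence(csv_text, marker="confidence"):
    """Return (value_rows, score_rows), split at the first row containing `marker`."""
    rows = [
        [cell.strip().lstrip("'") for cell in row]
        for row in csv.reader(io.StringIO(csv_text))
        if any(cell.strip() for cell in row)  # drop blank lines
    ]
    for i, row in enumerate(rows):
        if marker in row[0].lower():
            return rows[:i], rows[i + 1:]
    return rows, []  # no confidence section found


def low_confidence_cells(csv_text, threshold=90.0):
    """List (row, column, value, score) for cells scoring below `threshold` percent."""
    values, scores = split_table_and_confidence(csv_text)
    flagged = []
    for r, (value_row, score_row) in enumerate(zip(values, scores)):
        for c, (value, score) in enumerate(zip(value_row, score_row)):
            try:
                pct = float(score)
            except ValueError:
                continue  # empty or non-numeric score cell
            if pct < threshold:
                flagged.append((r, c, value, pct))
    return flagged
```

Row and column numbers in the result are zero-based positions within the value table.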
Had to re-enable WSL2 on Windows so I could run bash scripts on the data.