{
"cells": [
{
"cell_type": "code",
"execution_count": 4,
"id": "905f9cad",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# flake8: noqa\n",
"import warnings\n",
"import os\n",
"\n",
"# Suppress noisy requests warnings.\n",
"warnings.filterwarnings(\"ignore\")\n",
"os.environ[\"PYTHONWARNINGS\"] = \"ignore\""
]
},
{
"cell_type": "markdown",
"id": "6945c179",
"metadata": {},
"source": [
"# Scaling OCR using Ray Datasets\n",
"\n",
"In this example, we will show you how to run optical character recognition (OCR) on a set of documents and analyze the resulting text with the natural language processing library [spaCy](https://spacy.io/). Running OCR on a large dataset is very computationally expensive, so using Ray for distributed processing can really speed up the analysis. Ray Datasets makes it easy to compose the different steps of the pipeline, namely the OCR and the natural language processing. Ray Datasets' actor support also allows us to be more efficient by sharing the spaCy NLP context between several datapoints.\n",
"\n",
"To make it more interesting, we will run the analysis on the [LightShot](https://www.kaggle.com/datasets/datasnaek/lightshot) dataset. It is a large publicly available OCR dataset with a wide variety of different documents, all of them screenshots of various forms. It is easy to replace that dataset with your own data and adapt the example to your own use cases!\n",
"\n",
"## Overview\n",
"\n",
"This tutorial will cover:\n",
" - Creating a Ray Dataset that represents the images in the dataset\n",
" - Running the computationally expensive OCR process on each image in the dataset in parallel\n",
" - Filtering the dataset by keeping only images that contain text\n",
" - Performing various NLP operations on the text\n",
"\n",
"## Walkthrough\n",
"\n",
"Let's start by preparing the dependencies and downloading the dataset. First we install the OCR software `tesseract` and its Python client:\n",
"\n",
"````{tabbed} macOS\n",
"```\n",
"brew install tesseract\n",
"pip install pytesseract\n",
"```\n",
"````\n",
"\n",
"````{tabbed} linux\n",
"```\n",
"sudo apt-get install tesseract-ocr\n",
"pip install pytesseract\n",
"```\n",
"````\n",
"\n",
"By default, the following example will run on a tiny dataset we provide. If you want to run it on the full dataset, we recommend to run it on a cluster since processing all the images with tesseract takes a lot of time.\n",
"\n",
"````{note}\n",
"If you want to run the example on the full [LightShot](https://www.kaggle.com/datasets/datasnaek/lightshot) dataset, you need to download the dataset and extract it. You can extract the dataset by first running `unzip archive.zip` and then `unrar x LightShot13k.rar .` and then you can upload the dataset to S3 with `aws s3 cp LightShot13k/ s3://<bucket>/<folder> --recursive`.\n",
"````"
]
},
{
"cell_type": "markdown",
"id": "c08612ac",
"metadata": {},
"source": [
"Let's now import Ray and initialize a local Ray cluster. If you want to run OCR at a very large scale, you should run this workload on a multi-node cluster."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "37f22aa8",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-07-04 14:35:19,444\tINFO services.py:1476 -- View the Ray dashboard at \u001b[1m\u001b[32mhttp://127.0.0.1:8265\u001b[39m\u001b[22m\n"
]
},
{
"data": {
"text/plain": [
"RayContext(dashboard_url='127.0.0.1:8265', python_version='3.7.4', ray_version='1.13.0', ray_commit='e4ce38d001dbbe09cd21c497fedd03d692b2be3e', address_info={'node_ip_address': '127.0.0.1', 'raylet_ip_address': '127.0.0.1', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2022-07-04_14-35-16_950060_89285/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-07-04_14-35-16_950060_89285/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2022-07-04_14-35-16_950060_89285', 'metrics_export_port': 60416, 'gcs_address': '127.0.0.1:61663', 'address': '127.0.0.1:61663', 'node_id': 'b6c981243d51558d13e4290f0f63552a6126f8a8d9e472baafe9dd5b'})"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Import ray and initialize a local Ray cluster.\n",
"import ray\n",
"ray.init()"
]
},
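{
"cell_type": "markdown",
"id": "0a1b2c3d",
"metadata": {},
"source": [
"````{note}\n",
"If you already have a running multi-node cluster, you can connect to it instead of starting a local one, for example with\n",
"```python\n",
"# Attach to the existing cluster this node belongs to.\n",
"ray.init(address=\"auto\")\n",
"```\n",
"````"
]
},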
{
"cell_type": "markdown",
"id": "ee90daa8",
"metadata": {},
"source": [
"### Running the OCR software on the data\n",
"\n",
"We can now use the {meth}`ray.data.read_binary_files <ray.data.read_binary_files>` function to read all the images from S3. We set the `include_paths=True` option to create a dataset of the S3 paths and image contents. We then run the {meth}`ds.map <ray.data.Dataset.map>` function on this dataset to execute the actual OCR process on each file and convert the screen shots into text. This will create a tabular dataset with columns `path` and `text`, see also [](transform_datasets_row_output_types).\n",
"\n",
"````{note}\n",
"If you want to load the data from a private bucket, you have to run\n",
"```python\n",
"import pyarrow.fs\n",
"\n",
"ds = ray.data.read_binary_files(\"s3://<bucket>/<folder>\",\n",
" include_paths=True,\n",
" filesystem=pyarrow.fs.S3FileSystem(\n",
" access_key=\"...\",\n",
" secret_key=\"...\",\n",
" session_token=\"...\"))\n",
"```\n",
"````"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d31d3303",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-07-04 14:35:53,683\tWARNING read_api.py:256 -- The number of blocks in this dataset (3) limits its parallelism to 3 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.\n",
"Read->Map: 100%|██████████| 3/3 [00:07<00:00, 2.34s/it]\n"
]
}
],
"source": [
"from io import BytesIO\n",
"from PIL import Image\n",
"import pytesseract\n",
"\n",
"def perform_ocr(data):\n",
" path, img = data\n",
" return {\n",
" \"path\": path,\n",
" \"text\": pytesseract.image_to_string(Image.open(BytesIO(img)))\n",
" }\n",
"\n",
"ds = ray.data.read_binary_files(\n",
" \"s3://anonymous@air-example-data/ocr_tiny_dataset\",\n",
" include_paths=True)\n",
"\n",
"results = ds.map(perform_ocr)"
]
},
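{
"cell_type": "markdown",
"id": "1b2c3d4e",
"metadata": {},
"source": [
"````{note}\n",
"Depending on your images, it can be worth tuning tesseract. As a minimal sketch, `pytesseract.image_to_string` accepts a `lang` argument and a `config` string that is passed through to tesseract; the hypothetical variant `perform_ocr_tuned` below uses illustrative values, not recommendations:\n",
"```python\n",
"def perform_ocr_tuned(data):\n",
"    path, img = data\n",
"    return {\n",
"        \"path\": path,\n",
"        # \"--psm 6\" selects a page segmentation mode that assumes a single\n",
"        # uniform block of text; experiment with what works for your data.\n",
"        \"text\": pytesseract.image_to_string(\n",
"            Image.open(BytesIO(img)), lang=\"eng\", config=\"--psm 6\")\n",
"    }\n",
"```\n",
"````"
]
},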
{
"cell_type": "markdown",
"id": "e22e7cd7",
"metadata": {},
"source": [
"Let us have a look at some of the data points with the {meth}`take <ray.data.Dataset.take>` function."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "5518b831",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[ArrowRow({'path': 'air-example-data/ocr_tiny_dataset/gnome_screenshot.png',\n",
" 'text': '= Cancel\\n\\nTake Screenshot\\n© Grab the whole screen\\n\\nGrab the current window\\n\\n|_| eeeeeeter\\n\\nGrab after a delay of 0\\n\\nEffects\\nInclude pointer\\n\\n¥ Include the window border\\n\\nApply effect: None Sa\\n\\n+. seconds\\n'}),\n",
" ArrowRow({'path': 'air-example-data/ocr_tiny_dataset/miranda_screenshot.png',\n",
" 'text': '© Viktor (Online) : Message Session\\n\\n“etto| © Whter | steno\\n\\nremus\\ntet? Fiviha\\n\\n17: dokonca to vie aj video @\\nViktor\\n\\n1818. 55 samozrejme\\n\\n1818: len moj brat to skusal\\nremus\\n\\nWA\\n\\n098003 —\\n\\nseettsgmailcom [0]\\n\\nonline\\n\\nHacemen\\n@ Ce\\n\\nieFFo\\n169 6 je <>vin ©®\\n\\nBe 22\\n\\naway\\n\\nTue\\nhn\\n\\n& Wee\\n\\nYep, Tm here\\n\\n&\\nea\\na\\nLS]\\n\\n'}),\n",
" ArrowRow({'path': 'air-example-data/ocr_tiny_dataset/qemu_screenshot.png',\n",
" 'text': 'File Edit View Bookmarks\\n\\n[i New Tab [If] split view ~\\n\\n43044 kousekip\\n\\nPlugins\\n\\nkousekip:ako-kaede-mirai(htop)\\n\\nkousekip:ako-kaede-mirai(qemu-system-x86)\\n\\nSettings\\n\\nHelp\\n\\nkousekip:ako-kaede-miral(htop) — Konsole vax\\n\\nFl Paste Q Find\\n\\nEMU vax\\n\\nMachine View\\n\\nApplications Places System @)C) Fri Feb 18, 13:56\\n\\nTerminal\\n\\nroot root\\nroot sys\\nroot sys\\nroot sys\\nroot sys\\nroot sys\\nroot root\\nroot sys\\nroot bin\\nroot root\\nroot sys\\nroot root\\nroot sys\\nroot sys\\nroot root\\nroot root\\nroot root\\nroot sys\\nroot root\\nroot sys\\nroot sys\\n2 root —sys\\nkousekip@ako-kaede-mirai-sun:~$ If\\n\\nbin -> ./usr/bin\\nboot\\ndev\\ndevices\\netc\\nexport\\nhome\\nkernel\\nlib\\nmedia\\nmnt\\n\\nnet\\nopt\\nplatform\\nproc\\nroot\\nrpool\\nsbin\\nsystem\\ntmp\\nusr\\nvar\\n\\n@kousekip\\nidesktop\\n\\n©\\n\\n©\\n\\nBUNwnSunennh SnuNaeon\\n\\n(Documents\\nDownloads\\nGaMusic\\n\\n5\\n\\nBitrash\\nDevices\\n(Floppy Drive\\nNetwork\\n\\n@ Browse Netw...\\n\\n9\\n9\\n6\\n4\\n9\\n\\n53\\n5\\n6\\n4\\n9\\n10\\n0\\n6\\n18\\n7\\n\\nfovey\\\\aliarel(elare)\\n\\n'})]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results.take(10)"
]
},
{
"cell_type": "markdown",
"id": "67ed5a8d",
"metadata": {},
"source": [
"### Saving and loading the result of the OCR run\n",
"\n",
"````{note}\n",
"Saving the dataset is optional, you can also continue with the in-memory data without persisting it to storage.\n",
"````\n",
"\n",
"We can save the result of running tesseract on the dataset on disk so we can read it out later if we want to re-run the NLP analysis without needing to re-run the OCR (which is very expensive on the whole dataset). This can be done with the {meth}`write_parquet <ray.data.Dataset.write_parquet>` function:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7c2d8abe",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Write Progress: 100%|██████████| 3/3 [00:00<00:00, 207.11it/s]\n"
]
}
],
"source": [
"import os\n",
"results.write_parquet(os.path.expanduser(\"~/LightShot13k_results\"))"
]
},
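{
"cell_type": "markdown",
"id": "2c3d4e5f",
"metadata": {},
"source": [
"````{note}\n",
"On a multi-node cluster, the driver's local filesystem is usually not the right place to persist results. `write_parquet` also accepts S3 URIs, so you can write to shared storage instead (the bucket and folder below are placeholders):\n",
"```python\n",
"results.write_parquet(\"s3://<bucket>/<folder>\")\n",
"```\n",
"````"
]
},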
{
"cell_type": "markdown",
"id": "7a387f42",
"metadata": {},
"source": [
"You can later reload the dataset with the {meth}`read_parquet <ray.data.read_parquet>` function:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "af63be93",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-07-04 14:36:13,515\tWARNING read_api.py:256 -- The number of blocks in this dataset (6) limits its parallelism to 6 concurrent tasks. This is much less than the number of available CPU slots in the cluster. Use `.repartition(n)` to increase the number of dataset blocks.\n"
]
}
],
"source": [
"results = ray.data.read_parquet(os.path.expanduser(\"~/LightShot13k_results\"))"
]
},
{
"cell_type": "markdown",
"id": "f6a7bf0f",
"metadata": {},
"source": [
"### Process the extracted text data with spaCy\n",
"\n",
"This is the part where the fun begins. Depending on your task there will be different needs for post processing, for example:\n",
"- If you are scanning books or articles you might want to separate the text out into sections and paragraphs.\n",
"- If you are scanning forms, receipts or checks, you might want to extract the different items listed, as well as extra information for those items like the price, or the total amount listed on the receipt or check.\n",
"- If you are scanning legal documents, you might want to extract information like the type of document, who is mentioned in the document and more semantic information about what the document claims.\n",
"- If you are scanning medical records, you might want to extract the patient name and the treatment history.\n",
"\n",
"In our specific example, let's try to determine all the documents in the LightShot dataset that are chat protocols and extract named entities in those documents. We will extract this data with spaCy. Let's first make sure the libraries are installed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69321ee3",
"metadata": {},
"outputs": [],
"source": [
"!pip install \"spacy>=3\"\n",
"!python -m spacy download en_core_web_sm\n",
"!pip install spacy_langdetect"
]
},
{
"cell_type": "markdown",
"id": "b01d2add",
"metadata": {},
"source": [
"This is some code to determine the language of a piece of text:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "ee4cc430",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'language': 'en', 'score': 0.9999976594668697}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import spacy\n",
"from spacy.language import Language\n",
"from spacy_langdetect import LanguageDetector\n",
"\n",
"nlp = spacy.load('en_core_web_sm')\n",
"\n",
"@Language.factory(\"language_detector\")\n",
"def get_lang_detector(nlp, name):\n",
" return LanguageDetector()\n",
"\n",
"nlp.add_pipe('language_detector', last=True)\n",
"nlp(\"This is an English sentence. Ray rocks!\")._.language"
]
},
{
"cell_type": "markdown",
"id": "95ab0646",
"metadata": {},
"source": [
"It gives both the language and a confidence score for that language.\n",
"\n",
"In order to run the code on the dataset, we should use Ray Datasets' built in support for actors since the `nlp` object is not serializable and we want to avoid having to recreate it for each individual sentence. We also batch the computation with the {meth}`map_batches <ray.data.Dataset.map_batches>` function to ensure spaCy can use more efficient vectorized operations where available:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "85a4a414",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Read progress: 100%|██████████| 6/6 [00:00<00:00, 485.55it/s]\n",
"Map Progress (1 actors 1 pending): 100%|██████████| 6/6 [00:06<00:00, 1.04s/it]\n"
]
},
{
"data": {
"text/plain": [
"Dataset(num_blocks=6, num_rows=6, schema={path: object, text: object, language: object, score: float64})"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import spacy\n",
"from spacy.language import Language\n",
"from spacy_langdetect import LanguageDetector\n",
"\n",
"class SpacyBatchInference:\n",
" def __init__(self):\n",
" self.nlp = spacy.load('en_core_web_sm')\n",
"\n",
" @Language.factory(\"language_detector\")\n",
" def get_lang_detector(nlp, name):\n",
" return LanguageDetector()\n",
"\n",
" self.nlp.add_pipe('language_detector', last=True)\n",
"\n",
" def __call__(self, df):\n",
" docs = list(self.nlp.pipe(list(df[\"text\"])))\n",
" df[\"language\"] = [doc._.language[\"language\"] for doc in docs]\n",
" df[\"score\"] = [doc._.language[\"score\"] for doc in docs]\n",
" return df\n",
"\n",
"results.limit(10).map_batches(SpacyBatchInference, compute=\"actors\")"
]
},
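{
"cell_type": "markdown",
"id": "3d4e5f6a",
"metadata": {},
"source": [
"````{note}\n",
"For larger runs you can tune how the batches are processed. As a sketch, `map_batches` takes a `batch_size` argument, and depending on your Ray version you can also pass an autoscaling actor pool instead of the plain `\"actors\"` strategy; the numbers below are illustrative:\n",
"```python\n",
"results.map_batches(\n",
"    SpacyBatchInference,\n",
"    batch_size=64,  # number of rows handed to each actor call\n",
"    compute=ray.data.ActorPoolStrategy(2, 8))  # scale between 2 and 8 actors\n",
"```\n",
"````"
]
},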
{
"cell_type": "markdown",
"id": "ca995036",
"metadata": {},
"source": [
"We can now get language statistics over the whole dataset:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f64f8b3c",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Read: 100%|██████████| 6/6 [00:00<00:00, 19.95it/s]\n",
"Map Progress (1 actors 1 pending): 100%|██████████| 6/6 [00:05<00:00, 1.09it/s]\n",
"Sort Sample: 100%|██████████| 6/6 [00:00<00:00, 919.27it/s]\n",
"Shuffle Map: 100%|██████████| 6/6 [00:00<00:00, 159.14it/s]\n",
"Shuffle Reduce: 100%|██████████| 6/6 [00:00<00:00, 364.59it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'language': 'af', 'count()': 2}\n",
"{'language': 'en', 'count()': 4}\n"
]
}
],
"source": [
"languages = results.map_batches(SpacyBatchInference, compute=\"actors\")\n",
"languages.groupby(\"language\").count().show()"
]
},
{
"cell_type": "markdown",
"id": "0d638758",
"metadata": {},
"source": [
"````{note}\n",
"On the full LightShot dataset, you would get the following:\n",
"```text\n",
"{'language': 'UNKNOWN', 'count()': 2815}\n",
"{'language': 'af', 'count()': 109}\n",
"{'language': 'ca', 'count()': 268}\n",
"{'language': 'cs', 'count()': 13}\n",
"{'language': 'cy', 'count()': 80}\n",
"{'language': 'da', 'count()': 33}\n",
"{'language': 'de', 'count()': 281}\n",
"{'language': 'en', 'count()': 5640}\n",
"{'language': 'es', 'count()': 453}\n",
"{'language': 'et', 'count()': 82}\n",
"{'language': 'fi', 'count()': 32}\n",
"{'language': 'fr', 'count()': 168}\n",
"{'language': 'hr', 'count()': 143}\n",
"{'language': 'hu', 'count()': 57}\n",
"{'language': 'id', 'count()': 128}\n",
"{'language': 'it', 'count()': 139}\n",
"{'language': 'lt', 'count()': 17}\n",
"{'language': 'lv', 'count()': 12}\n",
"{'language': 'nl', 'count()': 982}\n",
"{'language': 'no', 'count()': 56}\n",
"```\n",
"````"
]
},
{
"cell_type": "markdown",
"id": "9cc5ca11",
"metadata": {},
"source": [
"We can now filter to include only the English documents and also sort them according to their score."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8c4bd03d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Filter: 100%|██████████| 6/6 [00:00<00:00, 561.84it/s]\n",
"Sort Sample: 100%|██████████| 6/6 [00:00<00:00, 1311.81it/s]\n",
"Shuffle Map: 100%|██████████| 6/6 [00:00<00:00, 319.24it/s]\n",
"Shuffle Reduce: 100%|██████████| 6/6 [00:00<00:00, 450.79it/s]\n"
]
},
{
"data": {
"text/plain": [
"[ArrowRow({'path': 'air-example-data/ocr_tiny_dataset/gnome_screenshot.png',\n",
" 'text': '= Cancel\\n\\nTake Screenshot\\n© Grab the whole screen\\n\\nGrab the current window\\n\\n|_| eeeeeeter\\n\\nGrab after a delay of 0\\n\\nEffects\\nInclude pointer\\n\\n¥ Include the window border\\n\\nApply effect: None Sa\\n\\n+. seconds\\n',\n",
" 'language': 'en',\n",
" 'score': 0.9999976791815426}),\n",
" ArrowRow({'path': 'air-example-data/ocr_tiny_dataset/gnome_screenshot.png',\n",
" 'text': '= Cancel\\n\\nTake Screenshot\\n© Grab the whole screen\\n\\nGrab the current window\\n\\n|_| eeeeeeter\\n\\nGrab after a delay of 0\\n\\nEffects\\nInclude pointer\\n\\n¥ Include the window border\\n\\nApply effect: None Sa\\n\\n+. seconds\\n',\n",
" 'language': 'en',\n",
" 'score': 0.9999965244942747}),\n",
" ArrowRow({'path': 'air-example-data/ocr_tiny_dataset/miranda_screenshot.png',\n",
" 'text': '© Viktor (Online) : Message Session\\n\\n“etto| © Whter | steno\\n\\nremus\\ntet? Fiviha\\n\\n17: dokonca to vie aj video @\\nViktor\\n\\n1818. 55 samozrejme\\n\\n1818: len moj brat to skusal\\nremus\\n\\nWA\\n\\n098003 —\\n\\nseettsgmailcom [0]\\n\\nonline\\n\\nHacemen\\n@ Ce\\n\\nieFFo\\n169 6 je <>vin ©®\\n\\nBe 22\\n\\naway\\n\\nTue\\nhn\\n\\n& Wee\\n\\nYep, Tm here\\n\\n&\\nea\\na\\nLS]\\n\\n',\n",
" 'language': 'en',\n",
" 'score': 0.8571411027551514}),\n",
" ArrowRow({'path': 'air-example-data/ocr_tiny_dataset/miranda_screenshot.png',\n",
" 'text': '© Viktor (Online) : Message Session\\n\\n“etto| © Whter | steno\\n\\nremus\\ntet? Fiviha\\n\\n17: dokonca to vie aj video @\\nViktor\\n\\n1818. 55 samozrejme\\n\\n1818: len moj brat to skusal\\nremus\\n\\nWA\\n\\n098003 —\\n\\nseettsgmailcom [0]\\n\\nonline\\n\\nHacemen\\n@ Ce\\n\\nieFFo\\n169 6 je <>vin ©®\\n\\nBe 22\\n\\naway\\n\\nTue\\nhn\\n\\n& Wee\\n\\nYep, Tm here\\n\\n&\\nea\\na\\nLS]\\n\\n',\n",
" 'language': 'en',\n",
" 'score': 0.5714285419353925})]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"languages.filter(lambda row: row[\"language\"] == \"en\").sort(\"score\", descending=True).take(1000)"
]
},
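{
"cell_type": "markdown",
"id": "4e5f6a7b",
"metadata": {},
"source": [
"We promised above to also extract named entities from these documents. Here is a minimal sketch of how that could look with the same actor pattern; the `SpacyNER` class and the `entities` column are illustrative and not part of the original pipeline (spaCy exposes a document's entities via `doc.ents`):\n",
"```python\n",
"class SpacyNER:\n",
"    def __init__(self):\n",
"        self.nlp = spacy.load(\"en_core_web_sm\")\n",
"\n",
"    def __call__(self, df):\n",
"        docs = list(self.nlp.pipe(list(df[\"text\"])))\n",
"        # Store the entities as a single string per document to keep the\n",
"        # column easily serializable.\n",
"        df[\"entities\"] = [\n",
"            \"; \".join(f\"{ent.text}/{ent.label_}\" for ent in doc.ents)\n",
"            for doc in docs\n",
"        ]\n",
"        return df\n",
"\n",
"english = languages.filter(lambda row: row[\"language\"] == \"en\")\n",
"english.map_batches(SpacyNER, compute=\"actors\").take(5)\n",
"```"
]
},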
{
"cell_type": "markdown",
"id": "8c05df96",
"metadata": {},
"source": [
"If you are interested in this example and want to extend it, you can do the following for the full dataset:\n",
"- go throught these results in order\n",
"- create labels on whether the text is a chat conversation and then train a model like [Huggingface Transformers](https://huggingface.co/docs/transformers/) on the data.\n",
"\n",
"Contributions that extend the example in this direction with a PR are welcome!"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 5
}