A Flagship Initiative of CICT

Digital Library of Tamil Palm-Leaf Manuscripts

865 manuscript bundles (approx. 1,00,000 leaves) covering 41 Classical Tamil texts, collected from institutions across India and abroad. Now powered by computer-vision OCR: upload a palm-leaf image, extract the Tamil script, and connect it to the living lexicon.

865 Manuscript Bundles (approx. 1,00,000 leaves)

41 Classical Texts

15 Source Institutions

6 Critical Editions

865 Manuscript Pothi Bundles

1,08,655 Lexical Words Indexed

2,75,262 Citations Documented

6,758 Lexicon Entries

Computer Vision · AI-Powered

Manuscript OCR Engine

Upload a photo of a palm-leaf manuscript or old printed text. A transformer-based pipeline (HTR-VT + TrOCR) recognises Tamil characters with up to 50% lower character error rate (CER) than legacy engines, then automatically links recognised passages to lexicon entries. The models are fine-tuned on CICT's six critical editions as ground truth.

Supported inputs: palm-leaf scans and field inscription photos, as TIFF or JPEG.

HTR-VT · TrOCR · ViT Encoder · Tamil Fine-tune

◉ Field Researcher Mode — Works offline. Ideal for documenting inscriptions, copper-plate grants, or temple epigraphs in the field. Results automatically link to the 6,758-entry Classical Tamil Lexicon.

Vision Transformer Pipeline (CER ↓ 50%)

Pre-process → Layout · CNN → HTR-VT → TrOCR → Lexicon Link

Char. Error Rate: 1.60 (↓ vs Tesseract 3.20)

Training Lines: 4,037 (6 editions · 41 texts)

Architecture: ViT + SAM (span-mask regularised)

Based on: Li et al., HTR-VT: Handwritten Text Recognition with Vision Transformer, Pattern Recognition 158:110967 (2025) · Li et al., TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models (AAAI 2023). Fine-tuned on CICT's six Classical Tamil critical editions as ground-truth witness data.
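The character error rate quoted above can be illustrated with a minimal sketch. It assumes the usual definition (edit distance between reference and recognised text, divided by reference length) and counts Unicode code points, so a Tamil vowel sign or pulli counts as its own character; the page does not say how CICT's evaluation segments characters, so treat both choices as assumptions. The function names are illustrative.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# The two figures quoted in the panel above: 3.20 (Tesseract) and 1.60.
print(relative_reduction(3.20, 1.60))  # 0.5, i.e. the "CER ↓ 50%" headline
```

Counting grapheme clusters instead of code points would give different absolute values for Tamil text, which is one reason CER figures from different evaluations are hard to compare directly.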


⚡ Active Research · 2025-2026 Roadmap

Projected CER reduction: 30-60%

Three frontier techniques are under active evaluation against the 19-specimen Ground Truth corpus (CICT-PLM-GT-001 → GT-019). Validated additions are promoted to the production pipeline above.

LLM Post-Correction Layer

Open-weight LLMs (fine-tuned ByT5, Llama-3) are applied as a re-ranking stage on top of HTR-VT output. Reported CER reduction: 56-60% relative to base recognition. The lexicon-link stage feeds back morphological constraints for Tamil-specific correction.

Kim et al., arXiv:2502.01205 (Feb 2025) · Boroş et al., arXiv:2504.00414 (Apr 2025) · Li, arXiv:2410.24034 (Oct 2024)
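As a rough illustration of how a lexicon can constrain noisy recognition output, the sketch below snaps a recognised token to its closest lexicon headword. It is a deliberately simple stand-in that uses fuzzy string matching from Python's standard `difflib`; it is not the LLM re-ranking described above, and the three-word `sample_lexicon`, the function name and the `cutoff` value are illustrative assumptions rather than CICT data or settings.

```python
import difflib
from typing import Optional, Sequence


def link_to_lexicon(token: str, lexicon: Sequence[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest lexicon headword, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical three-entry lexicon, for illustration only.
sample_lexicon = ["அகம்", "புறம்", "திணை"]

# A token whose final pulli was lost in recognition still links to its headword.
print(link_to_lexicon("அகம", sample_lexicon))
# Unrelated input is left unlinked rather than forced onto an entry.
print(link_to_lexicon("xyz", sample_lexicon))
```

A stricter cutoff links fewer tokens but avoids wrong matches; a production system would also need morphological analysis, since inflected Tamil forms differ from their headwords by more than a character or two.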

Vision-Language Model End-to-End OCR

Multimodal OCR models bypass the segmentation stage entirely and process whole folios in one pass, which is particularly effective for low-resource historical scripts. Candidates: CHURRO (open-weight VLM trained for historical text recognition), olmOCR-2-7B (Allen AI; Qwen2.5-VL fine-tuned with GRPO RL on 270K pages), Qwen3-VL, PaddleOCR-VL (109 languages), DeepSeek-OCR (~100 languages).

CHURRO, arXiv:2509.19768 (Sep 2025) · olmOCR-2-7B-1025 (Allen AI, Oct 2025) · PaddleOCR-VL · DeepSeek-OCR · Qwen3-VL (2025) · Manchu VLM-OCR, arXiv:2507.06761 (Jul 2025)

Palm-Leaf Specialised Pipeline

Domain-specific advances designed for palm-leaf material: PLM-Res-U-Net for binarization (it handles the palm-leaf texture variation that breaks generic methods), PALM-LAY for layout analysis (a 6-script benchmark dataset including Tamil, with 566 pages and 6,000+ annotated regions), and hybrid Indo-Aryan/Tamil classification for mixed-script bundles.

PLM-Res-U-Net, J.Comp.Hum. (2025) · PALM-LAY, Thuon et al., ICDAR 2025 Workshops · Dinesh et al., Sci.Rep. 15 (Nov 2025)

Status: Each track will be benchmarked against the GT-001 → GT-019 corpus once additional ground-truth verification rounds are complete. The unbroken 19-specimen sequence (kurals 1–190) provides 190 verified couplets across two scripts (Tamil + Grantha) and 12 documented paleographic features, the kind of fine-grained benchmark needed to measure how modern HTR/VLM systems actually perform on degraded palm-leaf material.
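The specimen numbering can be sketched as a lookup between specimen IDs and kural numbers. The ID pattern and the totals (19 specimens, kurals 1–190) come from the text above; the even split of ten consecutive kurals per specimen is an assumption inferred from 190 / 19 and is not stated on this page. The function names are illustrative.

```python
SPECIMEN_COUNT = 19
KURALS_PER_SPECIMEN = 10  # assumption: 190 couplets spread evenly over 19 specimens


def specimen_id(n: int) -> str:
    """Format specimen number n (1-19) in the CICT-PLM-GT-NNN pattern."""
    if not 1 <= n <= SPECIMEN_COUNT:
        raise ValueError(f"specimen number out of range: {n}")
    return f"CICT-PLM-GT-{n:03d}"


def kural_range(n: int) -> tuple[int, int]:
    """First and last kural number covered by specimen n, under the even-split assumption."""
    specimen_id(n)  # reuse the range check
    start = (n - 1) * KURALS_PER_SPECIMEN + 1
    return start, start + KURALS_PER_SPECIMEN - 1


def specimen_for_kural(kural: int) -> str:
    """Specimen ID that would contain the given kural number (1-190)."""
    if not 1 <= kural <= SPECIMEN_COUNT * KURALS_PER_SPECIMEN:
        raise ValueError(f"kural number outside the verified sequence: {kural}")
    return specimen_id((kural - 1) // KURALS_PER_SPECIMEN + 1)


print(kural_range(19))          # (181, 190)
print(specimen_for_kural(11))   # CICT-PLM-GT-002
```

If a specimen turns out to span more or fewer couplets, the mapping would need an explicit table instead of arithmetic.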

CICT Tirukkural Ground Truth Corpus

Ground Truth Corpus · Tamil HTR Training Specimens

Towards a Complete Tirukkural Corpus

133 chapters · 1,330 kurals, published progressively as scholars complete verified transcriptions. Click any published tile to see the full specimen.

Browse the corpus

Reference Texts

Canonical Reference Editions

Published critical editions of classical Tamil works, embedded as canonical reference data alongside the manuscript collection. Each edition supplies the authoritative source-text against which palm-leaf manuscript folios of the same work can be compared once they are added to the corpus, just as the canonical Tirukkural text serves the manuscript folios in the Ground Truth section above.

◉ Critical Edition · KAIN-CE-001 · Embedded

கைந்நிலை

Kainnilai · Patiṉeṇkīḻkkaṇakku #18

Author: Pullaṅkāṭaṉār (புல்லங்காடனார்) · Editor: Dr. Se. Karumpāyiram · CICT Critical Edition #6 · 2024 · ISBN 978-81-19249-48-0

60 total poems · 5 tiṇai sections · 41 complete · 19 fragmentary

Kainnilai is the eighteenth and final work of the Patiṉeṇkīḻkkaṇakku (Eighteen Lesser Anthologies), a post-Sangam corpus dating to the centuries immediately following the Sangam classical era. Its 60 poems are organised into the five canonical akam landscapes (tiṇai), twelve poems each, and explore love and longing through landscape symbolism. The text's status was the subject of extensive scholarly debate (1887–1931) before I. V. Anantharamaiyar's 1931 first edition established it as part of the canonical anthology. This CICT critical edition (Series #6, 2024) is the first to systematically establish the source-reading by collating palm-leaf manuscripts, paper manuscripts, prior printed editions, and quotations in classical commentaries. The 19 fragmentary poems preserve the editor's … lacuna markers exactly as published: these damaged readings reflect the actual condition of the manuscript witnesses rather than gaps in transcription.

▸ Browse all 60 poems by tiṇai

◉ Critical Edition · IRAI-CE-001 · Embedded

இறையனார் களவியல்

Iṟaiyaṉār Kaḷaviyal · Akapporul grammar with Nakkīrar's commentary

Mūlam: Iṟaiyaṉār (இறையனார்) · Commentary: Nakkīrar (நக்கீரர்) · Editor: Dr. A. Damodaran · Associate Editor: Dr. Se. Karumpāyiram (also KAIN-CE-001) · CICT Critical Edition: 1 · 2025 · Pub. No. 233 · ISBN 978-93-49646-70-4 · 574 pages

60 sutras (nūṟpā) · 18 MSS collated · 15 palm-leaf MSS · 574 pages

Iṟaiyaṉār Kaḷaviyal (also known as Iṟaiyaṉār Akapporuḷ) is the foundational treatise on akam grammar, the poetic conventions for clandestine love (kaḷavu) in classical Tamil literature. Its 60 nūṟpā (sutras) are attributed to Iṟaiyaṉār, a legendary author traditionally identified with Lord Śiva. The surviving prose commentary was composed by Nakkīrar of the Last Sangam and is the earliest surviving Tamil prose commentary, predating all other Tamil commentarial literature. The work was first printed by C.V. Damodaram Pillai in 1883; subsequent printed editions (Bavanantam Pillai 1916, Govindaraja Mudaliyar 1939, Namasivaya Mudaliyar 1943, Saiva Siddhantha Press 1953) presented divergent readings, each based on whichever manuscripts its editor possessed. This CICT critical edition (Critical Edition: 1, 2025) is the first to systematically collate 18 manuscripts (15 palm-leaf + 3 paper) along with all prior printed editions and citation-witnesses to establish a definitive source-reading.

The sutras preserve a remarkable mixed-numeral system: most use Tamil numerals (௧, ௩, ௪…), but sutras 2, 7, and 8 use the classical vowel-letter numerals உ, எ, அ respectively, a paleographic feature faithfully reproduced from the source manuscripts. The associate editor, Dr. Se. Karumpāyiram, also edited KAIN-CE-001 (Kainnilai) in CICT's growing critical-edition series.
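Software that indexes the sutras has to cope with this mixed numbering. The sketch below is one possible normalisation, not the edition's own tooling: the three vowel-letter values come from the paragraph above, and everything else is an assumption. In particular it assumes the Tamil-digit labels are written positionally (so 12 is ௧௨), which lets Python's built-in `int()` parse them; traditional non-positional signs such as ௰ (ten) are not handled and raise `ValueError`.

```python
# Vowel letters standing in for numerals, as described for sutras 2, 7 and 8.
VOWEL_LETTER_NUMERALS = {"உ": 2, "எ": 7, "அ": 8}


def sutra_number(label: str) -> int:
    """Convert a sutra label to an integer.

    Accepts a vowel-letter numeral, or a run of decimal digits in any
    script (Tamil ௦-௯ or ASCII 0-9). Anything else raises ValueError.
    """
    label = label.strip()
    if label in VOWEL_LETTER_NUMERALS:
        return VOWEL_LETTER_NUMERALS[label]
    # int() accepts Unicode decimal digits, including the Tamil digits.
    return int(label)


print([sutra_number(s) for s in ["௧", "உ", "௩", "௪", "எ", "அ"]])
```

Keeping the original label alongside the normalised integer preserves the paleographic detail while still allowing numeric sorting and lookup.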

▸ Browse all 60 sutras (mūlam) + sample commentary on Sutra 1

Six-Stage Pipeline

Digital Library Roadmap

From physical acquisition to a globally accessible open repository: six stages transforming 865 manuscript bundles into a living scholarly platform.

Collect

865 palm-leaf and paper manuscript bundles collected over 15 years from 15 institutions across India and abroad.

✓ Complete

Digitize

High-resolution TIFF images created for all 865 Manuscript Bundles at 600 dpi archival standard. Foundation complete.

✓ Complete

Catalogue

Priority 2026–27: Metadata preparation for every manuscript — title, scribe, era, script, folio count, condition.

⟳ In Progress

Transcribe

Train HTR models on Tamil palm-leaf scripts. Six critical editions serve as ground-truth training data.

◯ Upcoming

Annotate

Scholarly apparatus — variant readings, glossary, cross-references, commentary linked to lexicon entries.

◯ Upcoming

Publish

Open, AI-enabled Digital Library: globally accessible, searchable, IIIF-compliant, and interoperable.

◯ Upcoming
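For the planned IIIF-compliant publication stage, the sketch below shows how an image request URL is assembled under the IIIF Image API pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The base URL and the use of a specimen ID as the image identifier are hypothetical; the page does not describe CICT's server layout or identifiers.

```python
from urllib.parse import quote


def iiif_image_url(base: str, identifier: str, region: str = "full",
                   size: str = "max", rotation: str = "0",
                   quality: str = "default", fmt: str = "jpg") -> str:
    """Build an IIIF Image API request URL.

    The identifier is percent-encoded so that characters such as '/'
    inside it are not mistaken for path separators.
    """
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier, for illustration only.
print(iiif_image_url("https://example.org/iiif", "CICT-PLM-GT-001"))
```

Serving derivatives this way would let viewers request full images or cropped regions on demand while the 600 dpi TIFF masters stay in archival storage.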

Journey Through Time

2,500+ Years of Tamil

Drag the scrubber to travel through Tamil literary history, from the Sangam Age to the AI-enabled Digital Library of today, and see which works existed at any moment across the ages.

Current Era: 2026 CE

The Digital Age

CICT has established a fully AI-enabled Digital Library of 865 palm-leaf manuscript bundles, with transformer-based HTR engines, IIIF interoperability, and global scholarly reach.

41 Tamil texts in existence

865 Pothi Bundles


Curated Journeys

Discover by Theme

Don't know where to start? Tell us what interests you (love poetry, sea voyages, kings and battles, ancient grammar) and a curated gallery of relevant manuscripts will be assembled for you, with a scholarly introduction.

Begin your journey

865 Manuscript Pothi Bundles · 41 Texts

Browse the Manuscript Collection

The complete corpus of Classical Tamil texts in CICT's Pavendhar Library. Filter by literary category or search by title.

Scope & Provenance

Collection Overview

Source Institutions · Geographic View

Manuscripts converged on CICT Chennai from 15 institutions across India and France. Hover over any marker or list entry to see details.

15 Institutions · 3+1 States · Country · 865 Bundles

Map legend: CICT (host) · Tamil Nadu source · Other / international

Distribution by Category

Collection Profile

Material: Palm Leaf (primary) · Paper (secondary)

Script: Tamil · Tamil-Grantha

Age: 600 AD and earlier

Digitization: TIFF 600 dpi · Complete

Metadata: In preparation 2026–27

Critical Editions: 6 published (ground truth for HTR)

Geographic spread: 15 source institutions across 3 Indian states and France: Tamil Nadu (12), Andhra Pradesh (1), West Bengal (1), France (1). Chennai and Thanjavur form the primary clusters, and the Bibliothèque nationale de France is the sole international source.

Central Institute of Classical Tamil

About CICT

47,450

Total volumes in the Pavendhar Library, including 4,800 rare Classical Tamil books.

3,923

Non-book materials, digitized CDs and DVDs in the collection.

50+

Languages into which Classical Tamil works have been translated by CICT.

CICT is an autonomous institution under the Ministry of Education, Government of India. Located at Chemmozhi Salai, Perumbakkam, Chennai — 600 100. The Digital Library of Tamil Palm-Leaf Manuscripts is a flagship programme of CICT and forms a core component of the proposed AI Centre for Classical Tamil (AI-CCT).
