LT4HALA 2022
EvaHan
INTRODUCTION
EvaHan 2022 is the first campaign entirely devoted to the evaluation of Natural Language Processing (NLP) tools for the Ancient Chinese language, which dates back to around 1000 BC-221 BC.
The first edition of EvaHan features a single task: a joint task of Word Segmentation and POS Tagging.
EvaHan 2022 is organized by Bin Li, Yiguo Yuan, Minxuan Feng, Chao Xu, and Dongbo Wang.
IMPORTANT DATES
- 20 December 2021: training data available
- Evaluation window:
  - 31 March 2022: test data available
  - 6 April 2022: system results due to organizers
- 26 April 2022: reports submission via SoftConf
- 10 May 2022: short report review deadline
- 24 May 2022: camera ready version submission via SoftConf
DATA
Data Set | Data Name | Source | Word Tokens | Character Tokens |
---|---|---|---|---|
Train | Zuozhuan_Train | Zuozhuan | 166,142 | 194,995 |
Test A | Zuozhuan_Test | Zuozhuan | 28,131 | 33,298 |
Test B | Blind_Test | Another ancient Chinese book of similar content | Around 40,000 | Around 50,000 |
Training Data
Download training data: zuozhuan_train_utf8.zip
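As a quick way to inspect the training data, the sketch below counts word and character tokens so they can be compared against the table above. It assumes the archive unpacks to a UTF-8 text file (hypothetically named zuozhuan_train_utf8.txt here) in which each token is written as word/POS and tokens are separated by spaces; the authoritative description of the annotation format is in the guidelines, so treat these details as assumptions.

```python
# Minimal sketch for inspecting the training data.
# Assumptions (not confirmed on this page): the archive unpacks to a UTF-8
# plain-text file, each token is written as word/POS, and tokens are
# separated by whitespace. See the guidelines for the real format.
from collections import Counter

def count_tokens(path="zuozhuan_train_utf8.txt"):
    word_tokens = 0
    char_tokens = 0
    pos_counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for token in line.split():
                # Split off the POS tag after the last slash, if present.
                word, _, pos = token.rpartition("/")
                if not word:               # token carried no slash at all
                    word, pos = token, "UNK"
                word_tokens += 1
                char_tokens += len(word)
                pos_counts[pos] += 1
    return word_tokens, char_tokens, pos_counts

if __name__ == "__main__":
    words, chars, tags = count_tokens()
    print(f"word tokens: {words}, char tokens: {chars}")
    print("most frequent POS tags:", tags.most_common(10))
```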
Test Data
Download test data: EvaHan_testa_raw.txt and EvaHan_testb_raw.txt
Download gold data: EvaHan_testa_gold.txt and EvaHan_testb_gold.txt
The test data is provided in raw format, containing only Chinese characters and punctuation. The gold standard test data, i.e. the annotation used for the evaluation, will be provided to the participants after the evaluation. There are two test sets. Test A is designed to measure how a system performs on data from the same book: Zuozhuan_Test is extracted from Zuozhuan and does not overlap with Zuozhuan_Train. Test B is designed to measure how a system performs on similar data, i.e. texts of similar content but from a different book: Blind_Test has not been released publicly, and its size is similar to that of Zuozhuan_Test. Further details of the test data will be provided to the participants after the evaluation.
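Since the raw test files should contain only Chinese characters and punctuation, a simple sanity check such as the sketch below can flag anything unexpected before a system is run on them. It uses Unicode categories as a rough heuristic (Lo for CJK ideographs, P* for punctuation) and assumes the file names listed above.

```python
# Rough sanity check on the raw test files: they should contain only
# Chinese characters and punctuation (plus line breaks).
import unicodedata

def check_raw(path):
    unexpected = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for ch in line.rstrip("\n"):
                cat = unicodedata.category(ch)
                # 'Lo' covers CJK ideographs; categories starting with 'P'
                # cover the various punctuation classes.
                if cat != "Lo" and not cat.startswith("P"):
                    unexpected.add(ch)
    return unexpected

for name in ("EvaHan_testa_raw.txt", "EvaHan_testb_raw.txt"):
    odd = check_raw(name)
    if odd:
        print(name, "unexpected characters:", sorted(odd))
    else:
        print(name, "contains only CJK characters and punctuation")
```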
EVALUATION
FINAL script (revised to handle some new kinds of format errors found in participants' files): eval_EvaHan_2022_FINAL.py
The old scorers are still available:
- first version: EvaHan_scorer.zip
- second version: eval_EvaHan_2022.py
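For a rough idea of what the scoring measures, the sketch below computes precision, recall, and F1 over word spans (segmentation) and over span-tag pairs (joint segmentation and POS tagging). It is only an illustration under the assumed word/POS space-separated format described above, not a replacement for the official eval_EvaHan_2022_FINAL.py script.

```python
# Illustrative scorer, NOT the official EvaHan script.
# Assumes one sentence per line and space-separated word/POS tokens.

def parse_line(line):
    """Turn a line of word/POS tokens into (word, pos) pairs."""
    pairs = []
    for tok in line.split():
        word, _, pos = tok.rpartition("/")
        pairs.append((word, pos) if word else (tok, "UNK"))
    return pairs

def spans(pairs):
    """Convert (word, pos) pairs into character-offset spans."""
    out, start = [], 0
    for word, pos in pairs:
        end = start + len(word)
        out.append((start, end, pos))
        start = end
    return out

def prf(pred, gold):
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def score(pred_path, gold_path):
    seg_pred, seg_gold, tag_pred, tag_gold = set(), set(), set(), set()
    with open(pred_path, encoding="utf-8") as fp, \
         open(gold_path, encoding="utf-8") as fg:
        for i, (lp, lg) in enumerate(zip(fp, fg)):
            for s, e, t in spans(parse_line(lp)):
                seg_pred.add((i, s, e)); tag_pred.add((i, s, e, t))
            for s, e, t in spans(parse_line(lg)):
                seg_gold.add((i, s, e)); tag_gold.add((i, s, e, t))
    print("word segmentation P/R/F1:", prf(seg_pred, seg_gold))
    print("segmentation + POS P/R/F1:", prf(tag_pred, tag_gold))
```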
HOW TO PARTICIPATE
Each participant can submit runs under two modalities. In the closed modality, the resources each team may use are limited: each team may only use the training data Zuozhuan_Train and the pretrained model SIKU-Roberta, which is pretrained on Siku Quanshu (四库全书), a very large collection of traditional Chinese texts. No other resources are allowed in the closed modality. In the open modality, there is no limit on resources, data, or models: annotated external data, such as the components or Pinyin of Chinese characters, as well as word embeddings, may be employed. However, each team must state all the resources, data, and models used in each system in the final report.
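For the closed modality, a common recipe is to cast the joint task as character-level sequence labelling (e.g. segmentation-boundary prefixes combined with POS tags) and fine-tune the allowed pretrained model on Zuozhuan_Train only. The sketch below shows how such a setup might be initialised with the Hugging Face transformers library; the model identifier and the tag-set size are placeholders, not values confirmed on this page.

```python
# Sketch of a closed-modality setup using the transformers library.
# The model identifier below is an assumption; use whatever SIKU-Roberta
# checkpoint the organizers point to, and derive the tag set from
# Zuozhuan_Train.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "SIKU-BERT/sikuroberta"   # assumed hub name, not confirmed here
NUM_TAGS = 20                        # placeholder: size of your label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID, num_labels=NUM_TAGS
)

# From here, fine-tune on character sequences labelled with joint
# segmentation/POS tags, using only Zuozhuan_Train as training data.
```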
For detailed information, please read the guidelines.