

EvaHan


Introduction

Important Dates

Data

The EvaHan 2024 dataset is composed of texts from classical sources, notably the Siku Quanshu (四库全书), along with other historical texts. The texts were first punctuated and segmented into sentences automatically; these automatic outputs were then corrected and refined by experts in Ancient Chinese to ensure the highest quality of training data and gold standard texts.

The corpus of ancient Chinese classic texts is diachronic, spanning thousands of years and covering the four traditional categories of Chinese canonical texts: Jing (经, classics), Shi (史, histories), Zi (子, masters), and Ji (集, collections).

Data Format

All evaluation data are txt files in Unicode (UTF-8) format. The raw texts contain only Chinese characters; after manual annotation, punctuation is added to the text, as shown in Table 1.

Table 1. Example of Ancient Chinese text before and after punctuation annotation

Type Example
Raw Text without Punctuation 亟請於武公公弗許
Annotated Text with Punctuation 亟請於武公,公弗許。

Training Data

The training data comprise 10 million characters drawn from the Siku Quanshu. The files are UTF-8 plain text in traditional Chinese script. The training data will be sent to your email address after registration.

Test Data

The test data include approximately 50,000 characters of Ancient Chinese text. More details will be provided to participants before the evaluation. The download link will be released soon.

Task

This section offers a detailed description of the tasks encompassed in EvaHan 2024.

Sentence Segmentation and Sentence Punctuation

Sentence segmentation converts Chinese text into a sequence of sentences, each separated by a single space. Sentence punctuation additionally places the appropriate punctuation mark at the end of each sentence, as exemplified in Table 2. In many Chinese language processing systems these two processes are addressed together; consequently, for this shared task, participants are required to automatically transform raw text into punctuated text. The evaluation toolkit will assess the effectiveness of both sentence segmentation and punctuation.

Table 2. Examples of Sentence Segmentation and Sentence Punctuation

Raw Text without Punctuation 亟請於武公公弗許
Annotated Text with Sentence Segmentation 亟請於武公 公弗許
Annotated Text with Punctuation 亟請於武公,公弗許。

Please note that EvaHan 2024 does not accept running results with sentence segmentation only.
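
Although no particular approach is prescribed, a common way to operationalize this task is character-level sequence labeling: each character of the raw text receives a label naming the punctuation that follows it (or none). Below is a minimal Python sketch of the conversion between annotated text and per-character labels; the label scheme and function names are illustrative assumptions, not part of the official toolkit.

# Sketch: convert between punctuated text and per-character labels.
# PUNCT mirrors Table 3; attaching marks to the preceding character is a
# simplification (opening quotes and book-title marks actually precede a
# character, so a full system would need separate before/after labels).
PUNCT = set(",。、:;?!“”‘’《》")

def text_to_labels(annotated):
    """Return (raw_text, labels); labels[i] is the punctuation string
    following character i, or "" if none."""
    chars, labels = [], []
    for ch in annotated:
        if ch in PUNCT:
            if labels:
                labels[-1] += ch
        else:
            chars.append(ch)
            labels.append("")
    return "".join(chars), labels

def labels_to_text(raw, labels):
    """Inverse mapping: re-insert predicted punctuation into raw text."""
    return "".join(c + l for c, l in zip(raw, labels))

raw, labels = text_to_labels("亟請於武公,公弗許。")
assert raw == "亟請於武公公弗許"
assert labels == ["", "", "", "", ",", "", "", "。"]
assert labels_to_text(raw, labels) == "亟請於武公,公弗許。"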

Punctuation Set

In the sentence punctuation task, systems are required to assign punctuation to each sentence, as shown in Table 1.

The punctuation set is shown in Table 3.

Table 3. The punctuation set

Punctuation Name
, Comma
。 Period
、 Slight-pause
: Colon
; Semicolon
? Question
! Exclamation
“” Double Quotes
‘’ Single Quotation
《》 Book Title

Evaluation

Metrics

Each team will initially have access only to the training data. Later, the unlabeled test data will be released; after the assessment, the labels for the test data will be released as well. The scorer employed for EvaHan is a modified version of the one developed in [1]. An illustration of the scorer's output is given in Table 4. The evaluation first aligns the system-produced punctuation with the gold standard; then sentence segmentation (SS) and sentence punctuation (SP) are evaluated, calculating precision, recall, and F1 score for each. The final ranking of teams will be based on the F1 scores.

Table 4. Example of the scorer's output

Task Precision Recall F1 Score
Sentence Segmentation 95.00 92.00 93.48
Sentence Punctuation 90.00 91.00 90.50
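
Since the official scorer has not yet been released, the following is a minimal sketch of how such metrics can be computed, assuming SS is scored on the positions of predicted breaks and SP on (position, mark) pairs. This reading of the description above, including treating every mark position as a break, is an assumption, not the official implementation.

# Sketch of precision/recall/F1 for SS and SP over per-character labels
# (as produced by the conversion sketch above). Treating every punctuation
# position as a sentence break is a simplifying assumption.
def prf(pred, gold):
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return 100 * p, 100 * r, 100 * f1

def evaluate(pred_labels, gold_labels):
    ss_pred = {i for i, l in enumerate(pred_labels) if l}
    ss_gold = {i for i, l in enumerate(gold_labels) if l}
    sp_pred = {(i, l) for i, l in enumerate(pred_labels) if l}
    sp_gold = {(i, l) for i, l in enumerate(gold_labels) if l}
    return {"SS": prf(ss_pred, ss_gold), "SP": prf(sp_pred, sp_gold)}

pred = ["", "", "", "", ",", "", "", "。"]
gold = ["", "", "", "", "。", "", "", "。"]
print(evaluate(pred, gold))  # SS is perfect; SP penalizes the wrong comma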

Two Modalities

Each participant can submit runs following two modalities. In the closed modality, the resources each team may use are limited: only the training data Text_Train and the pretrained model XunziALLM, a large language base model for ancient Chinese processing, are allowed. No other resources may be used in the closed modality.

In the open modality, there is no limit on resources, data, or models. Annotated external data, such as the components or pinyin of Chinese characters, as well as word embeddings, may be employed. However, each team must state all the resources, data, and models used in each system in its final report.

Table 5. Pre-trained models for closed modality

Model name Language Description
XunziALLM Ancient Chinese Large language base model for ancient Chinese processing.

Baselines

As a baseline, we will provide the scores obtained on the test set by a SikuRoBERTa-BiLSTM-CRF (Conditional Random Field) model trained on the training set without additional resources.
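
The baseline system itself is not distributed, but the architecture it names is standard. Below is a minimal PyTorch sketch of the SikuRoBERTa-BiLSTM portion as a per-character punctuation tagger; the CRF decoding layer is omitted for brevity (a library such as pytorch-crf could supply it), and the Hugging Face model id and the 11-label scheme are assumptions.

# Minimal sketch of the SikuRoBERTa-BiLSTM part of the baseline; the CRF
# layer is omitted. Model id and label count are assumptions.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_LABELS = 11  # "no mark" plus the 10 punctuation marks of Table 3

class PunctTagger(nn.Module):
    def __init__(self, encoder_name="SIKU-BERT/sikuroberta", hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        out, _ = self.lstm(states)
        return self.classifier(out)  # per-token label logits

tokenizer = AutoTokenizer.from_pretrained("SIKU-BERT/sikuroberta")
model = PunctTagger()
batch = tokenizer("亟請於武公公弗許", return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
pred = logits.argmax(-1)  # train with nn.CrossEntropyLoss on gold labels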

How to Participate

Participants will be required to submit their runs and to provide a technical report for the task they participated in.

Registration

If you are interested in participating, please fill out the electronic application form: https://forms.office.com/r/jxDBanU7pd. When filling it out, please make sure your information is correct and your email address is working. After receiving your registration information, we will send you a notification email, so please check your inbox.

Submitting Runs

Each team can submit up to two runs: the first must be produced under the closed modality, while the second may be produced under the open modality. The closed run is compulsory; the open run is optional.

Once the system has produced the results for the task over the test set, participants have to follow the submission instructions to complete their submission.

Writing the Technical Report

Technical reports will be included in the proceedings of the Workshop on Language Technologies for Historical and Ancient Languages 2024 (LT4HALA 2024) as short papers and published alongside the LREC-COLING proceedings.

All the reports must:

• be submitted through the START platform (see the START submission page of the workshop).

• use the official LREC-COLING style templates.

• not exceed four (4) pages of content (excluding references).

• contain (at least) the following sections: description of the system, results, discussion, and references.

Reports will receive a light review: we will check the correctness of the format, the exactness of results and rankings, and the overall exposition. If needed, we will contact the authors to ask for corrections.

Consultation

If you have any questions about this evaluation, please feel free to send an email to our official address: libin.njnu@gmail.com.

Participants

Organizers

Student Members

Appendix: Selection of Resources

Bibliography

[1] Cheng Ning, Li Bin, Xiao Liming, Xu Changwei, Ge Sijia, Hao Xingyue, and Feng Minxuan. Integration of Automatic Sentence Segmentation and Lexical Analysis of Ancient Chinese Based on BiLSTM-CRF Model. In Proceedings of the 1st Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2020), pp. 52–58, Marseille, 11–16 May 2020.

