
Pun, K H; Ip, Gary; Chong, C F; Chan, Vivien; Chow, K P; Hui, Lucas; Tsang, W W; Chan, H W --- "Processing Legal Documents in the Chinese-Speaking World: the Experience of HKLII" [2004] UTSLawRw 9; (2004) 6 University of Technology Sydney Law Review 132


PROCESSING LEGAL DOCUMENTS IN THE CHINESE-SPEAKING WORLD: THE EXPERIENCE OF HKLII

K H Pun, Gary Ip, C F Chong, Vivien Chan,

K P Chow, Lucas Hui, W W Tsang, H W Chan

The Hong Kong Legal Information Institute (HKLII)[1] is a joint project within the University of Hong Kong between the Department of Computer Science and Information Systems (CSIS) and the Faculty of Law. It is the newest member of the World Legal Information Institute (WorldLII). HKLII was initiated by Professor Graham Greenleaf of the Australasian Legal Information Institute (AustLII) and was greatly assisted by the technical staff of AustLII during its initial stage of development. HKLII is now fully operated and maintained by the CSIS Department at the University. With a view to promoting and supporting the rule of law in Hong Kong, HKLII is a free, independent, non-profit Internet facility providing the general public with legal information relating to Hong Kong.

Bilingual Nature of Legal Information in Hong Kong

Statutory law in Hong Kong was, until 1989, enacted in English only. A new chapter in the legislative history of Hong Kong began with the amendment of the Hong Kong Royal Instructions[2] in August 1986 and the Official Languages Ordinance (Cap 5) in March 1987, which ushered in the era of bilingual laws. Pursuant to section 4 of the amended Official Languages Ordinance, all new ordinances must be enacted and published in both the English and Chinese languages, except for ordinances that amend other ordinances enacted in the English language only. In parallel with this, a new section 10B was added to the Interpretation and General Clauses Ordinance (Cap 1) stipulating that both the English and Chinese texts of an ordinance shall be equally authentic. The first bilingual ordinance, the Securities and Futures Commission Ordinance (Cap 24), was enacted in April 1989. Since then, more than 4,000 bilingual ordinances, amending ordinances and pieces of subsidiary legislation have been enacted in Hong Kong.

Apart from statutory law, the use of Chinese as an alternative to English was allowed in the Magistrates’ Courts in 1974. In February 1996, the restriction against the use of Chinese in the District Court and the Lands Tribunal was lifted. In December 1996, similar restrictions in the High Court for hearing appeals from Magistrates’ Courts, the Labour Tribunal and the Small Claims Tribunal were also removed. Finally, in June 1997, the use of Chinese in all civil and criminal proceedings was allowed in the High Court. The stage was then set for a bilingual court system in Hong Kong. As a result, there has been a steady increase in the number of cases tried in Chinese.

With the bilingual laws and bilingual court system in Hong Kong, the legal information housed in HKLII naturally includes materials written in both Chinese and English, and HKLII must therefore be able to handle documents in both languages.

Problems in Processing Chinese

Two features of Chinese not present in Western languages are the large set of Chinese characters and the absence of boundaries between adjacent Chinese words. These two features have given rise to two well-known problems in processing Chinese: the encoding problem and the segmentation problem.

Encoding Problem

Chinese is written with more than 40,000 characters (ideograms composed of strokes of various types). The conventional method of using one byte to encode English for computer processing is therefore not appropriate for Chinese. Any method for encoding Chinese for computer processing must be able to handle the large set of Chinese characters. All existing encoding methods for Chinese, such as Big5, GB, GBK and Unicode, employ multi-byte codes. This difference in code size means that programs written for processing English documents cannot be used to process Chinese documents directly.
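The difference is easy to see by encoding a short English word and a short Chinese word under some of the schemes mentioned above. The following sketch (in Python, used here purely for illustration) prints the byte counts involved:

    # Byte lengths of an English and a Chinese word under common encodings.
    english = "law"
    chinese = "法律"   # the word "law" written with two Chinese characters

    for text in (english, chinese):
        for encoding in ("big5", "gbk", "utf-8"):
            print(text, encoding, len(text.encode(encoding)), "bytes for", len(text), "characters")

    # "law" needs one byte per letter under every encoding above, whereas the two
    # characters of "法律" occupy four bytes in Big5 and GBK, and six in UTF-8.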

Segmentation Problem

Another feature that distinguishes Chinese from western languages is that a Chinese sentence is a sequence of characters without any delimiters in between. The Chinese language does not use spaces (or any other delimiter) to indicate word boundaries. Thus it is for the reader to segment a sentence into its individual words. This may sometimes give rise to ambiguities, as illustrated by the following sentence:

我們要發展中國家用電器 [3]

One possible segmentation of the sentence is

我們 要 發展 中國 家用電器

which gives the sentence the following meaning: “We have to develop China’s home electrical appliances.” But another possible segmentation is

我們 要 發展中國家 用 電器

which gives the meaning: “We want developing countries to use electrical appliances.”

The segmentation problem is even more acute in ancient Chinese texts, which do not use punctuation marks. Thus the reader must decide not only the word boundaries but also the sentence boundaries. The difficulties in parsing such texts and the ambiguities that may arise are illustrated by the following passage.

無雞鴨也可無魚蝦也可青菜豆腐不可少一個錢也不要

This passage does not have any punctuation mark, and even the sentence boundaries are unclear. A reader may segment the passage into sentences by supplying the punctuation marks as follows:

無雞，鴨也可。無魚，蝦也可。青菜豆腐不可。少一個錢也不要。

With these punctuation marks, the passage reads, “If there is no chicken, duck is fine. If there is no fish, shrimp will do. I do not want green vegetables or bean curds. And I demand the exact sum to be paid to me.”

However, another reader may supply punctuation marks in the following way, which is equally legitimate:

無雞鴨也可。無魚蝦也可。青菜豆腐不可少。一個錢也不要。

The meaning of the passage is completely different from the previous one. It now reads, “It is alright if there is no chicken or duck. It is alright if there is no fish or shrimp. But I must have green vegetables and bean curds. And I won’t accept any money paid to me.”

The examples above illustrate the segmentation problem in Chinese and the intrinsic difficulty in parsing Chinese. Although modern Chinese uses punctuation marks and thus avoids the segmentation problem at the sentence level, the segmentation problem remains at the word level. Because it is the Chinese word (rather than the Chinese character) that is the basic semantic unit, any system that processes Chinese documents must be able to recognise the Chinese words contained in the document. As there are no word delimiters to rely on, the system must be able to decide how to segment the document into individual words that reflect the correct meaning of the document. This is not always an easy task.

Over the years, much work has been done on the Chinese segmentation problem. To date, two approaches stand out: (1) the dictionary-based approach, and (2) the statistics-based approach.

The dictionary-based approach relies on dictionaries that contain the most common words and employs heuristic rules to recognise compound words not found in the dictionaries. Using this approach, the system’s performance in segmentation depends greatly on the comprehensiveness of the dictionary. In recognising words, one commonly used algorithm is “longest matching” (also called “maximum matching”). This algorithm seeks to identify all words in a sentence by scanning it from beginning to end, always preferring the longest matching word whenever there is a choice. Thus, for example, for the sequence of characters 人手翻譯, the algorithm will recognise it as the single word 人手翻譯 (manual translation) rather than the three words 人 手 翻譯 (man, hand, translation). The algorithm generally yields the smallest number of words in a sentence.
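The following sketch illustrates the longest-matching idea against a small, invented dictionary; the word list and the maximum word length are assumptions made purely for illustration and do not reflect the contents of any real search-engine dictionary:

    # Greedy "longest matching" (maximum matching) segmentation.
    DICTIONARY = {"人", "手", "人手", "翻譯", "人手翻譯",
                  "我們", "要", "發展", "中國", "家用電器", "發展中國家", "用", "電器"}
    MAX_WORD_LEN = 5

    def longest_match_segment(sentence: str) -> list[str]:
        # Scan left to right, always taking the longest dictionary word that
        # matches at the current position; fall back to a single character.
        words, i = [], 0
        while i < len(sentence):
            for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
                candidate = sentence[i:i + length]
                if length == 1 or candidate in DICTIONARY:
                    words.append(candidate)
                    i += length
                    break
        return words

    print(longest_match_segment("人手翻譯"))               # ['人手翻譯'], not ['人', '手', '翻譯']
    print(longest_match_segment("我們要發展中國家用電器"))   # ['我們', '要', '發展中國家', '用', '電器']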

However, it does not always guarantee correct segmentation, as sometimes the shorter words may actually be the correct ones in the given context.

The statistics-based approach[4], on the other hand, does not require any dictionary. Instead, a set of manually segmented sentences is fed to the system to serve as training data. The system then compiles statistical information from the data (such as word occurrence frequencies), which is used to create a table of words and their corresponding weights. These weights are used to compute the score for a potential segmentation of a sentence. If a sentence can be segmented in more than one way, the segmentation with the highest score, computed from the weights of the words identified therein, will be selected. Clearly, this approach relies heavily on the table generated from the training data, and the accuracy and scope of the training data will therefore significantly affect the system’s performance in segmentation.

Processing Chinese in HKLII

Central to the processing of Chinese in HKLII is the issue of how to conduct searches on Chinese documents. For English documents, the question is well settled: like all other LIIs, HKLII uses the Sino search engine[5] for indexing and searching. Sino is a stable, robust and fast free-text retrieval engine intended for use with httpd and other embedded applications. Unfortunately, it does not support indexing and searching in non-western languages. This presents a problem for HKLII, which must provide the user with facilities to search in both English and Chinese.

There are two options for HKLII to expand its search capabilities to Chinese documents: one is to develop its own search engine; the other is to use freeware available in the public domain. After an initial search on the Web, we located several search engines that, although not perfect, can nonetheless provide most of what we need. In view of the urgent need to provide Chinese search capabilities in HKLII, we decided to make use of existing search engines rather than build a search engine of our own.

Of all the search engines we found on the web, two meet our main criteria of being free and being able to support Chinese indexing and searching: ASPseek[6] and mnoGoSearch[7], both distributed under the GNU General Public Licence[8]. The two engines are related in that ASPseek was built by some former developers of UdmSearch, which was later renamed mnoGoSearch. Among their similar features, both search engines support Unicode and use the dictionary-based approach for segmentation. We have studied the two search engines and compared their features. The lists below summarise our findings.

Features provided by both ASPseek and mnoGoSearch:

• Ability to index and search through several million documents

• HTTP, HTTP proxy, FTP (via proxy) protocols

• HTTP basic authorisation

• HTTPS protocol

• Support for text/html and text/plain documents

• Support for other document types via external converters

• Multithreaded operation

• Stopwords

• Unicode support to handle multiple character sets (including CJK) simultaneously

• Character set guesser (optional)

• Robot exclusion standard (robots.txt) support

• Advanced search capabilities

• Ispell support

• Query words highlighted in results

• HTML templates for easy-to-customise search results

• Cached compressed local copy of every indexed document

• Clones (mirrored documents) detection

• Phrase segmenting for Chinese and Japanese.

Features provided by ASPseek:

• Asynchronous DNS resolver

• Settings to control network bandwidth usage and Web server load

• Real time asynchronous indexing

• Very good relevance of results

• Support of the Linux platform.

Features provided by mnoGoSearch:

• Wide range of database support including MySQL, PostgreSQL, Oracle and DB2

• Built-in SQLite database support for small sites

• Support for various platforms including FreeBSD, Linux 2.x, SunOS, Solaris, OpenBSD and AIX.

ASPseek ought to work better, as it incorporates more recent research results in processing Chinese and in indexing Asian languages. It also provides explicit support for Big5, GB and Unicode, and comes with a dictionary containing almost 130,000 Chinese words, far more comprehensive than the dictionary of mnoGoSearch, which comprises only about 40,000 words. Unfortunately, we were unable to make ASPseek run on non-Linux platforms. As the HKLII server runs on Solaris x86, we made some attempts to get the Solaris port of ASPseek to work on HKLII, but without success. This makes mnoGoSearch the only practical choice for us.

The Search Engine mnoGoSearch

The search engine mnoGoSearch is made up of two parts: the indexer and the search module. The indexer builds indices for the search module in three steps.

(1) It scans files stored on targeted web servers and extracts from them strings of Chinese characters delimited by punctuation marks, up to a maximum of thirty-two characters (a sketch of this step follows the list).

(2) Each string thus extracted is segmented based on the dictionary embedded in the search engine.

(3) The segments so obtained are stored as indices in a database used by the search module in the search process.
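Step (1) can be pictured with a short sketch. The character range used below (the basic block of CJK Unified Ideographs) and the regular-expression approach are assumptions for illustration; the character classes actually used inside mnoGoSearch may differ:

    # Extract runs of Chinese characters delimited by punctuation or other
    # non-CJK text, truncated to thirty-two characters.
    import re

    CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

    def extract_runs(text: str, max_len: int = 32) -> list[str]:
        return [run[:max_len] for run in CJK_RUN.findall(text)]

    print(extract_runs("版權條例（第528章）保障精神權利。"))
    # ['版權條例', '第', '章', '保障精神權利']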

Accordingly, to use mnoGoSearch on HKLII, we have to set up a backend database for mnoGoSearch to store its indexing information. We have chosen MySQL[9] for this purpose because MySQL has proven to be one of the most scalable and reliable databases in the open source community.

As the search process of mnoGoSearch is based on the indices stored in the backend database, the organisation of the indices has a major impact on the engine’s search performance. Because a word can be found in more than one document, each index stored by mnoGoSearch takes the form of a one-to-many mapping between a word and the URLs of all the documents found to contain the word. Such indices are kept in tables, which can be organised in one of several storage modes. These storage modes can be broadly divided into two categories: those that use a single table for storing the indices (“single table mode”) and those that use multiple tables (“multiple tables mode”).

In the single table mode, all indices are stored in one table with the structure (url_id, word, weight), where “url_id” is the unique identifier of a document containing the “word”, and the “weight” is used to assess the relevance of the document. In the multiple tables mode, the indices are stored in thirteen different tables, each with a different size to accommodate the different lengths of words. The structure of the multiple tables is otherwise similar to that of the single table. In essence, the multiple tables can be viewed as the result of splitting the single table into thirteen different tables, each with an appropriate size to cater for the words kept in that table. Each multiple table is therefore much smaller than the single table, and search on a multiple table is much faster. Hence if the index table is large, the multiple tables mode is to be preferred, as it generally performs better than the single table mode in terms of search time.
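The single table mode can be illustrated with a small sketch. SQLite is used below purely for demonstration (HKLII’s backend is MySQL), the table and column names are assumptions, and the real mnoGoSearch schema contains more columns than the three fields described above:

    # Illustrative single-table index: one row per (document, word) pair.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE word_index (url_id INTEGER, word TEXT, weight INTEGER)")
    db.executemany("INSERT INTO word_index VALUES (?, ?, ?)", [
        (1, "精神權利", 5),    # the word appears in document 1 with weight 5
        (2, "精神權利", 2),    # one word maps to many documents (one-to-many)
        (2, "版權", 7),
    ])

    # Retrieve every document associated with a word, most relevant first.
    for url_id, weight in db.execute(
            "SELECT url_id, weight FROM word_index WHERE word = ? ORDER BY weight DESC",
            ("精神權利",)):
        print(url_id, weight)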


The search performance can be improved further by a simple mechanism: not storing the word itself in the index table but rather a thirty-two-bit integer word ID computed by applying the CRC32 algorithm[10] to the word. This further reduces the size of the index table, resulting in more efficient searches. However, CRC32 does have one drawback: because it is the word ID that is stored rather than the word itself, partial matching of a word cannot be performed.
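A sketch of the word-ID computation is given below. It assumes the standard CRC-32 provided by zlib and a UTF-8 encoding of the word; whether mnoGoSearch uses exactly this combination has not been verified here:

    # Store crc32(word) instead of the word itself.
    import zlib

    def word_id(word: str) -> int:
        return zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF   # 32-bit unsigned integer

    print(word_id("精神權利"))
    # A fixed-size integer key is cheaper to store and compare than the text,
    # but the word cannot be recovered from it, so partial (sub-string)
    # matching against the word column is no longer possible.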

Adapting mnoGoSearch for Searching Chinese Documents

In mnoGoSearch, segmentation of the strings of Chinese characters extracted from websites by the indexer is performed based on the dictionary embedded in the search engine. At present, the dictionary contains only 40,000 Chinese words, most of which are common words for general use. The dictionary also comes with a frequency table listing the number of occurrences of each word in ordinary Chinese texts. Relying on this frequency table, mnoGoSearch segments a string of Chinese characters by employing dynamic programming techniques to find the segmentation with the highest probability of occurrence.
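The idea can be sketched as follows. The toy dictionary, the occurrence counts and the scoring are invented for illustration and are far simpler than what mnoGoSearch actually stores; the point is only to show how dynamic programming selects the segmentation with the highest overall probability:

    # Frequency-driven segmentation by dynamic programming.
    import math

    FREQ = {"我們": 500, "要": 800, "發展": 400, "中國": 600,
            "家用電器": 50, "發展中國家": 120, "用": 900, "電器": 200}
    TOTAL = sum(FREQ.values())

    def best_segmentation(sentence: str) -> list[str]:
        # best[i] holds (score, words) for the best segmentation of sentence[:i],
        # where the score is the sum of log-probabilities of the chosen words.
        best = [(0.0, [])] + [(-math.inf, []) for _ in sentence]
        for end in range(1, len(sentence) + 1):
            for start in range(max(0, end - 6), end):        # words up to 6 characters
                word = sentence[start:end]
                count = FREQ.get(word, 0.5 if len(word) == 1 else 0)  # tiny fallback for lone characters
                if count == 0:
                    continue
                score = best[start][0] + math.log(count / TOTAL)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
        return best[len(sentence)][1]

    print(best_segmentation("我們要發展中國家用電器"))
    # With these invented counts the highest-scoring segmentation is
    # ['我們', '要', '發展中國家', '用', '電器'].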

While mnoGoSearch works well in recognising ordinary Chinese words, it is not designed to recognise Chinese words that are absent from its dictionary. For HKLII, virtually all the documents contain legal terms not found in the dictionary of mnoGoSearch. Such legal terms are unlikely to be recognised by the indexer and to appear as segments on their own. Rather, a legal term may either be (1) split between adjacent segments produced by the indexer, or (2) appear as part of a segment (together with other characters).

Because such segments are used as indices by mnoGoSearch, the documents containing the legal term in scenario (1) will be associated with indices that do not contain the legal term, whereas the documents in scenario (2) will be associated with indices that contain the legal term (amidst other Chinese characters). This means that a search for the legal term would miss the documents in scenario (1), as none of the indices contains the term; but it should be able to catch the documents in scenario (2), provided that the search engine performs “sub-string searching”, that is, looking for sub-strings in the indices.

Sub-string Searching

The concept of looking for sub-strings is not new. Generally, when a keyword is submitted to a search engine, the search engine will look in its database for indices whose word field matches the keyword. Sub-string searching means that a search is considered successful even if the keyword matches only part of the word field in the index. For example, if the keyword is “an”, sub-string searching will return all indices containing “an”, such as “Ant”, “Another”, “Language”, “Can”, and so on. Clearly, sub-string searching will result in more matches, but it is also more time consuming and may return indices pointing to documents that are less relevant. Hence even Google[11], one of the most popular search engines, does not support sub-string searching, in order to maintain good performance. However, as explained above, because of the limited vocabulary of the dictionary embedded in mnoGoSearch, sub-string searching seems inevitable when using mnoGoSearch on HKLII, despite the potential degradation in search performance.
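The difference between exact matching and sub-string matching on the word field can be sketched as follows; the index entries below are invented for illustration:

    # Exact versus sub-string matching against the word field of an index.
    index_words = {
        "精神權利": [12],          # the term happened to be segmented out on its own
        "保障精神權利": [45],      # the term is buried inside a longer segment (scenario (2))
        "精神": [3],
    }

    keyword = "精神權利"

    exact = {w: urls for w, urls in index_words.items() if w == keyword}
    substring = {w: urls for w, urls in index_words.items() if keyword in w}

    print(exact)       # matches only the index whose word field equals the keyword
    print(substring)   # also catches the longer segment, and hence more documents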

To ascertain the effectiveness of the dictionary in mnoGoSearch for HKLII and to see whether sub-string searching is indeed necessary, we performed some tests on mnoGoSearch, running searches with and without sub-string searching and comparing the results. We found that for HKLII, sub-string searching substantially increases the number of documents retrieved. As an example, the search for “精神權利” (moral rights) returned only nineteen documents when sub-string searching was not used, but thirty documents when it was. The nineteen documents returned without using sub-string searching all share one feature: the word 精神權利 in these documents is either delimited by punctuation marks (e.g. placed within brackets) or delimited by spaces. This reflects the limitation of mnoGoSearch: it uses the segmentation method for English words (i.e. looking for delimiters such as spaces and punctuation marks), and words not present in its dictionary are unlikely to be recognised.

Our Experience with mnoGoSearch

In the early days of using mnoGoSearch on HKLII, we set up the search engine to index all documents stored on the HKLII server. There were approximately 160,000 documents, in either Chinese or English. With this amount of data, our search engine worked fairly well for words that occur frequently, such as 香港 (Hong Kong) and 中國 (China). With sub-string searching, the search engine could return the result in a short time, usually within 10 seconds. However, the search engine would produce very poor results for Boolean queries, especially for complex Boolean expressions involving the less common Chinese words. Such queries appeared to overload the search engine, resulting in “data transfer timeout” errors.

To reduce the search time, we decided to cut down the size of the backend database for searching Chinese documents by separating the Chinese documents from the English documents. For the English documents, we stayed with Sino, the search engine used by all other LIIs. But for Chinese documents, we used mnoGoSearch with sub-string searching as the default mode. With this reorganisation, the search space of mnoGoSearch was drastically reduced to only 86,000 documents (all in Chinese). This greatly improved the search performance, especially for Boolean queries. We tried queries involving up to eight “AND” operators, and the search engine was generally able to return the results within 30 seconds. Although this was not as fast as Sino in searching English documents, it was acceptable as a start.

To further reduce the search time, we have tried fine-tuning the backend MySQL database, for example by putting the database on a faster RAID device and running the “OPTIMIZE TABLE” command periodically to defragment the data storage. We have also made use of a search-results cache, which allows very fast responses to queries that produce the same results as recent queries recorded in the cache.
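The principle of the search-results cache can be pictured with a minimal sketch; the function names below are hypothetical, and the real mnoGoSearch cache is considerably more elaborate:

    # Identical repeated queries are answered from memory instead of the database.
    from functools import lru_cache

    def run_query_against_backend(query: str) -> list[str]:
        print("hitting the database for:", query)   # placeholder for the real search
        return ["doc1", "doc2"]

    @lru_cache(maxsize=256)
    def cached_search(query: str) -> tuple[str, ...]:
        return tuple(run_query_against_backend(query))

    cached_search("精神 & 權利")   # first call reaches the database
    cached_search("精神 & 權利")   # the identical query is answered from the cache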

Apart from improving its search performance, we have further adapted mnoGoSearch to make it look and feel like Sino. This is important to the user. As both mnoGoSearch and Sino are used in HKLII, the existence of the two search engines should be made as transparent to the user as possible. Furthermore, since all LIIs use Sino, it would be convenient for the user if mnoGoSearch also provided a search interface similar to that of Sino. Fortunately, mnoGoSearch does work in a way similar to Sino: both use the same symbols “&” (and) and “|” (or) in Boolean queries, both allow searching within a selected area (e.g. searching in ordinances only), and both provide similar options for searching (e.g. matching all keywords, or any one of them).

As an enhancement to the searching facility, we have also improved the highlighting function of mnoGoSearch by ensuring that it correctly highlights all occurrences of a keyword found in a document requested by the user. The original highlighting function of mnoGoSearch does not work well for Chinese characters. We have redesigned this part by using Sino’s search quoting function as an external wrapper to do the highlighting, and by applying a JavaScript library available from the Web called “searchhi”[12], which automatically highlights all occurrences of a keyword in a document returned by a search engine.
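Conceptually, highlighting a Chinese keyword is simpler than segmenting it, because every occurrence of the keyword can be wrapped directly without any word-boundary logic. The sketch below illustrates the idea only; HKLII relies on Sino’s quoting function and the searchhi library rather than server-side replacement, and the class name used is arbitrary:

    # Wrap every occurrence of the keyword in a marker element.
    # The input is assumed to be plain (already escaped) text.
    def highlight(text: str, keyword: str) -> str:
        return text.replace(keyword, f'<span class="highlight">{keyword}</span>')

    print(highlight("本條例保障精神權利。精神權利不可轉讓。", "精神權利"))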

Possible Extensions for HKLII

The discussion above describes the work we have done on HKLII so far. mnoGoSearch is by no means perfect, and indeed does not guarantee retrieving all the documents that match a user’s query. What is clear, though, is that sub-string searching must be used in mnoGoSearch to improve searching accuracy; otherwise a substantial number of documents will be missed because of incorrect indices built from the limited vocabulary of the dictionary embedded in mnoGoSearch.

However, sub-string searching cannot help to find documents that are associated with indices that do not contain the keyword the user is looking for. This situation may arise because of incorrect segmentation by the indexer of mnoGoSearch, which results in the keyword being split between adjacent segments, as described earlier. One possible solution to this problem is to build into mnoGoSearch our own special-purpose dictionary for HKLII; the other is to construct our own search engine for HKLII.

Building our own special purpose dictionary

Since mnoGoSearch relies on its dictionary to segment strings of characters extracted from websites for building indices, the quality of such indices depends heavily on the relevance and suitability of the dictionary. A good and appropriate dictionary is therefore critical. Accordingly, we have decided to build such a dictionary in the next stage of development for HKLII. We will have to analyse all the documents stored in HKLII in order to extract the appropriate words. Furthermore, once the dictionary is built, we will have to keep it up to date by continuing to recognise new words found in the new documents that are constantly added to HKLII.

Building our own search engine

Alternatively, we may implement our own search engine for HKLII instead of using mnoGoSearch. To avoid the segmentation problem in Chinese, we may construct a “monogram” search engine that builds indices based on characters instead of words. It will be the individual characters, rather than the words, that are recognised during the indexing phase. A one-to-many mapping table will then be built to store each character along with the URLs of all the documents found to contain that character.

This table will be used to look up Chinese documents. If a user searches for Chinese documents containing a keyword, the search engine will use the table to retrieve, for each character in the keyword, the URLs of all the documents that contain that character. This process is repeated for each character in the keyword, resulting in one set of URLs per character. The search engine will then compute the intersection of all such sets, the result of which will be the URLs of all the documents containing all the characters. A final check on these documents will see if the characters appear contiguously and in the same order as set out in the user query. Only documents that satisfy this condition will be returned to the user.

Thus, for example, if a user searches for 香港 (Hong Kong), the search engine will first look for the URLs of all documents that contain the character 香, then the URLs of all documents that contain the character 港. These two lists will be compared to find the URLs of documents that contain both characters. A final check is then performed on these documents to select only those in which the two characters appear as an adjacent pair. Such documents will be returned to the user.
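A minimal sketch of such a monogram index follows. The documents and URLs are invented for illustration, and a production engine would of course keep the index in a database rather than in memory:

    # Character-level ("monogram") index: map each character to the set of
    # documents containing it, intersect the sets for the query's characters,
    # then confirm the characters occur adjacently and in order.
    from collections import defaultdict

    docs = {
        "url1": "香港法律資訊中心",
        "url2": "維港的檀香木貿易",   # contains both 香 and 港, but not adjacently
        "url3": "香港特別行政區",
    }

    index = defaultdict(set)
    for url, text in docs.items():
        for ch in text:
            index[ch].add(url)

    def search(keyword: str) -> list[str]:
        candidates = set.intersection(*(index[ch] for ch in keyword))
        # final check: the characters must appear contiguously, in query order
        return [url for url in sorted(candidates) if keyword in docs[url]]

    print(search("香港"))   # ['url1', 'url3']; url2 fails the adjacency check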

Such a monogram search engine will only be applicable to searching Chinese documents; it makes little sense to index English documents merely by the twenty-six Roman letters. Thus even if such a search engine is constructed, it will be confined to Chinese documents, and will remain separate from the Sino engine used for searching English documents.

Conclusion

The experience of HKLII shows that managing a bilingual legal database comprising English and Chinese documents is not a trivial task. Because of the difference in encoding between Chinese and English, and the absence of word boundaries in Chinese, many of the existing tools (most notably, search engines) for handling English documents cannot be used for Chinese documents.


[1] <http://www.hklii.org.hk>.

[2] Hong Kong Royal Instructions 1917–1993 (1 and 2) (formal instructions, issued under the Royal Sign Manual and Signet, to the Governor of Hong Kong).

[3] Andy Wu and Zixin Jiang, “Word Segmentation in Sentence Analysis” <http://research.microsoft.com/nlp/publications/ICCIP98.pdf>.

[4] Jiang Chen, “Parallel Text Mining for Cross-Language Information Retrieval, Using a Statistical Translation Model” <http://www.iro.umontreal.ca/~chen/thesis/thesis.html>.

[5] Sino (short for “Size Is No Object”) is a free-text retrieval engine developed by AustLII <http://www.austlii.edu.au/austlii/help/sino.html>.

[6] ASPseek is an Internet search engine developed by SWsoft and licensed as free software under the GNU GPL <http://www.aspseek.org>.

[7] mnoGoSearch (formerly known as UdmSearch) is a full-featured web search engine for intranet and internet servers. mnoGoSearch for UNIX is free software covered by the GNU GPL <http://www.mnogosearch.org>.

[8] <http://www.gnu.org/copyleft/gpl.html>.

[9] <http://www.mysql.com>.

[10] See RFC 1510, Section 6.4.1 <http://www.ietf.org/rfc/rfc1510.txt>.

[11] <http://www.google.com>.

[12] <http://www.kryogenix.org/code/browser/searchhi>.
