Diacritics and Special Characters

Innovative library systems installed in July 2004 or later store diacritics and special characters in Unicode. Unicode code points are officially written in hexadecimal (base 16) and prefixed with U+ – e.g., U+00E1. To maintain compatibility with Millennium, Unicode values are stored within braces and prefixed with a “u” (e.g., {u00e1}) to distinguish them from the Innovative diacritics that pre-date the Unicode Standard (http://www.unicode.org/). Diacritic/letter combinations that do not have a single Unicode value, such as “u” with a dot above, are stored with the diacritic value following the letter it modifies: u{u307}. To display the Unicode values of diacritics and special characters in a record, open the record in View or Edit mode and choose View > Show Codes from the menu bar.
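
The braces notation above can be generated mechanically from any character's code point. The sketch below is illustrative only (it is not a Millennium utility), and it assumes zero-padded four-digit lowercase hex codes in the style of {u00e1}:

```python
# Sketch: render non-ASCII characters in Millennium-style {u<code>}
# notation. Assumes four-digit, lowercase, zero-padded hex codes;
# this helper is hypothetical, not part of Millennium itself.

def to_millennium(text: str) -> str:
    """Replace each non-ASCII character with its {uXXXX} form."""
    return "".join(
        ch if ord(ch) < 128 else "{u%04x}" % ord(ch)
        for ch in text
    )

print(to_millennium("\u00e1"))   # precomposed a-acute (U+00E1) -> {u00e1}
print(to_millennium("u\u0307"))  # "u" + combining dot above -> u{u0307}
```

Note how the second call mirrors the rule above: the combining diacritic's value follows the letter it modifies.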

NOTE: “The Unicode names of characters are written in all uppercase in the Unicode standard, but this is just a convention. In fact, the standard itself spells the names in all lowercase in some contexts.”1

Record Import

From OCLC Connexion Client

The Connexion client exports authority and bibliographic records in MARC-21 format using either the MARC-8 or UTF-8 Unicode character set. Along with converting between encodings, these selections can result in the following changes.2

For MARC-8 exports,

  • leader position 9 is set to “ ” (blank)
  • Unicode characters that cannot be converted to MARC-8 are changed to the hexadecimal value enclosed in brackets
  • field 066 and subfield 6 (without a script identifier) are included for records containing non-Latin scripts

For Unicode exports,

  • leader position 9 is set to “a”
  • all characters are exported, even if not included in MARC-8 characters approved for MARC-21 cataloging
  • field 066 is not included, but a script identifier is for records containing non-Latin scripts

No matter the encoding standard selected for import into Millennium, all diacritics and special characters are stored in Unicode. This means all records with diacritics and leader position 9 set to “ ” (blank) use UTF-8 character encodings instead of MARC-8. One aid in controlling such miscoded records (and data), so they comply with the MARC-21 encoding rules, is to know the encoding of both the incoming records and the library database, and to ensure the two match from source to destination. Thus, a Unicode-configured database should export, and subsequently import, MARC records in Unicode UTF-8.
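
The mismatch described above can be spotted heuristically by comparing leader position 9 against the record's actual bytes. The helper below is an assumption-based sketch (not an Innovative or OCLC utility); pure-ASCII data is valid in both encodings, so it can only flag likely problems:

```python
# Sketch: rough check that MARC Leader/09 agrees with the data.
# Leader/09 is "a" for UTF-8 and " " (blank) for MARC-8. Heuristic
# only: some MARC-8 byte runs can coincidentally form valid UTF-8.

def check_leader_encoding(record: bytes) -> str:
    """Return a rough verdict on whether Leader/09 matches the data."""
    declared = chr(record[9])                 # "a" = UTF-8, " " = MARC-8
    has_high_bytes = any(b > 0x7F for b in record)
    try:
        record.decode("utf-8")
        decodes_as_utf8 = True
    except UnicodeDecodeError:
        decodes_as_utf8 = False

    if declared == "a" and not decodes_as_utf8:
        return "declared UTF-8 but not valid UTF-8"
    if declared == " " and has_high_bytes and decodes_as_utf8:
        return "declared MARC-8 but looks like UTF-8"
    return "consistent (as far as this check can tell)"

# Leader declares MARC-8 (blank at position 9) but the data is UTF-8:
rec = b"00000nam  2200000 a 4500" + "\u00e1".encode("utf-8")
print(check_leader_encoding(rec))  # declared MARC-8 but looks like UTF-8
```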

From All Other Sources

Importing system-compatible diacritics and special characters largely depends on whether Unicode UTF-8 is set as the character encoding in the export settings of the client used to download or transfer MARC-21 authority and bibliographic records, or specified in a library profile for third-party vendors that supply records or assist with other cataloging services.

If Unicode UTF-8 is not provided as an option and the records contain scripts outside of the MARC-8 character set, such as Bengali, Devanagari, Tamil, or Thai, then the MarcEdit “Translate to UTF8” tool can be used to convert the data.

MarcEdit Translate to UTF8 Tool

If Unicode UTF-8 is not provided as an option and the records contain scripts within the MARC-8 character set, then diacritics and special characters should be automatically converted to their Unicode equivalents.

Record Creation and Editing

Adding or changing diacritics or special characters in records (i.e., inserting Unicode characters) is done either by using the Character Map or by entering their Unicode values, and less often by copying and pasting.

To use the Character Map:

  1. Once in a record, dialog box, or anywhere one can enter text, choose the menu options Tools > Character Map.
  2. Choose a character set from the Code Chart drop-down menu.
  3. Place the cursor where the diacritic or special character is to go (either in place of or following the letter it modifies) and select the Insert button.
  4. Choose the Close button to close the Character Map.

An alternative to using Millennium’s character maps is to enter the hexadecimal equivalent of the diacritic or special character. This is done by prefixing the Unicode value with “u” and then enclosing the entire identifier in braces (i.e., {u<code>}). Hexadecimal values are provided in each character map, or code chart, as the combination of the left-hand row headers and the top column headers. For example, 002 + 1 yields the Unicode value 0021 and thus the exclamation mark. Outside of Millennium, hexadecimal values are provided in the Unicode Consortium’s code charts at http://www.unicode.org/charts/.
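
The {u<code>} identifiers can also be expanded in the other direction, from notation back to characters. The sketch below is a hypothetical helper (not a Millennium function) that mirrors the row-plus-column example above, where 002 + 1 gives 0021, the exclamation mark:

```python
import re

# Sketch: expand Millennium-style {u<code>} identifiers back into
# the characters they name. Illustrative helper, not a system tool.

def expand_codes(text: str) -> str:
    """Replace each {u<hex>} identifier with its Unicode character."""
    return re.sub(
        r"\{u([0-9a-fA-F]+)\}",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

print(expand_codes("{u0021}"))       # row 002 + column 1 -> "!"
print(expand_codes("Espa{u00f1}a"))  # -> "España"
```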

Copying and pasting diacritics or special characters from other cataloging utilities into Millennium may cause a mix of encodings, and “a MARC record should not mix encodings – it should be either MARC 8 or Unicode UTF-8” (McCallum, 2005, p. 4). With MARC-8 records coded as UTF-8, UTF-8 records coded as MARC-8, and records coded as MARC-8 or UTF-8 that are actually neither, one is better off not making edits via copy/paste. Along with mis-encoding scripts, copying and pasting can also introduce other invalid characters, especially non-printing ones like carriage returns or tabs. The solution, then, is to either use the system-provided character maps or enter Unicode values.
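
The non-printing characters mentioned above can be caught before a pasted value is saved. The sketch below is a rough guard under stated assumptions (it flags Unicode control and format characters, categories Cc and Cf); it is not a Millennium feature:

```python
import unicodedata

# Sketch: scan pasted text for non-printing characters (control and
# format characters such as tabs, carriage returns, and zero-width
# joiners). Hypothetical guard, not part of Millennium.

def find_control_chars(text: str):
    """Return (index, U+XXXX, name) for each control/format character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in ("Cc", "Cf")
    ]

pasted = "Garc\u00eda\tMarquez\r"   # pasted field with a tab and a CR
for hit in find_control_chars(pasted):
    print(hit)                      # flags the tab and carriage return
```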

Record Export

MARC-8 is the current encoding for output data, so byte 9 of the MARC leader is updated to “ ” (blank) on export. The export table can be customized to output Unicode UTF-8, or a second table can be added to support both encodings.
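
When records are transcoded outside Millennium, leader position 9 should be kept in step with the chosen output encoding. The helper below is a generic sketch (the function name and signature are assumptions, not part of the export table):

```python
# Sketch: set MARC Leader/09 for a chosen output encoding.
# "a" declares UTF-8; " " (blank) declares MARC-8. Hypothetical helper.

def set_leader_09(record: bytes, encoding: str) -> bytes:
    """Return the record with Leader/09 set for the given encoding."""
    flag = b"a" if encoding.lower() in ("utf-8", "utf8") else b" "
    return record[:9] + flag + record[10:]

leader = b"00000nam a2200000 a 4500"
print(set_leader_09(leader, "marc-8"))  # position 9 becomes blank
```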

 

Sources

  1. Jukka K Korpela, Unicode Explained (Sebastopol, CA: O’Reilly Media, 2006), 19, https://books.google.com/books?id=lxndiWaFMvMC&lpg=PR1&dq=unicode%20korpela%202006.
  2. OCLC, “Cataloging: Export or Import Bibliographic Records,” OCLC Connexion Client Guides (2014): 12, https://www.oclc.org/content/dam/support/connexion/documentation/client/cataloging/exportimport/exportimportbib.pdf.

 

Related: http://library.clemson.edu/depts/olt/diacritics-troubleshooting/

Last updated on 1/24/2018