UTF-8: How to convert EBCDIC with Chinese characters to UTF-8

2024-05-17 02:07 | Source: compiled from the web

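I need to convert a file that is encoded in EBCDIC with the IBM937 code page to UTF-8, so that the file can be loaded into a multibyte-enabled DB2 database.

I have tried the Unix tools recode and iconv. Neither of them was able to convert IBM937 to UTF8. I am looking for any utility (Java, Perl, Unix) that can run on a UNIX-based system. Can anyone help?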

Take a look at ICU (International Components for Unicode): http://site.icu-project.org/

It has a converter for IBM-937: http://demo.icu-project.org/icu-bin/convexp?conv=ibm-937_p110-1999&s=all

ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software. ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.

Here are a few highlights of the services provided by ICU:

Code Page Conversion: Convert text data to or from Unicode and nearly any other character set or encoding. ICU's conversion tables are based on charset data collected by IBM over the course of many decades, and are the most complete available anywhere.

Collation: Compare strings according to the conventions and standards of a particular language, region or country. ICU's collation is based on the Unicode Collation Algorithm plus locale-specific comparison rules from the Common Locale Data Repository, a comprehensive source for this type of data.

Formatting: Format numbers, dates, times and currency amounts according to the conventions of a chosen locale. This includes translating month and day names into the selected language, choosing appropriate abbreviations, ordering fields correctly, etc. This data also comes from the Common Locale Data Repository.

Time Calculations: Multiple types of calendars are provided beyond the traditional Gregorian calendar. A thorough set of timezone calculation APIs is provided.

Unicode Support: ICU closely tracks the Unicode standard, providing easy access to all of the many Unicode character properties, Unicode Normalization, Case Folding and other fundamental operations as specified by the Unicode Standard.

Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.

Bidi: support for handling text containing a mixture of left to right (English) and right to left (Arabic or Hebrew) data.

Text Boundaries: Locate the positions of words, sentences, paragraphs within a range of text, or identify locations that would be suitable for line wrapping when displaying the text.

And much more. Refer to the ICU User Guide for details.

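Related discussion:

Thanks for your response. Is there any sample showing how to convert EBCDIC to UTF8 while reading characters from a file?

Not in Java, but the .NET world has the System.Text.Encoding class, which supports conversion between different encodings. The Microsoft framework supports some EBCDIC encodings, but not IBM937. I don't know what Mono (open source .NET) supports, or how difficult it is to write a custom encoding. The .NET approach is to read an array of octets, pass it to an encoder/decoder, and get a string back. EBCDIC double-byte content complicates this, because the shift-out and shift-in between single-byte and double-byte modes has to be handled.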
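The shift handling mentioned above can be illustrated with a small Java sketch. It assumes the usual convention for stateful mixed EBCDIC data: byte 0x0E (shift-out) switches to double-byte mode, byte 0x0F (shift-in) switches back to single-byte mode, and neither value occurs inside a double-byte character. The class and method names (ShiftStateScanner, scan, Run) are hypothetical and not part of any library. The sketch only locates the runs; it does not decode them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits mixed EBCDIC bytes into single-byte and double-byte runs. */
final class ShiftStateScanner {
    static final byte SHIFT_OUT = 0x0E; // assumed: switches to double-byte (DBCS) mode
    static final byte SHIFT_IN = 0x0F;  // assumed: switches back to single-byte (SBCS) mode

    /** A run of bytes sharing one shift state; the shift bytes themselves are excluded. */
    static final class Run {
        final boolean doubleByte;
        final int offset;
        final int length;

        Run(boolean doubleByte, int offset, int length) {
            this.doubleByte = doubleByte;
            this.offset = offset;
            this.length = length;
        }
    }

    static List<Run> scan(byte[] data) {
        List<Run> runs = new ArrayList<>();
        boolean doubleByte = false; // a stateful stream is assumed to start in single-byte mode
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            byte b = data[i];
            if (b == SHIFT_OUT || b == SHIFT_IN) {
                if (i > start) {
                    runs.add(new Run(doubleByte, start, i - start));
                }
                doubleByte = (b == SHIFT_OUT);
                start = i + 1;
            }
        }
        if (data.length > start) {
            runs.add(new Run(doubleByte, start, data.length - start));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Placeholder bytes: one SBCS byte, two arbitrary DBCS pairs, one SBCS byte.
        byte[] sample = { (byte) 0xC1, 0x0E, 0x4C, 0x41, 0x4C, 0x42, 0x0F, (byte) 0xC2 };
        for (Run r : scan(sample)) {
            System.out.println((r.doubleByte ? "DBCS" : "SBCS")
                    + " offset=" + r.offset + " length=" + r.length);
        }
    }
}
```

Running a scan like this over the first few hundred bytes of the input is a cheap way to see whether the file contains shift bytes at all, and whether every double-byte run has an even length.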

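It appears that Java can convert the IBM937 code page to UTF-8.

You specify the input format as "Cp937".

Here are two methods from the Oracle page on character and byte streams: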
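Whether that name resolves depends on the Java runtime in use, so it can be worth checking first. The following is a minimal sketch using only java.nio.charset. The class name CharsetCheck and the method isAvailable are hypothetical, and "x-IBM937" is listed on the assumption that it is the canonical name this charset has in OpenJDK builds.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

/** Hypothetical helper: reports whether the running JVM knows a charset name. */
final class CharsetCheck {
    /** Returns true if the given charset name or alias can be resolved. */
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name contains characters that are not legal in a charset name
        }
    }

    public static void main(String[] args) {
        // "x-IBM937" is an assumed canonical name; "Cp937" is the name used in this answer.
        for (String name : new String[] { "Cp937", "x-IBM937", "UTF-8" }) {
            System.out.println(name + " available: " + isAvailable(name));
        }
    }
}
```

If the first two names are reported as unavailable, the runtime probably lacks the extended charsets, and an additional charset provider would be needed.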

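    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;

    // Place this method inside a class.
    // Reads test.txt, decoding its bytes as Cp937 (IBM937) into a Java String.
    static String readInput() {
        StringBuffer buffer = new StringBuffer();
        // try-with-resources closes the reader even if reading fails
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("test.txt"), "cp937"))) {
            int ch;
            while ((ch = in.read()) > -1) {
                buffer.append((char) ch);
            }
            return buffer.toString();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }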

and

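    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    // Place this method inside a class.
    // Writes the string to test.txt, encoding it as UTF-8.
    // Note: this is the same file name that readInput() reads, so the input is
    // overwritten; use a different output path if the original must be kept.
    static void writeOutput(String str) {
        // try-with-resources flushes and closes the writer even if writing fails
        try (Writer out = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF8")) {
            out.write(str);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Related discussion:

This one did not work at first. When I imported the icu4j charset library and reran, it seemed to run fine. However, my Chinese EBCDIC characters do not seem to be converted the way I expected. Any ideas? The original characters look like this: 凧刡凮剭勄剭刣剭刡匎刪？僔働_牼僉僌僑僒匎刪？僔働_牼僉僌僑僑僒？ After conversion I get something like: 副总裁？失去_VP？_lost______

My first thought is: are you sure the input is IBM937 (Cp937)? There are some other IBM EBCDIC Chinese code pages: CP1371, CP950, CP964 and CP948.

Yes, I did. Some of the charsets above do not seem to convert at all, and the others seem to give the same result. Also, the sender confirmed that the file is sent using code page IBM937 only.

I'm stumped. Maybe you could try UTF-16 as the output and see whether that makes any difference.

Still the same. Let me triple-check with the sender. Thanks for your quick replies.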
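When the output looks wrong but no exception is thrown, one thing worth ruling out is silent substitution: by default, Java readers replace bytes they cannot decode instead of failing. The sketch below is one possible way to make such problems visible, under the assumption that the source charset is passed in by the caller. The class name StrictTranscoder and the method toUtf8 are hypothetical. The decoder is configured to report malformed or unmappable input, so a wrong code page guess surfaces as a CharacterCodingException rather than as unexpected characters.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: converts a file to UTF-8 and fails loudly on undecodable bytes. */
final class StrictTranscoder {
    /** Copies source to target, decoding with 'from' and encoding as UTF-8. */
    static void toUtf8(Path source, Path target, Charset from) throws IOException {
        // REPORT makes the reader throw instead of substituting a replacement character.
        CharsetDecoder decoder = from.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(source), decoder));
             BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: StrictTranscoder <input> <output> [charset]");
            return;
        }
        // Cp937 is the default only because it is the code page discussed in this thread.
        Charset from = Charset.forName(args.length > 2 ? args[2] : "Cp937");
        toUtf8(Paths.get(args[0]), Paths.get(args[1]), from);
    }
}
```

Unlike the two tutorial methods, this writes to a separate output path and streams the data in chunks, so the input file is left untouched and does not have to fit in memory as a single string.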
