64KB String limit in Java data streams


Java's DataOutputStream and ObjectOutputStream cannot write a String through writeUTF() if its encoded form is larger than 64KB. Let's try to write a really long String into a data stream:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class Demo {
        public static void main(String[] args) throws Exception {
            // generate a string longer than 64KB
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10000; i++) {
                sb.append("1234567890");
            }
            String s = sb.toString();

            // write the string into the stream
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(baos);
            dos.writeUTF(s);
            dos.close();
        }
    }

If you run the code above, you will get something like this:

    Exception in thread "main" java.io.UTFDataFormatException: encoded string too long: 100000 bytes
        at java.io.DataOutputStream.writeUTF(DataOutputStream.java:347)
        at java.io.DataOutputStream.writeUTF(DataOutputStream.java:306)
        at com.example.Demo.main(Demo.java:28)

What just happened? The Javadoc comes to the rescue:

First, two bytes are written to out as if by the writeShort method giving the number of bytes to follow.

A two-byte length prefix caps the encoded length at 65,535 bytes, just under 64KB. Digging into the JDK sources, there is an explicit check for it:

    if (utflen > 65535)
        throw new UTFDataFormatException(
            "encoded string too long: " + utflen + " bytes");

What can we do about it?
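Before picking a workaround, it can help to check up front whether a given String would hit the limit. The sketch below estimates the encoded size using the modified UTF-8 rules documented for DataInput (one, two or three bytes per char). The class and method names are made up for illustration and are not part of the JDK.

```java
class Utf8LengthCheck {
    /** Largest encoded length writeUTF() accepts (maximum value of the two-byte prefix). */
    static final int MAX_WRITE_UTF_BYTES = 65535;

    /** Bytes the string occupies in modified UTF-8, not counting the two-byte prefix. */
    static int modifiedUtf8Length(String s) {
        int len = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                len += 1;            // plain ASCII except NUL
            } else if (c <= 0x07FF) {
                len += 2;            // includes '\u0000', which is encoded in two bytes
            } else {
                len += 3;            // everything else, including each surrogate half
            }
        }
        return len;
    }

    /** Hypothetical helper: true if writeUTF(s) would not throw for length reasons. */
    static boolean fitsInWriteUTF(String s) {
        return modifiedUtf8Length(s) <= MAX_WRITE_UTF_BYTES;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("1234567890");
        }
        System.out.println(fitsInWriteUTF("short"));        // small strings fit
        System.out.println(fitsInWriteUTF(sb.toString()));  // 100000 ASCII chars do not
    }
}
```

Note that the limit applies to encoded bytes, not characters: a String of 30,000 non-ASCII characters can already be too long.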

If you are using a 3rd party library and have no means to access the source, then you are at their mercy, and you can only hope that you won't have such long Strings.

If you are able to access the source code, you have a better chance: you can define or modify the binary format of your data. Of course there are cases when this is not really possible, but for now, let us suppose you have created your binary format in an extensible way (with version bits or some other tracking), because that allows us to focus only on the writeUTF() method:

(1) Use byte[] arrays

You can manually transform the String to byte[] (e.g. with s.getBytes("utf-8")). Write the buffer's length as a 4-byte int prefix before the bytes, and reading won't be a problem either.
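A minimal sketch of this approach, with the matching read side, could look like the following. The class and method names are hypothetical, and the round-trip helper exists only to demonstrate the pair on an in-memory buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class LongStringBytes {
    /** Writes a 4-byte length followed by the UTF-8 bytes of the string. */
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    /** Reads a string previously written by writeLongString(). */
    static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("negative string length: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);   // blocks until all bytes arrive or throws EOFException
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Demonstration helper: writes to memory and reads the value back. */
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeLongString(out, s);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return readLongString(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The int prefix raises the ceiling from 65,535 bytes to roughly 2GB, at the cost of two extra bytes per String compared to writeUTF().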

(2) Split your String into smaller chunks

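You might split the String into chunks of about 16K characters and call writeUTF() for each of them. Each character takes at most three bytes in the stream's encoding, so every chunk stays safely below the limit. Easy, and no need to mess with manual byte[] transforms.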
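One way to sketch the chunked variant is shown below. It assumes the number of chunks is written first as an int so that the reader knows when to stop; that count, like the class and method names, is an illustrative choice and not something the original format prescribes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ChunkedUtf {
    /** Chunk size in chars; at most 3 bytes per char keeps each chunk under 65535 bytes. */
    static final int CHUNK_CHARS = 16 * 1024;

    /** Writes the chunk count, then each chunk with writeUTF(). */
    static void writeChunked(DataOutputStream out, String s) throws IOException {
        int chunks = (s.length() + CHUNK_CHARS - 1) / CHUNK_CHARS;
        out.writeInt(chunks);
        for (int start = 0; start < s.length(); start += CHUNK_CHARS) {
            int end = Math.min(start + CHUNK_CHARS, s.length());
            out.writeUTF(s.substring(start, end));
        }
    }

    /** Reads the chunk count and concatenates the chunks. */
    static String readChunked(DataInputStream in) throws IOException {
        int chunks = in.readInt();
        if (chunks < 0) {
            throw new IOException("negative chunk count: " + chunks);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < chunks; i++) {
            sb.append(in.readUTF());
        }
        return sb.toString();
    }

    /** Demonstration helper: writes to memory and reads the value back. */
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeChunked(out, s);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return readChunked(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Splitting on char boundaries is safe here because writeUTF() encodes each char on its own, so a surrogate pair that straddles two chunks is still restored when the chunks are concatenated.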

(3) Use a custom length prefix

If you are lucky, you may even be able to upgrade your binary format incrementally. Because writeUTF() fails on null values, you may already wrap the writes in blocks like the following:

    if (s == null) {
        // mark that we had a null value
        dos.writeByte(0);
        // no string to write
    } else {
        // mark the non-null reference
        dos.writeByte(1);
        // write the string
        dos.writeUTF(s);
    }

This code uses a single byte prefix to mark the null/non-null value of the following String. One can easily extend the code to check for String length and perform different writes, like this:

    if (s == null) {
        dos.writeByte(0);
    } else if (s.length() < 16*1024) {
        dos.writeByte(1);
        dos.writeUTF(s);
    } else {
        // here comes the simple workaround; a distinct marker value
        // lets the reader tell this form apart from the writeUTF() one
        dos.writeByte(2);
        byte[] b = s.getBytes("utf-8");
        dos.writeInt(b.length);
        dos.write(b);
    }
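For the reading side to work, the reader has to know which of the two forms follows the prefix byte. The sketch below assumes a third marker value for the long form; the tag constants, class name and method names are hypothetical choices made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class TaggedStringCodec {
    static final int TAG_NULL = 0;    // null reference, nothing follows
    static final int TAG_SHORT = 1;   // writeUTF() form follows
    static final int TAG_LONG = 2;    // assumed extra tag: int length + UTF-8 bytes follow
    static final int SHORT_LIMIT_CHARS = 16 * 1024;

    static void writeString(DataOutputStream out, String s) throws IOException {
        if (s == null) {
            out.writeByte(TAG_NULL);
        } else if (s.length() < SHORT_LIMIT_CHARS) {
            out.writeByte(TAG_SHORT);
            out.writeUTF(s);
        } else {
            out.writeByte(TAG_LONG);
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
        }
    }

    static String readString(DataInputStream in) throws IOException {
        int tag = in.readUnsignedByte();
        switch (tag) {
            case TAG_NULL:
                return null;
            case TAG_SHORT:
                return in.readUTF();
            case TAG_LONG: {
                int length = in.readInt();
                if (length < 0) {
                    throw new IOException("negative string length: " + length);
                }
                byte[] bytes = new byte[length];
                in.readFully(bytes);
                return new String(bytes, StandardCharsets.UTF_8);
            }
            default:
                throw new IOException("unknown string tag: " + tag);
        }
    }

    /** Demonstration helper: writes to memory and reads the value back. */
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeString(out, s);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return readString(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Data written by the older null/non-null code stays readable with this reader, since tags 0 and 1 keep their original meaning.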

There are no silver bullets for this problem, and in the end, workarounds contain some level of "hacks".

I originally published this short article on oktech in 2009.

Last updated: August 29, 2014

István Soós, software engineer, business advisor

Advocates for the maker-movement, self-directed learning and agile methods. His regular topics include: machine intelligence, data and risk analysis, distributed systems and knowledge management.
