Gulliver

We have been fortunate to hang onto one of our summer interns for part-time weekend work during the current school year. One of the intern’s jobs is to load documents and data which are then processed.  The documents are .txt, .docx, and .pdf files. The data files are raw sensor outputs, usually captured with ADCs, mostly at eight-bit precision.  All files are loaded or moved from one machine to another with sftp.

The intern noticed right away that the documents transfer perfectly from our PPC and SPARC machines to our Intel/CentOS
platforms.  The raw data files, not so much.  There is always an endianness (thanks, Gulliver) issue, which we assume is due to the data bytes being packed into 32-bit words somewhere on the big-endian systems.  It is not totally clear why the document files do not have this issue.  If there is a known principle behind these observations, we would very much appreciate any information that can be shared.
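
A minimal C sketch of the suspected mechanism (the sample values are invented for illustration): the same four bytes on disk yield different 32-bit words depending on the host’s byte order.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        /* Four 8-bit ADC samples as they might sit in a raw capture file. */
        uint8_t samples[4] = { 0x01, 0x02, 0x03, 0x04 };

        /* Reinterpreting those bytes as one 32-bit word gives a
         * host-dependent answer: 0x04030201 on little-endian x86,
         * 0x01020304 on big-endian PPC or SPARC. */
        uint32_t word;
        memcpy(&word, samples, sizeof word);
        printf("word = 0x%08x\n", (unsigned)word);
        return 0;
    }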

5 thoughts on “Gulliver”

  • Transferring a file will not change anything. It will be bit-wise identical.

    However, the data in the file may be stored in little- or big-endian byte order. A file format may or may not have metadata indicating which. That is, some files will read differently on different architectures and some will be immune (due to more sophisticated abstractions).

    So it’s not surprising that your raw files will have problems.

    If you want to prove this to yourself, simply md5sum/sha1sum/etc. the files on both sides.

    /Peter K

  • Text files which are ASCII are generally 7- to 8-bit characters, so they don’t tend to have byte-endian problems on architectures with bytes of 8 bits or more. [I expect a 4-bit architecture would have problems, but few people ever dealt with 4-bit computers.] Now, Unicode encodings wider than 8 bits (UTF-16/UTF-32) can have endianness problems, but usually only when a program is not following the standard and assumes that writing the data works the same as it did with ASCII.

    .docx and .pdf are written to a fixed-endian format, so even if a file is built/written on a big-endian system the data itself is formatted to be little endian. Raw data files are usually endian-dependent if they are ‘raw’ memory dumps or similar. Some ‘data’ formats which are mostly raw are actually written to a standard and will work everywhere, because both little-endian and big-endian systems expect the data to be written in a fixed (‘big’ or ‘little’) endianness and read it in as such.
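
    A short sketch of the difference, with hypothetical file names: a raw dump of a 32-bit word is host-dependent, while a plain-text rendering of the same value is not.

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint32_t v = 0x01020304;

            /* 'Raw' memory dump: the bytes land in the file in the host's
             * native order, so the file differs between big- and
             * little-endian machines. */
            FILE *raw = fopen("raw.bin", "wb");
            if (raw) {
                fwrite(&v, sizeof v, 1, raw);
                fclose(raw);
            }

            /* ASCII text: every character is a single byte written in
             * sequence, so the file is identical on any host. */
            FILE *txt = fopen("value.txt", "w");
            if (txt) {
                fprintf(txt, "%u\n", (unsigned)v);
                fclose(txt);
            }
            return 0;
        }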

  • It’s unlikely that copying the files is causing the problem you observe.  As Peter suggested, you can use “md5sum” on the source and destination hosts to demonstrate that the files are not being modified in transmission.

    However, endianness can be a problem if the applications you use naively save data to a file in their native byte order and also read it back in native byte order.  In situations like that, a big-endian system will save data that the same application will fail to read when it is run on a little-endian system.

    If this is an application that you’ve developed in-house, you should be using htonl() to convert your 32-bit values to network byte order before writing them to the data file, and ntohl() to convert 32-bit values that you read from data files back to the native host byte order.
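
    A minimal sketch of that pattern (the helper names are illustrative, not from any existing application):

        #include <stdio.h>
        #include <stdint.h>
        #include <arpa/inet.h>  /* htonl(), ntohl() */

        /* Write one 32-bit sample in network (big-endian) byte order. */
        static int write_sample(FILE *f, uint32_t sample)
        {
            uint32_t be = htonl(sample);
            return fwrite(&be, sizeof be, 1, f) == 1 ? 0 : -1;
        }

        /* Read one 32-bit sample back into host byte order. */
        static int read_sample(FILE *f, uint32_t *sample)
        {
            uint32_t be;
            if (fread(&be, sizeof be, 1, f) != 1)
                return -1;
            *sample = ntohl(be);
            return 0;
        }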

  • …or its superset, XDR [1]
    …or use a text format (XML, JSON, YAML, SQL, CSV…)
    …or use a binary serialization of same (BSON, CBOR, Binary XML…)
    …or use FlatBuffers [2]
    …or use ASN.1 [3]

    Or, or, or. This problem is *solved*. The only difficult part is choosing which of the many available solutions to use (a sketch of the plain-text option follows the references).

    [1]: https://en.wikipedia.org/wiki/External_Data_Representation
    [2]: https://en.wikipedia.org/wiki/FlatBuffers
    [3]: https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One
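
    A sketch of roughly the simplest text-format option (the helper name is hypothetical): one decimal sample per line, readable on any host regardless of byte order.

        #include <stdio.h>
        #include <stdint.h>

        /* Dump samples as one decimal value per line.  Any host can read
         * them back with fscanf() no matter which endianness it has. */
        static void dump_samples(FILE *f, const uint8_t *samples, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                fprintf(f, "%u\n", (unsigned)samples[i]);
        }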

  • Well, yes.  But if endianness is the problem, then it’s pretty clear that none of those are in use, and I’m suggesting the absolute minimum-effort solution to the problem.