Decoding Newlines

From CodeCodex

There are three newline conventions for text files in common use:

  • Carriage-return+linefeed — used in MS-DOS/Windows, also standard for most text-based Internet protocols
  • Linefeed only — used in UNIX/Linux systems
  • Carriage-return only — used in MacOS (at least before OS X).
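
As a small illustration of the list above, the same two lines of text can be written out under each convention. This is only a sketch: the name `samples` and its keys are made up for the example, and `\r` and `\n` are the usual escapes for carriage-return and linefeed.

```python
# The same two lines of text under each newline convention.
# The dictionary name and its keys are illustrative only.
samples = {
    "CR+LF (MS-DOS/Windows)": b"first\r\nsecond\r\n",
    "LF (UNIX/Linux)": b"first\nsecond\n",
    "CR (classic MacOS)": b"first\rsecond\r",
}

for name, data in samples.items():
    # bytes.splitlines() recognises all three conventions
    print(name, len(data), data.splitlines())
```

The CR+LF form is two bytes longer than the others here, one extra byte per line, while the line contents themselves are identical.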

It is quite common for users to move text files between systems and then wonder why they can’t be read properly; the cause is often a mismatch in newline conventions. The Internet Robustness Principle (also known as Postel's law) states:

Be liberal in what you accept, conservative in what you send.

This means that it is good practice, whenever you write code to read text files, to accept all three newline conventions.

In the following code, “\015” represents a carriage-return, and “\012” represents a linefeed, while “\n” is used to denote the platform-native newline character.


#include <stdio.h>
#include <stdbool.h>

typedef struct
  {
  /* context for remembering state of decoding of newlines */
    FILE * In; /* where I'm actually reading from */
    bool LastWasCR; /* initially set to false */
  } TextInputContext;

int GetCh
  (
    TextInputContext * Text
  )
  /* returns the next input character from Text->In, doing automatic
    conversion of newlines, or EOF at end of input. */
  {
    int Ch;
    for (;;)
      {
        Ch = getc(Text->In);
        if (Ch == EOF)
            break;
        if (Ch != '\012' || !Text->LastWasCR)
          {
          /* anything except the linefeed half of a CR+LF pair */
            Text->LastWasCR = Ch == '\015';
            if (Ch == '\015' || Ch == '\012')
              {
                Ch = '\n';
              } /*if*/
            break;
          } /*if*/
      /* Ch = '\012' and Text->LastWasCR => skip Ch */
        Text->LastWasCR = false;
      } /*for*/
    return Ch;
  } /*GetCh*/
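
For comparison, the same state machine can be sketched in Python as a generator. This is an illustrative rendering, not part of the original listing: the function name `decode_newlines` is made up here, and it takes any iterable of single characters rather than a file.

```python
def decode_newlines(chars):
    """Yield the characters of chars, with each CR, LF or CR+LF
    replaced by a single platform-independent newline character.

    The function name and the iterable-of-characters interface are
    choices made for this sketch.
    """
    last_was_cr = False
    for ch in chars:
        if ch == "\012" and last_was_cr:
            # linefeed half of a CR+LF pair: the newline was already
            # reported when the carriage-return was seen, so skip it
            last_was_cr = False
            continue
        last_was_cr = ch == "\015"
        if ch in ("\015", "\012"):
            ch = "\n"
        yield ch


# One input mixing CR+LF, CR alone, and two consecutive LFs.
mixed = "a\015\012b\015c\012\012d"
decoded = "".join(decode_newlines(mixed))
print(repr(decoded))
```

Remembering whether the previous character was a carriage-return is what lets a CR+LF pair count as one newline, while two consecutive linefeeds still count as two.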


Python can handle this automatically by reading text files in universal newline mode. However, some distributions do not seem to enable this in their default Python installation. In that case, you can handle the newline conversions yourself, using something like

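import re

# a carriage-return optionally followed by a linefeed, or a linefeed alone
newline = re.compile("\015\012?|\012")

# the_file is a previously-opened file object, opened so that its
# newlines have not already been translated
for line in newline.split(the_file.read()) :
    pass  # line contains next text line sans newline
#end for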
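
To see what the regular expression above produces, here is a small self-contained run on an in-memory string instead of a file. The names `sample_text` and `lines` are made up for this sketch. One detail worth noticing is that splitting text that ends with a newline leaves an empty string as the last item.

```python
import re

newline = re.compile("\015\012?|\012")

# sample_text stands in for the contents of a file; it mixes all three
# newline conventions (CR+LF, CR alone, LF alone)
sample_text = "one\015\012two\015three\012"

lines = newline.split(sample_text)
# The text ends with a newline, so split() returns a final empty string;
# drop it so that only real lines remain.
if lines and lines[-1] == "":
    lines.pop()
print(lines)
```

Whether to drop that trailing empty string is a design choice: keeping it lets the caller tell whether the last line was terminated by a newline.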