How to read a file in reverse order?

How to read a file in reverse order using python? I want to read a file from last line to first line.

Asked By: Nimmy

Answer #1:

for line in reversed(open("filename").readlines()):
    print line.rstrip()

And in Python 3:

for line in reversed(list(open("filename"))):
    print(line.rstrip())
Answered By: Matt Joiner

Answer #2:

A correct, efficient answer written as a generator.

import os
def reverse_readline(filename, buf_size=8192):
    """A generator that returns the lines of a file in reverse order"""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # The first line of the buffer is probably not a complete line so
            # we'll save it and append it to the last line of the next buffer
            # we read
            if segment is not None:
                # If the previous chunk starts right from the beginning of line
                # do not concat the segment to the last line of new chunk.
                # Instead, yield the segment first
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty
        if segment is not None:
            yield segment
Answered By: srohde

Answer #3:

You can also use the Python module file_read_backwards.

After installing it, via pip install file_read_backwards (v1.2.1), you can read the entire file backwards (line-wise) in a memory efficient manner via:

#!/usr/bin/env python2.7
from file_read_backwards import FileReadBackwards
with FileReadBackwards("/path/to/file", encoding="utf-8") as frb:
    for l in frb:
         print l

It supports "utf-8", "latin-1", and "ascii" encodings.

Support is also available for Python 3. Further documentation can be found at http://file-read-backwards.readthedocs.io/en/latest/readme.html

Answered By: user7321751

Answer #4:

How about something like this:

import os
def readlines_reverse(filename):
    with open(filename) as qfile:
        qfile.seek(0, os.SEEK_END)
        position = qfile.tell()
        line = ''
        while position >= 0:
            qfile.seek(position)
            next_char = qfile.read(1)
            if next_char == "\n":
                yield line[::-1]
                line = ''
            else:
                line += next_char
            position -= 1
        yield line[::-1]
if __name__ == '__main__':
    for qline in readlines_reverse(raw_input()):
        print qline

Since the file is read character by character in reverse order, it will work even on very large files, as long as individual lines fit into memory.
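Reading one character per call is slow, and in Python 3 seeking to arbitrary offsets in a text-mode file is not guaranteed to work. The following is a minimal sketch of a chunked variant, not part of the answer above: it assumes the file is opened in binary mode, that lines end with "\n" (a "\r" from Windows line endings would be left in place), and that the hypothetical `reverse_lines` name and its `encoding` parameter suit your case. Lines are decoded only after splitting, so a multi-byte character is never cut in half.

```python
import os
import tempfile


def reverse_lines(path, buf_size=8192, encoding="utf-8"):
    """Yield the lines of a file from last to first, without line endings."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        position = fh.tell()
        if position == 0:
            return  # empty file: nothing to yield
        tail = b""  # start of a line whose beginning has not been read yet
        first_chunk = True
        while position > 0:
            read_size = min(buf_size, position)
            position -= read_size
            fh.seek(position)
            chunk = fh.read(read_size) + tail
            if first_chunk:
                first_chunk = False
                # Ignore the newline that terminates the last line.
                if chunk.endswith(b"\n"):
                    chunk = chunk[:-1]
            lines = chunk.split(b"\n")
            tail = lines[0]  # possibly incomplete, keep for the next chunk
            for line in reversed(lines[1:]):
                yield line.decode(encoding)
        yield tail.decode(encoding)


# Usage on a small temporary file (hypothetical sample data).
with tempfile.NamedTemporaryFile("wb", delete=False) as sample:
    sample.write(b"first\nsecond\nthird\n")
for line in reverse_lines(sample.name):
    print(line)
os.remove(sample.name)
```

Memory use is bounded by the buffer size plus the longest line, since an unfinished line is carried over in `tail` until its start is found.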

Answered By: Berislav Lopac

Answer #5:

for line in reversed(open("file").readlines()):
    print line.rstrip()

If you are on Linux, you can use the tac command.

$ tac file

2 recipes you can find in ActiveState here and here

Answered By: ghostdog74

Answer #6:

import os
import re
import sys
def filerev(somefile, buffer=0x20000):
  somefile.seek(0, os.SEEK_END)
  size = somefile.tell()
  lines = ['']
  rem = size % buffer
  pos = max(0, (size // buffer - 1) * buffer)
  while pos >= 0:
    somefile.seek(pos, os.SEEK_SET)
    data = somefile.read(rem + buffer) + lines[0]
    rem = 0
    lines = re.findall('[^\n]*\n?', data)
    ix = len(lines) - 2
    while ix > 0:
      yield lines[ix]
      ix -= 1
    pos -= buffer
  else:
    yield lines[0]
with open(sys.argv[1], 'r') as f:
  for line in filerev(f):
    sys.stdout.write(line)

Answer #7:

Accepted answer won’t work for cases with large files that won’t fit in memory (which is not a rare case).

As others have noted, @srohde’s answer looks good, but it has the following issues:

  • opening the file looks redundant when we can pass a file object and leave it to the user to decide which encoding it should be read in,
  • even if we refactor it to accept a file object, it won’t work for all encodings: we can choose a file with utf-8 encoding and non-ASCII contents like

    ?
    

    pass buf_size equal to 1, and we will get

    UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: invalid start byte
    

    of course the text may be larger, but buf_size may happen to be chosen so that it leads to an obscure error like the one above,

  • we can’t specify custom line separator,
  • we can’t choose to keep line separator.
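
The decoding failure from the second point can be reproduced without any file. This is a minimal sketch, assuming the sample character "й" (chosen here for illustration; its UTF-8 encoding is the two bytes 0xD0 0xB9) and a hypothetical helper `try_decode`:

```python
# Hypothetical sample text; any character encoded as several bytes would do.
text = "й"
data = text.encode("utf-8")  # two bytes: 0xD0 0xB9


def try_decode(chunk):
    """Return the decoded text, or a short error description on failure."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError as error:
        return "error: {}".format(error.reason)


# Reading one byte at a time from the end yields the continuation byte
# first, which cannot be decoded on its own.
print(try_decode(data[-1:]))
# Decoding both bytes together succeeds.
print(try_decode(data))
```

Any buffer boundary that falls inside a multi-byte character triggers the same error, which is why the solution below splits lines on the byte stream and decodes afterwards.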

So considering all these concerns I’ve written separate functions:

  • one which works with byte streams,
  • second one which works with text streams and delegates its underlying byte stream to the first one and decodes resulting lines.

First of all, let’s define the following utility functions:

ceil_division for making division with ceiling (in contrast with standard // division with floor, more info can be found in this thread)

def ceil_division(left_number, right_number):
    """
    Divides given numbers with ceiling.
    """
    return -(-left_number // right_number)
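
As a quick illustration of why the ceiling matters here (the numbers are made up for the example): a stream of 20 bytes read in batches of 8 needs three reads, while floor division would report two.

```python
def ceil_division(left_number, right_number):
    """Divides given numbers with ceiling (same definition as above)."""
    return -(-left_number // right_number)


# 20 bytes in batches of 8: two full batches plus one partial batch.
print(20 // 8)               # floor division: 2
print(ceil_division(20, 8))  # ceiling division: 3
# An exact multiple needs no extra batch, and an empty stream needs none.
print(ceil_division(16, 8))  # 2
print(ceil_division(0, 8))   # 0
```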

split for splitting string by given separator from right end with ability to keep it:

def split(string, separator, keep_separator):
    """
    Splits given string by given separator.
    """
    parts = string.split(separator)
    if keep_separator:
        *parts, last_part = parts
        parts = [part + separator for part in parts]
        if last_part:
            return parts + [last_part]
    return parts
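
A short usage sketch of this helper, with sample strings invented for the example, shows how keep_separator changes the result and that the same function works on bytes:

```python
def split(string, separator, keep_separator):
    """Splits given string by given separator (same definition as above)."""
    parts = string.split(separator)
    if keep_separator:
        *parts, last_part = parts
        parts = [part + separator for part in parts]
        if last_part:
            return parts + [last_part]
    return parts


# With keep_separator=True every complete line keeps its terminator.
print(split("a\nb\nc", "\n", True))    # ['a\n', 'b\n', 'c']
print(split("a\nb\n", "\n", True))     # ['a\n', 'b\n']
# Without it, a trailing separator leaves an empty last part.
print(split("a\nb\n", "\n", False))    # ['a', 'b', '']
# Byte strings are handled the same way.
print(split(b"a\nb", b"\n", True))     # [b'a\n', b'b']
```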

read_batch_from_end for reading a batch from the right end of a binary stream:

def read_batch_from_end(byte_stream, size, end_position):
    """
    Reads batch from the end of given byte stream.
    """
    if end_position > size:
        offset = end_position - size
    else:
        offset = 0
        size = end_position
    byte_stream.seek(offset)
    return byte_stream.read(size)
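
To see how the batches walk backwards, here is a small sketch using an in-memory io.BytesIO stream as a stand-in for a real file (the ten-byte content is invented for the example):

```python
import io


def read_batch_from_end(byte_stream, size, end_position):
    """Reads batch from the end of given byte stream (same as above)."""
    if end_position > size:
        offset = end_position - size
    else:
        offset = 0
        size = end_position
    byte_stream.seek(offset)
    return byte_stream.read(size)


stream = io.BytesIO(b"0123456789")
# Each call reads the `size` bytes that end at `end_position`.
print(read_batch_from_end(stream, size=4, end_position=10))  # b'6789'
print(read_batch_from_end(stream, size=4, end_position=6))   # b'2345'
# Near the start of the stream the batch is shortened instead of failing.
print(read_batch_from_end(stream, size=4, end_position=2))   # b'01'
```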

After that we can define function for reading byte stream in reverse order like

import functools
import itertools
import os
from operator import methodcaller, sub
def reverse_binary_stream(byte_stream, batch_size=None,
                          lines_separator=None,
                          keep_lines_separator=True):
    if lines_separator is None:
        lines_separator = (b'\r', b'\n', b'\r\n')
        lines_splitter = methodcaller(str.splitlines.__name__,
                                      keep_lines_separator)
    else:
        lines_splitter = functools.partial(split,
                                           separator=lines_separator,
                                           keep_separator=keep_lines_separator)
    stream_size = byte_stream.seek(0, os.SEEK_END)
    if batch_size is None:
        batch_size = stream_size or 1
    batches_count = ceil_division(stream_size, batch_size)
    remaining_bytes_indicator = itertools.islice(
            itertools.accumulate(itertools.chain([stream_size],
                                                 itertools.repeat(batch_size)),
                                 sub),
            batches_count)
    try:
        remaining_bytes_count = next(remaining_bytes_indicator)
    except StopIteration:
        return
    def read_batch(position):
        result = read_batch_from_end(byte_stream,
                                     size=batch_size,
                                     end_position=position)
        while result.startswith(lines_separator):
            try:
                position = next(remaining_bytes_indicator)
            except StopIteration:
                break
            result = (read_batch_from_end(byte_stream,
                                          size=batch_size,
                                          end_position=position)
                      + result)
        return result
    batch = read_batch(remaining_bytes_count)
    segment, *lines = lines_splitter(batch)
    yield from reversed(lines)
    for remaining_bytes_count in remaining_bytes_indicator:
        batch = read_batch(remaining_bytes_count)
        lines = lines_splitter(batch)
        if batch.endswith(lines_separator):
            yield segment
        else:
            lines[-1] += segment
        segment, *lines = lines
        yield from reversed(lines)
    yield segment

and finally a function for reversing text file can be defined like:

import codecs
def reverse_file(file, batch_size=None,
                 lines_separator=None,
                 keep_lines_separator=True):
    encoding = file.encoding
    if lines_separator is not None:
        lines_separator = lines_separator.encode(encoding)
    yield from map(functools.partial(codecs.decode,
                                     encoding=encoding),
                   reverse_binary_stream(
                           file.buffer,
                           batch_size=batch_size,
                           lines_separator=lines_separator,
                           keep_lines_separator=keep_lines_separator))

Tests

Preparations

I’ve generated 4 files using fsutil command:

  1. empty.txt with no contents, size 0MB
  2. tiny.txt with size of 1MB
  3. small.txt with size of 10MB
  4. large.txt with size of 50MB

Also, I’ve refactored @srohde’s solution to work with a file object instead of a file path.

Test script

from timeit import Timer
repeats_count = 7
number = 1
create_setup = ('from collections import deque\n'
                'from __main__ import reverse_file, reverse_readline\n'
                'file = open("{}")').format
srohde_solution = ('with file:\n'
                   '    deque(reverse_readline(file,\n'
                   '                           buf_size=8192),'
                   '          maxlen=0)')
azat_ibrakov_solution = ('with file:\n'
                         '    deque(reverse_file(file,\n'
                         '                       lines_separator="\\n",\n'
                         '                       keep_lines_separator=False,\n'
                         '                       batch_size=8192), maxlen=0)')
print('reversing empty file by "srohde"',
      min(Timer(srohde_solution,
                create_setup('empty.txt')).repeat(repeats_count, number)))
print('reversing empty file by "Azat Ibrakov"',
      min(Timer(azat_ibrakov_solution,
                create_setup('empty.txt')).repeat(repeats_count, number)))
print('reversing tiny file (1MB) by "srohde"',
      min(Timer(srohde_solution,
                create_setup('tiny.txt')).repeat(repeats_count, number)))
print('reversing tiny file (1MB) by "Azat Ibrakov"',
      min(Timer(azat_ibrakov_solution,
                create_setup('tiny.txt')).repeat(repeats_count, number)))
print('reversing small file (10MB) by "srohde"',
      min(Timer(srohde_solution,
                create_setup('small.txt')).repeat(repeats_count, number)))
print('reversing small file (10MB) by "Azat Ibrakov"',
      min(Timer(azat_ibrakov_solution,
                create_setup('small.txt')).repeat(repeats_count, number)))
print('reversing large file (50MB) by "srohde"',
      min(Timer(srohde_solution,
                create_setup('large.txt')).repeat(repeats_count, number)))
print('reversing large file (50MB) by "Azat Ibrakov"',
      min(Timer(azat_ibrakov_solution,
                create_setup('large.txt')).repeat(repeats_count, number)))

Note: I’ve used collections.deque class to exhaust generator.
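
For readers unfamiliar with this idiom, the sketch below shows it in isolation: a deque with maxlen=0 iterates over everything it is given but stores nothing, so the generator runs to completion without building a list. The `countdown` generator and its `log` argument are hypothetical names used only for this example.

```python
from collections import deque


def countdown(start, log):
    """Hypothetical generator that records each value it produces."""
    while start > 0:
        log.append(start)
        yield start
        start -= 1


log = []
generator = countdown(3, log)
consumed = deque(generator, maxlen=0)  # iterates everything, stores nothing
print(len(consumed))          # 0
print(log)                    # [3, 2, 1]
print(next(generator, None))  # None: the generator is exhausted
```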

Outputs

For PyPy 3.5 on Windows 10:

reversing empty file by "srohde" 8.31e-05
reversing empty file by "Azat Ibrakov" 0.00016090000000000028
reversing tiny file (1MB) by "srohde" 0.160081
reversing tiny file (1MB) by "Azat Ibrakov" 0.09594989999999998
reversing small file (10MB) by "srohde" 8.8891863
reversing small file (10MB) by "Azat Ibrakov" 5.323388100000001
reversing large file (50MB) by "srohde" 186.5338368
reversing large file (50MB) by "Azat Ibrakov" 99.07450229999998

For CPython 3.5 on Windows 10:

reversing empty file by "srohde" 3.600000000000001e-05
reversing empty file by "Azat Ibrakov" 4.519999999999958e-05
reversing tiny file (1MB) by "srohde" 0.01965560000000001
reversing tiny file (1MB) by "Azat Ibrakov" 0.019207699999999994
reversing small file (10MB) by "srohde" 3.1341862999999996
reversing small file (10MB) by "Azat Ibrakov" 3.0872588000000007
reversing large file (50MB) by "srohde" 82.01206720000002
reversing large file (50MB) by "Azat Ibrakov" 82.16775059999998

So as we can see, it performs like the original solution, but is more general and free of its disadvantages listed above.


Advertisement

I’ve added this to the 0.3.0 version of the lz package (requires Python 3.5+), which has many well-tested functional/iterating utilities.

Can be used like

 import io
 from lz.iterating import reverse
 ...
 with open('path/to/file') as file:
     for line in reverse(file, batch_size=io.DEFAULT_BUFFER_SIZE):
         print(line)

It supports all standard encodings (maybe except utf-7 since it is hard for me to define a strategy for generating strings encodable with it).

Answered By: Azat Ibrakov

Answer #8:

Here is my implementation. You can limit the RAM usage by changing the “buffer” variable. There is a known bug: the program prints an empty line at the beginning.

RAM usage may also increase if there is no newline for more than buffer bytes: the “line_leak” variable will grow until a newline ("\n") is seen.

This also works for 16 GB files, which are bigger than my total memory.

import os,sys
buffer = 1024*1024 # 1MB
f = open(sys.argv[1])
f.seek(0, os.SEEK_END)
filesize = f.tell()
division, remainder = divmod(filesize, buffer)
line_leak=''
for chunk_counter in range(1,division + 2):
    if division - chunk_counter < 0:
        f.seek(0, os.SEEK_SET)
        chunk = f.read(remainder)
    elif division - chunk_counter >= 0:
        f.seek(-(buffer*chunk_counter), os.SEEK_END)
        chunk = f.read(buffer)
    chunk_lines_reversed = list(reversed(chunk.split('\n')))
    if line_leak: # add line_leak from previous chunk to beginning
        chunk_lines_reversed[0] += line_leak
    # after reversed, save the leakedline for next chunk iteration
    line_leak = chunk_lines_reversed.pop()
    if chunk_lines_reversed:
        print "\n".join(chunk_lines_reversed)
    # print the last leaked line
    if division - chunk_counter < 0:
        print line_leak
Answered By: Bekir Dogan
