How do I read a random line from one file?

Question :

How do I read a random line from one file?

Is there a built-in method to do it? If not, how can I do this without too much overhead?

Not built-in, but algorithm `R(3.4.2)` (Waterman’s “Reservoir Algorithm”) from Knuth’s “The Art of Computer Programming” is good (in a very simplified version):

``````
import random

def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile, 2):
        if random.randrange(num):
            continue
        line = aline
    return line
``````

The `num, ... in enumerate(..., 2)` iterator produces the sequence 2, 3, 4… for `num`. `randrange(num)` is therefore 0 with a probability of `1.0/num`, and that is exactly the probability with which we must replace the currently selected line. This is the special case of sample size 1 of the referenced algorithm (see Knuth’s book for the proof of correctness), and of course we’re also in the case of a small-enough “reservoir” to fit in memory ;-)
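As a rough empirical check of the `1.0/num` argument, the sketch below runs the same selection loop many times over a four-item list and counts how often each item wins. The helper `selection_counts`, the trial count, and the seed are illustrative assumptions, and `iter()` is added only so that a plain list can stand in for a file object.

```python
import random
from collections import Counter

def random_line(lines):
    """Single-item reservoir sampling over any iterable of lines."""
    it = iter(lines)
    line = next(it)
    for num, aline in enumerate(it, 2):
        if random.randrange(num):  # nonzero with probability (num - 1) / num
            continue
        line = aline               # replaced with probability 1 / num
    return line

def selection_counts(lines, trials=40_000, seed=0):
    """Hypothetical helper: count how often each line is selected."""
    random.seed(seed)
    return Counter(random_line(lines) for _ in range(trials))

counts = selection_counts(["a\n", "b\n", "c\n", "d\n"])
for line, count in sorted(counts.items()):
    print(repr(line), count)
```

Each of the four counts should land near a quarter of the trials, which is what a uniform choice predicts.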

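``````
import random

# 'file.txt' is a placeholder path; the whole file is read into memory.
with open('file.txt') as f:
    lines = f.read().splitlines()
myline = random.choice(lines)
print(myline)
``````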

For a very long file:
seek to a random position in the file based on its length, then find the two newline characters after that position (or a newline and the end of the file); the text between them is the chosen line. If the position landed inside the last line, try again from 100 characters earlier, or from the beginning of the file if the original seek position was less than 100.

However, this is overcomplicated, as a file is an iterator. So make it a list and use `random.choice` (if you need many lines, use `random.sample`):

``````
import random
print(random.choice(list(open('file.txt'))))
``````
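For the `random.sample` case mentioned above, a minimal sketch might look like the following. The function name `random_lines` and its parameters are made up for illustration, and it assumes the whole file fits in memory.

```python
import random

def random_lines(path, k):
    """Return up to k distinct lines chosen uniformly at random.

    Hypothetical helper: reads the whole file into memory first.
    """
    with open(path) as f:
        lines = f.readlines()
    # random.sample raises ValueError if k exceeds the population size,
    # so cap k at the number of lines available.
    return random.sample(lines, min(k, len(lines)))

# Example call ('file.txt' is a placeholder path):
# print(random_lines('file.txt', 3))
```

Unlike calling `random.choice` repeatedly, `random.sample` never returns the same line position twice.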

It depends on what you mean by “too much” overhead. If storing the whole file in memory is possible, then something like

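``````
import random

# Assumed completion ('file.txt' is a placeholder): load all lines, pick one.
print(random.choice(open('file.txt').readlines()))
``````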

would do the trick.

Although I am four years late, I think I have the fastest solution. Recently I wrote a Python package called linereader, which allows you to manipulate the pointers of file handles.

Here is the simple solution to getting a random line with this package:

``````
from random import randint
from linereader import dopen

length = ...    # number of lines in the file
filename = ...  # path to the file

file = dopen(filename)
random_line = file.getline(randint(1, length))
``````

The first time this is done is the worst, as linereader has to compile the output file in a special format. After this is done, linereader can then access any line from the file quickly, whatever size the file is.
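To illustrate the general idea of paying a one-time cost so that later lookups are fast, here is a plain-Python sketch that records the byte offset of every line and then seeks straight to a random one. This is not linereader's implementation or API; the names `build_line_index` and `random_line_indexed` are hypothetical, and the file is assumed to be UTF-8 encoded and non-empty.

```python
import random

def build_line_index(path):
    """Scan the file once and return the byte offset where each line starts."""
    offsets = []
    with open(path, 'rb') as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def random_line_indexed(path, offsets):
    """Jump directly to a randomly chosen line using the prebuilt index."""
    with open(path, 'rb') as f:
        f.seek(random.choice(offsets))  # IndexError if the file had no lines
        return f.readline().decode('utf-8')  # assumes UTF-8 text

# Example call ('file.txt' is a placeholder path):
# offsets = build_line_index('file.txt')
# print(random_line_indexed('file.txt', offsets))
```

Building the index is `O(n)` once; each later lookup only needs a seek and a single `readline`, and every line is equally likely.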

If your file is very small (small enough to fit into an MB), then you can replace `dopen` with `copen`, and it makes a cached entry of the file within memory. Not only is this faster, but you get the number of lines within the file as it is loaded into memory; it is done for you. All you need to do is to generate the random line number. Here is some example code for this.

``````
from random import randint
from linereader import copen

filename = ...  # path to the file

file = copen(filename)
lines = file.count('\n')
random_line = file.getline(randint(1, lines))
``````

I just got really happy because I saw someone who could benefit from my package! Sorry for the dead answer, but the package could definitely be applied to many other problems.

If you don’t want to read over the entire file, you can seek to a random position in the file, then seek backwards for the newline, and call `readline`.

One disadvantage of this method is that short lines have a lower likelihood of showing up.

Here is a Python 3 script which does just this:

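``````
def read_random_line(f, chunk_size=16):
    import os
    import random
    with open(f, 'rb') as f_handle:
        f_handle.seek(0, os.SEEK_END)
        size = f_handle.tell()
        # Pick a byte inside the file; max() keeps an empty file from raising.
        i = random.randrange(max(size, 1))
        while True:
            # Step backwards one chunk at a time, clamping at the file start.
            i -= chunk_size
            if i < 0:
                chunk_size += i
                i = 0
            f_handle.seek(i, os.SEEK_SET)
            chunk = f_handle.read(chunk_size)
            i_newline = chunk.rfind(b'\n')
            if i_newline != -1:
                i += i_newline + 1  # first byte after the newline
                break
            if i == 0:
                break
        f_handle.seek(i, os.SEEK_SET)
        return f_handle.readline()  # bytes, since the file is opened in 'rb'
``````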

A slightly improved version of Alex Martelli’s answer, which handles empty files (by returning a `default` value):

``````
from random import randrange

def random_line(afile, default=None):
    line = default
    for i, aline in enumerate(afile, start=1):
        if randrange(i) == 0:  # random int [0..i)
            line = aline
    return line
``````

This approach can be used to get a random item from any iterator using `O(n)` time and `O(1)` space.
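As a small illustration of the "any iterator" claim, the same loop can be applied to a generator that is never materialized as a list. The name `random_item` is just a renamed copy of the function above, chosen for this example.

```python
from random import randrange

def random_item(iterable, default=None):
    """Pick one item uniformly at random in a single pass, O(1) extra space."""
    item = default
    for i, candidate in enumerate(iterable, start=1):
        if randrange(i) == 0:  # true with probability 1 / i
            item = candidate
    return item

# Works on a generator, which can only be traversed once.
squares = (n * n for n in range(1, 6))
print(random_item(squares))

# An empty iterator yields the default instead of raising.
print(random_item(iter([]), default="empty"))
```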

Seek to a random position, read a line and discard it, then read another line. The distribution of lines won’t be uniform, but that doesn’t always matter.

If you don’t want to load the whole file into RAM with `f.read()` or `f.readlines()`, you can get a random line this way:

``````
import os
import random

def get_random_line(filepath: str) -> str:
    file_size = os.path.getsize(filepath)
    with open(filepath, 'rb') as f:
        while True:
            pos = random.randint(0, file_size)
            if not pos:  # the first line is chosen