How can I parse free-text time intervals in Python, ranging from years to seconds?

### Question :

I would like to parse free-text time intervals like the following, using Python:

• 1 second
• 2 minutes
• 3 hours
• 4 days
• 5 weeks
• 6 months
• 7 years

Is there a painless way to do this, ideally by simply calling a library function?

I have tried:

• `dateutil.parser.parse()`, which understands seconds through hours but not days or more.
• `mx.DateTime.DateTimeDeltaFrom()`, which understands through days but fails on weeks or higher, and silently (e.g., it might create an interval of length 0, or parse “2 months” as 2 minutes).

This one is new to me, but based on some googling, have you tried whoosh?

Edit: There’s also parsedatetime:

``````#!/usr/bin/env python
from datetime import datetime
import parsedatetime as pdt  # $ pip install parsedatetime

cal = pdt.Calendar()
for time_str in ['1 second', '2 minutes', '3 hours', '5 weeks', '6 months', '7 years']:
    # parseDT() returns a (datetime, status) tuple; keep only the datetime
    diff = cal.parseDT(time_str, sourceTime=datetime.min)[0] - datetime.min
    print("{time_str:<10} -> {diff!s:>20} <{diff!r}>".format(**vars()))
``````

### Output

``````1 second   ->              0:00:01 <datetime.timedelta(0, 1)>
2 minutes  ->              0:02:00 <datetime.timedelta(0, 120)>
3 hours    ->              3:00:00 <datetime.timedelta(0, 10800)>
5 weeks    ->     35 days, 0:00:00 <datetime.timedelta(35)>
6 months   ->    181 days, 0:00:00 <datetime.timedelta(181)>
7 years    ->   2556 days, 0:00:00 <datetime.timedelta(2556)>
``````
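The 181 and 2556 day figures are not fixed lengths for "6 months" and "7 years"; they follow from measuring against `datetime.min` (1 January of year 1). A minimal standard-library sketch, assuming `parsedatetime` moved the calendar date forward by the requested number of months or years, reproduces the same numbers:

```python
from datetime import datetime

start = datetime.min  # 0001-01-01 00:00:00, the sourceTime used above

# Six calendar months after 1 January of year 1 is 1 July of year 1.
six_months = datetime(1, 7, 1) - start
# Seven calendar years later is 1 January of year 8; year 4 is a leap year.
seven_years = datetime(8, 1, 1) - start

print(six_months.days)   # 181
print(seven_years.days)  # 2556
```

With a different `sourceTime`, the same strings can therefore yield slightly different day counts.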

How about the `pytimeparse` library?

It returns the time as a number of seconds:

``````>>> from pytimeparse.timeparse import timeparse
>>> timeparse('33m')
1980
>>> timeparse('2h33m')
9180
>>> timeparse('4:17')
257
>>> timeparse('5hr34m56s')
20096
>>> timeparse('1.2 minutes')
72
``````

The source seems to be here: https://github.com/wroberts/pytimeparse
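Because `timeparse` returns plain seconds, the result can be wrapped in a `datetime.timedelta` when an interval object is needed. The sketch below uses only the standard library and hard-codes second counts copied from the session above instead of calling `pytimeparse`, so the `parsed_seconds` mapping is purely illustrative:

```python
from datetime import timedelta

# Second counts copied from the timeparse() session above (illustrative data).
parsed_seconds = {"33m": 1980, "2h33m": 9180, "4:17": 257, "1.2 minutes": 72}

# Wrap each plain number of seconds in a timedelta.
intervals = {text: timedelta(seconds=s) for text, s in parsed_seconds.items()}
for text, delta in intervals.items():
    print(f"{text:<12} -> {delta}")
```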

Not a solution because `dateutil` can parse points in time, but not intervals

`dateutil` now supports all of the original requested intervals:

``````from dateutil.parser import parse

examples = """
August 3rd, 2019
2019-08-03
2019, 3rd aug, 2:45 pm
"""

formatted_examples = [
    (example, f"{(p := parse(example))} <{p!r}>")
    for example in filter(None, examples.splitlines())
]
# column widths: tup[0] is the input text, tup[1] the formatted result
longest_example = max(map(lambda tup: len(tup[0]), formatted_examples))
longest_parsed = max(map(lambda tup: len(tup[1]), formatted_examples))

for example, parsed_example in formatted_examples:
    print(f"{example: <{longest_example}s} -> {parsed_example: >{longest_parsed}s}")
``````

On PyPI, the package is called `python-dateutil`.

## Parsing

We can write a parser. It doesn’t make a huge difference which parsing library is used. I searched for “python parser” and chose `lark` because it appeared near the top of the results.

First, I defined the units as a mapping. This is where more units could be added, if “centuries” or “microseconds” are needed.

Note: For very small or very large numbers, keep `timedelta.resolution` in mind.

``````units = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour":   timedelta(hours=1),
    "day":    timedelta(days=1),
    "week":   timedelta(weeks=1),
    "month":  timedelta(days=30),
    "year":   timedelta(days=365),
}
``````
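To illustrate the note about `timedelta.resolution`: the smallest representable step is one microsecond, so scaling a unit by a very small factor rounds to zero, while a very large factor overflows. A short sketch using only the standard library (the specific factors are arbitrary choices):

```python
from datetime import timedelta

print(timedelta.resolution)  # 0:00:00.000001

# 0.1 microseconds is below the resolution and rounds to zero.
tiny = timedelta(seconds=1) * 1e-7
print(tiny == timedelta(0))  # True

# A timedelta cannot hold more than 999999999 days.
try:
    timedelta(days=365) * 1e10
    overflowed = False
except OverflowError as exc:
    overflowed = True
    print("too large:", exc)
```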

Next, the grammar is defined using `lark`’s variant of EBNF. Here, `WS` hopefully matches all whitespace:

``````time_interval_grammar = r"""
%import common.WS
%import common.NUMBER

?interval: time+
time: value unit _separator?
value: NUMBER -> number
unit: SECOND
    | MINUTE
    | HOUR
    | DAY
    | WEEK
    | MONTH
    | YEAR
_separator: (WS | ",")+

SECOND: /s\w*/i
MINUTE: /mi\w*/i
HOUR:   /h\w*/i
DAY:    /d\w*/i
WEEK:   /w\w*/i
MONTH:  /mo\w*/i
YEAR:   /y\w*/i

%ignore WS
%ignore ","
"""
``````

The grammar should allow arbitrary time intervals to be chained together, with or without commas as separators.

Each time interval’s unit can be given as the shortest unique prefix:

``````second -> s
minute -> mi
hour   -> h
day    -> d
week   -> w
month  -> mo
year   -> y
``````
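These prefixes can be derived mechanically. A minimal sketch, where `shortest_unique_prefix` is a hypothetical helper written for illustration and not part of `lark`:

```python
from typing import Dict, List


def shortest_unique_prefix(name: str, names: List[str]) -> str:
    """Return the shortest prefix of name that no other entry in names starts with."""
    others = [other for other in names if other != name]
    for length in range(1, len(name) + 1):
        prefix = name[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return name  # name is itself a prefix of another entry


unit_names = ["second", "minute", "hour", "day", "week", "month", "year"]
prefixes: Dict[str, str] = {n: shortest_unique_prefix(n, unit_names) for n in unit_names}
print(prefixes)
```

Only `minute` and `month` need two letters, because they share the leading `m`.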

Including the ones in the original question, these will serve as the target examples we want to parse:

``````1 second
2 minutes
3 hours
4 days
5 weeks
6 months
7 years

1 month, 7 years, 2 days, 30 hours, 0.05 seconds
0.0003 years, 100000 seconds
3y 4mo 9min 6d
1mo,3d 1.3e2 hours, 0.04yrs 2mi444
``````

Lastly, I followed one of the `lark` tutorials and used a transformer:

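``````class IntervalToTimedelta(Transformer):
    # lark calls each method with the list of children of the matching rule.
    # The methods are static because the parser is given the class itself.
    @staticmethod
    def interval(tree: List[timedelta]) -> timedelta:
        "sums all timedeltas"
        return reduce(add, tree)

    @staticmethod
    def time(tree: List[Union[float, timedelta]]) -> timedelta:
        "returns a timedelta representing the value multiplied by its unit"
        return mul(*tree)

    @staticmethod
    def unit(tokens: List[Token]) -> timedelta:
        """
        converts a unit into a timedelta that represents 1 of the unit type
        """
        return units[tokens[0].type.lower()]

    @staticmethod
    def number(tokens: List[Token]) -> float:
        "returns the value as a python type"
        return float(tokens[0].value)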
``````
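The arithmetic the transformer performs can be tried without `lark`. This sketch assumes the parser has already produced `(number, unit)` pairs for `1 month, 0.05 weeks`; the `pairs` list is hand-written for illustration:

```python
from datetime import timedelta
from functools import reduce
from operator import add, mul

# Hand-written stand-in for what the grammar would yield for "1 month, 0.05 weeks".
pairs = [(1.0, timedelta(days=30)), (0.05, timedelta(weeks=1))]

# time: multiply each value by its unit; interval: sum the results.
terms = [mul(number, unit) for number, unit in pairs]
total = reduce(add, terms)
print(total)  # 30 days, 8:24:00
```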

The grammar is interpreted by `lark.Lark`. Since it is compatible with
`lark`’s LALR(1) parser, that parser is specified to gain some speed and
improve memory efficiency by allowing the transformer to be used directly by
the parser:

``````time_interval_parser = Lark(
    grammar=time_interval_grammar,
    start="interval",
    parser="lalr",
    transformer=IntervalToTimedelta,
)
``````

This produces a mostly working parser. The complete `answer.py` file is this:

``````"""
Example parsing date and time interval with lark
"""
from datetime import timedelta
from functools import reduce
from operator import add, mul
from typing import List, Union

from lark import Lark, Token, Transformer

__all__ = [
"examples",
"IntervalToTimedelta",
"parse",
]

examples = list(
filter(
None,
"""
1 second
2 minutes
3 hours
4 days
5 weeks
6 months
7 years

1 month, 0.05 weeks
0.003y, 100000secs
3y 4mo 9min 6d
1mo,3d 1.3e2 hours,
0.04yrs 2miasdf
""".splitlines(),
)
)

units = {
    "second": timedelta(seconds=1),
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}

time_interval_grammar = r"""
%import common.WS
%import common.NUMBER

?interval: time+
time: value unit _separator?
value: NUMBER -> number
unit: SECOND
    | MINUTE
    | HOUR
    | DAY
    | WEEK
    | MONTH
    | YEAR
_separator: (WS | ",")+

SECOND: /s\w*/i
MINUTE: /mi\w*/i
HOUR:   /h\w*/i
DAY:    /d\w*/i
WEEK:   /w\w*/i
MONTH:  /mo\w*/i
YEAR:   /y\w*/i

%ignore WS
%ignore ","
"""

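class IntervalToTimedelta(Transformer):
    # lark calls each method with the list of children of the matching rule.
    # The methods are static because the parser is given the class itself.
    @staticmethod
    def interval(tree: List[timedelta]) -> timedelta:
        "sums all timedeltas"
        return reduce(add, tree)

    @staticmethod
    def time(tree: List[Union[float, timedelta]]) -> timedelta:
        "returns a timedelta representing the value multiplied by its unit"
        return mul(*tree)

    @staticmethod
    def unit(tokens: List[Token]) -> timedelta:
        """
        converts a unit into a timedelta that represents 1 of the unit type
        """
        return units[tokens[0].type.lower()]

    @staticmethod
    def number(tokens: List[Token]) -> float:
        "returns the value as a python type"
        return float(tokens[0].value)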

time_interval_parser = Lark(
    grammar=time_interval_grammar,
    start="interval",
    parser="lalr",
    transformer=IntervalToTimedelta,
)

parse = time_interval_parser.parse

if __name__ == "__main__":
    parsed_examples = [(example, parse(example)) for example in examples]
    # column widths: tup[0] is the input text, tup[1] the parsed timedelta
    longest_example = max(map(lambda tup: len(tup[0]), parsed_examples))
    longest_formatted = max(map(lambda tup: len(f"{tup[1]!s}"), parsed_examples))
    longest_parsed = max(map(lambda tup: len(f"<{tup[1]!r}>"), parsed_examples))
    for example, parsed_example in parsed_examples:
        print(
            f"{example: <{longest_example}s} -> "
            f"{parsed_example!s: <{longest_formatted}s} "
            f"{'<' + repr(parsed_example) + '>': >{longest_parsed}s}"
        )

``````

Running it parses each of the examples:

``````$ python answer.py
1 second            -> 0:00:01                         <datetime.timedelta(seconds=1)>
2 minutes           -> 0:02:00                       <datetime.timedelta(seconds=120)>
3 hours             -> 3:00:00                     <datetime.timedelta(seconds=10800)>
4 days              -> 4 days, 0:00:00                    <datetime.timedelta(days=4)>
5 weeks             -> 35 days, 0:00:00                  <datetime.timedelta(days=35)>
6 months            -> 180 days, 0:00:00                <datetime.timedelta(days=180)>
7 years             -> 2555 days, 0:00:00              <datetime.timedelta(days=2555)>
1 month, 0.05 weeks -> 30 days, 8:24:00   <datetime.timedelta(days=30, seconds=30240)>
0.003y, 100000secs  -> 2 days, 6:03:28     <datetime.timedelta(days=2, seconds=21808)>
3y 4mo 9min 6d      -> 1221 days, 0:09:00 <datetime.timedelta(days=1221, seconds=540)>
1mo,3d 1.3e2 hours, -> 38 days, 10:00:00  <datetime.timedelta(days=38, seconds=36000)>
0.04yrs 2miasdf     -> 14 days, 14:26:00  <datetime.timedelta(days=14, seconds=51960)>
``````

This works fine, and the performance is adequate:

``````$ python -m timeit -s "from answer import parse, examples" "for example in examples:" " parse(example)"
500 loops, best of 5: 415 usec per loop
``````

### Potential improvements

Currently, this does not have any error handling, though this is by omission:
`lark` does raise errors, so the `parse()` function could catch any that can be
handled gracefully.
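
One way to add that error handling is a thin wrapper around `parse()`. The sketch below is parser-agnostic: `safe_parse` and `fake_parse` are hypothetical names used for illustration, and with `lark` one would presumably pass `lark.exceptions.LarkError` as the exception type to catch:

```python
from datetime import timedelta
from typing import Callable, Optional, Tuple, Type


def safe_parse(
    parse: Callable[[str], timedelta],
    text: str,
    errors: Tuple[Type[Exception], ...] = (ValueError,),
) -> Optional[timedelta]:
    """Return the parsed interval, or None when parsing raises one of errors."""
    try:
        return parse(text)
    except errors:
        return None


def fake_parse(text: str) -> timedelta:
    """Stand-in parser for the demo: accepts only '<int> days'."""
    number, unit = text.split()
    if unit != "days":
        raise ValueError(f"unsupported unit: {unit}")
    return timedelta(days=int(number))


print(safe_parse(fake_parse, "4 days"))     # 4 days, 0:00:00
print(safe_parse(fake_parse, "4 parsecs"))  # None
```

Returning `None` is only one option; re-raising a single application-specific exception would work just as well.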

Some other downsides to this particular implementation:

## Regular Expressions

Alternatively, instead of using a parsing library, regular expressions can be used with the built-in `re` module. This approach has some downsides:

• Regular expressions are challenging to make flexible
• Complex regular expressions are difficult to read
• Regular expressions generally take longer for a human to interpret

It can be faster, though, and should only need the standard library included in CPython.
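
The readability cost can be reduced somewhat with `re.VERBOSE`, which permits whitespace and comments inside a pattern. A minimal sketch follows; the `TERM` pattern is a simplified illustration without exponent support, not the pattern used in the code of this answer:

```python
import re

# One "<number><unit>" term of an interval.
TERM = re.compile(
    r"""
    (?P<number>\d+(?:\.\d*)?|\.\d+)  # integer or decimal, e.g. 3, 0.05, .5
    \s*                              # optional whitespace
    (?P<unit>[a-z]+)                 # unit word or abbreviation
    """,
    re.VERBOSE | re.IGNORECASE,
)

found = TERM.findall("3y 4mo 9min 6d")
print(found)
# [('3', 'y'), ('4', 'mo'), ('9', 'min'), ('6', 'd')]
```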

Using the previous example as a starting point, this is one way regular expressions could be swapped in:

``````"""
Example parsing date and time interval with re
"""
import re
from datetime import timedelta
from functools import reduce
from operator import add, mul
from typing import List, Tuple

__all__ = [
"examples",
"parse",
]

examples = list(
filter(
None,
"""
1 second
2 minutes
3 hours
4 days
5 weeks
6 months
7 years

1 month, 0.05 weeks
0.003y, 100000secs
3y 4mo 9min 6d
1mo,3d 1.3e2 hours,
0.04yrs 2miasdf
""".splitlines(),
)
)

comma = ","
ws = r"\s"
separator = fr"[{ws}{comma}]+"

def unit_name(string: str) -> re.Pattern:
    # match the given prefix followed by any further word characters
    return re.compile(fr"{string}\w*")

second = unit_name("s")
minute = unit_name("mi")
hour = unit_name("h")
day = unit_name("d")
week = unit_name("w")
month = unit_name("mo")
year = unit_name("y")
units = {
    second: timedelta(seconds=1),
    minute: timedelta(minutes=1),
    hour: timedelta(hours=1),
    day: timedelta(days=1),
    week: timedelta(weeks=1),
    month: timedelta(days=30),
    year: timedelta(days=365),
}
unit = re.compile(
"("
+ "|".join(
regex.pattern for regex in [second, minute, hour, day, week, month, year]
)
+ ")"
)
digit = r"\d"
integer = fr"({digit}+)"
decimal = fr"({integer}\.({integer})?|\.{integer})"
signed_integer = fr"([+-]?{integer})"
exponent = fr"([eE]{signed_integer})"
float_ = fr"({integer}{exponent}|{decimal}({exponent})?)"
number = re.compile(fr"({float_}|{integer})")
time = re.compile(fr"(?P<number>{number.pattern}){ws}*(?P<unit>{unit.pattern})")
interval = re.compile(fr"({time.pattern}({separator})*)+", flags=re.IGNORECASE)

def normalize_unit(text: str) -> timedelta:
    "maps units to their respective timedelta"
    if not unit.match(text):
        raise ValueError(f"Not a unit: {text}")

    for unit_re in units:
        if unit_re.match(text):
            return units[unit_re]

    raise ValueError(f"No matching unit found: {text}")

def parse(text: str) -> timedelta:
    if not interval.match(text):
        raise ValueError(f"Parser Error: {text}")

    parsed_pairs: List[Tuple[float, timedelta]] = list()
    for match in time.finditer(text):
        parsed_number = float(match["number"])
        parsed_unit = normalize_unit(match["unit"])
        parsed_pairs.append((parsed_number, parsed_unit))

    # multiply each number by its unit, then sum the resulting timedeltas
    timedeltas = [mul(*pair) for pair in parsed_pairs]
    return reduce(add, timedeltas)

if __name__ == "__main__":
    parsed_examples = [(example, parse(example)) for example in examples]
    # column widths: tup[0] is the input text, tup[1] the parsed timedelta
    longest_example = max(map(lambda tup: len(tup[0]), parsed_examples))
    longest_formatted = max(map(lambda tup: len(f"{tup[1]!s}"), parsed_examples))
    longest_parsed = max(map(lambda tup: len(f"<{tup[1]!r}>"), parsed_examples))
    for example, parsed_example in parsed_examples:
        print(
            f"{example: <{longest_example}s} -> "
            f"{parsed_example!s: <{longest_formatted}s} "
            f"{'<' + repr(parsed_example) + '>': >{longest_parsed}s}"
        )
``````

The number parsing is mimicked from `lark`’s built-in grammar definitions.

The performance for this is better:

``````$ python -m timeit -s "from answer_re import parse, examples" "for example in examples:" " parse(example)"
2000 loops, best of 5: 109 usec per loop
``````

But it’s less readable, and making changes to maintain it will require more work.

## Notes

As-is, both examples behave in a way that doesn’t quite match up with how humans expect time intervals to work:

``````>>> from answer_re import parse
>>> from datetime import datetime
>>> datetime(2000, 1, 1) + parse("9 years")
datetime.datetime(2008, 12, 29, 0, 0)
>>> str(_)
'2008-12-29 00:00:00'
``````

Most people would expect the result to be 2009-01-01 instead. This Stack Overflow question provides a few solutions, one of which uses `dateutil`. Both of the examples above can be adapted by modifying the `units` mapping to use the appropriate `relativedelta` objects.
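
The gap comes from treating a year as a fixed 365 days, which ignores leap years. A standard-library sketch of the difference follows; using `datetime.replace` for the calendar version is an illustrative shortcut, and it would raise `ValueError` for 29 February when the target year is not a leap year:

```python
from datetime import datetime, timedelta

start = datetime(2000, 1, 1)

fixed_length = start + 9 * timedelta(days=365)       # 9 "years" of 365 days
calendar_based = start.replace(year=start.year + 9)  # same date, 9 years on

print(fixed_length)                   # 2008-12-29 00:00:00
print(calendar_based)                 # 2009-01-01 00:00:00
print(calendar_based - fixed_length)  # 3 days, 0:00:00
```

The three missing days are the leap days of 2000, 2004, and 2008.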

This is what the first example would look like:

``````...

units = {
"second": relativedelta(seconds=1),
"minute": relativedelta(minutes=1),
"hour": relativedelta(hours=1),
"day": relativedelta(days=1),
"week": relativedelta(weeks=1),
"month": relativedelta(months=1),
"year": relativedelta(years=1),
}

...
``````

This returns what’s expected:

``````>>> from answer_with_dateutil import parse
>>> fr
``````
