I have a long sequence, and I would like to know how often some sub-sequences occur in this sequence.
I know string.count(s, sub), but it only counts non-overlapping sequences.
Does a similar function which also counts overlapping sequences exist?
As an alternative to writing your own search function, you could use the
In : import re In : haystack = 'abababa baba alibababa' In : needle = 'baba' In : matches = re.finditer(r'(?=(%s))' % re.escape(needle), haystack) In : print [m.start(1) for m in matches] [1, 3, 8, 16, 18]
The above prints out the starting positions of all (potentially overlapping) matches.
If all you need is the count, the following should do the trick:
In : len(re.findall(r'(?=(%s))' % re.escape(needle), haystack)) Out: 5
A simple to understand way to do it is:
def count(sub, string): count = 0 for i in xrange(len(string)): if string[i:].startswith(sub): count += 1 return count count('baba', 'abababa baba alibababa') #output: 5
If you like short snippets, you can make it less readable but smarter:
def count(subs, s): return sum((s[i:].startswith(subs) for i in xrange(len(s))))
This uses the fact that Python can treat boolean like integers.
This should help you :
matches = st = 'abababa baba alibababa' needle = 'baba' for i in xrange(len(st)-len(needle)+1): i = st.find(needle,i,i+len(needle)) if(i >= 0): matches.append(st.find(needle,i,i+len(needle))) print(str(matches))
see it here : http://codepad.org/pmkKXmWB
Did not benchmark it for long strings, see if its efficient enough for your use.
I learnt today that you could use a running index to fetch the next occurrence of your substring:
string = 'bobobobobobobob' # long string or variable here count = 0 start = 0 while True: index = string.find('bob', start) if index >= 0: count += 1 start += 1 else: break print(count)