While asking this question, I realized I didn’t know much about raw strings. For somebody claiming to be a Django trainer, this sucks.
I know what an encoding is, and I know what u'' alone does, since I understand what Unicode is.
But what does r'' do exactly? What kind of string does it result in?
And above all, what the heck does ur'' do?
Finally, is there any reliable way to go back from a Unicode string to a simple raw string?
Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?
There’s not really any “raw string”; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A “raw string literal” is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning “just a backslash” (except when it comes right before a quote that would otherwise terminate the literal): there are no “escape sequences” to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
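A quick way to see the difference is to compare lengths. This is only a small sketch (the variable names are made up for illustration), and it should behave the same in Python 2 and Python 3:

```python
# Compare a normal literal with its raw counterpart.
normal = 'a\nb'    # \n is an escape sequence: one newline character
raw = r'a\nb'      # the backslash and the 'n' are kept as two characters

print(len(normal))  # 3
print(len(raw))     # 4

# Doubling the backslash in a normal literal gives the same string.
print(raw == 'a\\nb')  # True

# The special case: a backslash before a quote stops the quote from
# ending the literal, but the backslash itself stays in the string.
quoted = r'it\'s'
print(len(quoted))  # 5 characters: i, t, backslash, quote, s
```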
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the “except” clause above doesn’t matter) and it looks a bit better when you avoid doubling up each of them — that’s all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that’s very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the “except” clause above).
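The regular expression case can be illustrated with the standard re module. A minimal sketch, where the sample text and variable names are just assumptions for the example:

```python
import re

# '\b' in a normal literal is a backspace character (0x08), so this
# pattern does not mean "word boundary" at all.
not_a_boundary = '\bcat\b'
# In a raw literal the backslashes reach the regex engine untouched.
boundary = r'\bcat\b'

text = 'cat concat cat.'
print(re.findall(not_a_boundary, text))  # []
print(re.findall(boundary, text))        # ['cat', 'cat']

# Without the r prefix every backslash has to be doubled.
print(boundary == '\\bcat\\b')           # True
```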
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by “going back”: there are no intrinsic back and forward directions, because there’s no raw string type; it’s just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*,
u'...' is of course always distinct from just
'...' — the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
E.g., consider (Python 2.6):
    >>> import sys
    >>> sys.getsizeof('ciao')
    28
    >>> sys.getsizeof(u'ciao')
    34
The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type, which stores 8-bit characters, and with the u in front you get the newer unicode type, which can store any Unicode character.
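The same split between byte strings and Unicode strings can be seen by counting. Note that this sketch uses Python 3 naming as an assumption, where bytes plays the role of the old str type and str plays the role of unicode:

```python
# Python 3 naming: bytes is the byte string, str is the Unicode string.
text = 'caf\u00e9'            # 4 Unicode characters, the last one is e-acute
data = text.encode('utf-8')   # the same text as UTF-8 bytes

print(type(text).__name__, len(text))  # str 4
print(type(data).__name__, len(data))  # bytes 5

# Decoding with the right encoding gives the original text back.
print(data.decode('utf-8') == text)    # True
```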
r doesn’t change the type at all, it just changes how the string literal is interpreted. Without the
r, backslashes are treated as escape characters. With the
r, backslashes are treated as literal. Either way, the type is the same.
ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
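That the prefix only affects parsing, not the resulting type, can be checked directly. A small sketch with made-up variable names; the byte-string line uses the Python 3 spelling (rb'...'), which is an assumption beyond the Python 2 discussion here:

```python
plain = 'C:\\temp'   # normal literal, backslash doubled
raw = r'C:\temp'     # raw literal, backslash written once

print(type(plain) is type(raw))  # True
print(plain == raw)              # True

# The same holds for byte-string literals (Python 3 spelling).
print(type(rb'\d') is type(b'\\d'), rb'\d' == b'\\d')  # True True
```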
You can try to convert a Unicode string to an old string using the
str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the
str type if you want to correctly handle unicode characters.
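In Python 2, str() on a unicode object encodes it with the ASCII codec behind the scenes, which is where the exception comes from. The sketch below spells the encoding out with .encode() instead, an assumption made so that it also runs on Python 3:

```python
text = u'na\u00efve'   # contains i-diaeresis, which ASCII cannot represent

try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    print('strict encoding failed:', exc.reason)

# errors='replace' substitutes a question mark for each such character.
lossy = text.encode('ascii', 'replace')
print(lossy)   # b'na?ve' on Python 3

# An encoding that covers the character keeps it intact.
print(text.encode('utf-8').decode('utf-8') == text)  # True
```

The replaced character is gone for good, so this is only useful when losing it is acceptable.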
‘raw string’ means it is stored as it appears. For example, a \ in it is just a backslash instead of the start of an escape sequence.
Let me explain it simply:
In Python 2, you can store a string in 2 different types.
The first one is the str type, a byte string: each character takes 1 byte of memory, so there are only 256 possible values (enough for the English alphabet and simple symbols).
The 2nd type is the unicode type. Unicode can store text in all languages.
By default, Python 2 gives you the str type, but if you want a unicode string you can put u in front of the text, like u'text', or you can call unicode('text').
So u is just a short way to get a unicode object instead of a str. (It is part of the literal syntax, so no function is actually called.) That’s it!
Now the r part: you put it in front of the text to tell Python that the text is raw, so a backslash is not an escape character. r'\n' will not create a newline character. It’s just plain text containing 2 characters.
If you want a unicode string that is also raw, use ur, because ru will raise an error.
NOW, the important part:
You cannot store a single backslash by using r; it’s the only exception.
So this code will produce an error: r'\'
To store a backslash (only one) you need to use '\\'
If you want to store more than 1 character you can still use r: r'\\' will produce 2 backslashes, as you would expect.
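The reason r doesn’t work with a single backslash is that, even in a raw literal, a backslash in front of a quote keeps that quote from ending the literal. So r'\' is read as an unterminated string, and a raw literal can never end in an odd number of backslashes. This is how the syntax is defined, not a bug.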
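The trailing-backslash behaviour can be checked without crashing the script by handing the literal to compile() as source text. A sketch only; is_valid_literal is a hypothetical helper name, not anything built in:

```python
# Hypothetical helper: report whether a piece of source text is a
# valid expression, catching the syntax error instead of crashing.
def is_valid_literal(source):
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

print(is_valid_literal("r'\\'"))     # source text is r'\'   -> False
print(is_valid_literal("r'\\\\'"))   # source text is r'\\'  -> True

one = '\\'    # normal literal: a single backslash
two = r'\\'   # raw literal: two backslashes
print(len(one), len(two))  # 1 2

# A common workaround for a raw string that must end in a backslash.
path = r'C:\temp' + '\\'
print(path[-1] == one)  # True
```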
A “u” prefix denotes the value has type unicode rather than str.
Raw string literals, with an “r” prefix, escape any escape sequences within them, so len(r"\n") is 2. Because they escape escape sequences, you cannot end a string literal with a single backslash: that’s not a valid escape sequence (e.g.