Coder's Cat

Rust: the weird but safe string

2021-06-30

Ümit Yıldırım

When you begin to learn programming in Rust, your probably first frustration will be the compilation errors from the string.

Rust’s string design is somehow different from most programming languages that have a single string type. Both the Unicode-safe string design and unique ownership mechanism make string in Rust is somehow counterintuitive for beginners.

But, don’t freak out!

Actually, the string data structure is more complicated than most programmers give credit for. And Rust is focused on safety, the compiler will tell you the potential issues about string when you are coding. It’s better to keep safe in development phase, otherwise, you may be lost in debug night.

So, let’s have a bit dive into string in Rust.

What is String

In Rust, String is simply a vector of u8. We can find its definition in source code.

The String type is provided by Rust’s standard library, it is a growable in size and heap-allocated, mutable, string type. When a String variable goes out of scope, the buffer will be freed automatically.

#[derive(PartialOrd, Eq, Ord)]
pub struct String {
vec: Vec<u8>,
}

We can create a String in several ways:

let s = String::new()         // An empty String
let s = "Hello".to_string();
let s = String::from("world");
let s: String = "also this".into();

Rust’s character and string types are designed around Unicode. String is not a sequence of ASCII chars, instead, it is a sequence of Unicode characters. A Rust char type is a 32-bit value holding a Unicode code. String use UTF-8 to encode Unicode characters into a vec.

let hello = String::from("こんにちは");
let hello = String::from("안녕하세요");
let hello = String::from("你好");

And .len() method will return the length measured in bytes, not in chars.

assert_eq!("ಠ_ಠ".len(), 7);
assert_eq!("ಠ_ಠ".chars().count(), 3);

The most un-natural of string is indexing related operation. For instance, you can not just use s[i] to fetch the i-th individual char from a string. Rust strings don’t support indexing?

Why?

Because UTF-8 is a variable-width encoding format, each ASCII character in a string is stored in one byte, while other characters may take up multiple bytes. So, in this format, index into the string’s bytes will not always correlate to a valid Unicode scalar value.

For a string which contains with ASCII character and non-ASCII character, the s[1] means a invalid value.

Instead, we need to access chars with a iterator.

let v = String::from("🗻∈🌏");
let c = v.chars().nth(2);
println!("c: {:?}", c);

But, if you know your string is all ASCII chars, you may use the non-panicking version of get and get_mut to access or modify a part of string in O(1) time. get_mut will return a &mut str, but it still very restrictive, the only operations available on &mut str are make_ascii_uppercase and make_ascii_lowercase, which modify the text in place and affect only single-byte characters.

fn test_get_mut() {
let mut v = String::from("hello");

assert_eq!(Some("he"), v.get_mut(0..2).map(|v| &*v));
let s = v.get_mut(0..2);
let s = s.map(|s| {
s.make_ascii_uppercase();
&*s
});
assert_eq!(Some("HE"), s);
}

If you want to more freedom to modify specific bytes in a String, you need to convert it to a mut vec<u8>:

fn set_char_inplace(text: &mut String, index: usize, c: char) {
let bytes = unsafe { text.as_mut_vec() };
bytes[index] = c as u8;
}

Ok, we talked about much about String right now. Just remember, strings in Rust are guaranteed to be valid UTF-8, and get more familiar with String API if you are confused. UTF-8 Everywhere is a best reference for you to learn why Rust choose UTF-8 as encoding format for some good reasons.

What is &str

str, on the other hand, is a primitive type for compiler, it is also called string slice. We usually use it in the borrowed form: &str.

let a: &str = "hello rust";
//it is same with below
let a: 'static str = "hello rust";
// 'static means this variable will be a reference to this statically allocated string.
// &str cannot be modified,

Essentially, the compiler treats it as a fat pointer, which means: &str is made up of two components: a pointer to some bytes, and a length.
We can use the method as_ptr() to get pointer and len() to get the length of it.

let story = "Hello world";

let ptr = story.as_ptr();
let len = story.len();

String and &str

The relationship between String and &str is like std::string and char* in C++:

  • A Rust String is like a std::string, it owns the memory and does all the dirty job of managing memory.
  • A Rust &str is like a char* (but with length), it points us to the beginning of a chunk with the contents of std::string.

Why &str need to store the length, this is because we can use use &str to pointer to specific part of a String. Let’s have a example:

pub fn main() {
let s: String = String::from("hello");
let a: &str = &s;
let b: &str = &s[1..3];
let c = "beaf";
println!("a: {}, len: {}", a, a.len());
println!("b: {}, len: {}", b, b.len());
println!("c: {}, len: {}", c, c.len());
}

The corresponding diagram is like this:

String or &str?

There are two principles to follow:

  • Use String when you need to own the memory. For example, you created a string in a function and need to return it.

  • Use &str when you want an immutable reference to a String variable (or a string literal in your code).

Remember, Rust will follow move for String when you passing it as parameter to a function:

fn func(a: String) {
println!("{}", a);
}

pub fn main() {
let v = String::from("hello");
func(v);
println!("v: {}", v);
}

This program will report out this error message:

rust-string-error-message

This is because v is moved to the function func and we can not read it in the scope of main anymore. The checker follow the rules of Rust ownership. Rust has this super powerful feature called deref coercing which allows it to turn any &String to a &str.

fn func(a: &str) {
println!("{}", a);
}

pub fn main() {
let v = String::from("hello");
func(&v);
println!("v: {}", v);
}

So, when you are designing a API which only need to access a string as read-only mode, &str is the best choice.

But when you need to modify the content of a String, you can not pass a &mut str which you may think is reasonable. &mut str does exists, but it is not very useful, because a slice cannot reallocate its referent. When you need to modify the content of a string, you need to pass &mut String as parameter, like the above set_char_inplace example.

In the scenario of struct, in most cases, you need to own the string, so String is almost a better choice compared with &str. If you want to use &str as a attribute in struct, you need to handle the lifetime issue.

struct Person<'a> {
name: &'a str,
}
impl<'a> Person<'a> {
fn greet(&self) {
println!("Hello, my name is {}", self.name);
}
}
fn main() {
let person = Person { name: "Coder's Cat" };
person.greet();
}

Other string-like things …

Except for above two kind of strings, Rust also provide other kinds of string-like types for special scenarios. Please check out the reference about them, and here are some general rules we can follow:

  • Stick to String and &str for Unicode text.
  • When working with filenames, use std::path::PathBuf and &Path instead.
  • When working with binary data that isn’t UTF-8 encoded at all, use Vec<u8> and &[u8] instead.
  • When working with environment variable names and command-line arguments in the native form presented by the operating system, use OsString and &OsStr.
  • When interoperating with C libraries that use null-terminated strings, use std::ffi::CString and &CStr.

Join my Email List for more insights, It's Free!😋

Tags: Rust