Golang Source Code Analysis - string

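A string in Go is immutable (read-only): its bytes cannot be modified in place. You can get its length with len(), but cap() is not defined for strings. A string resembles a slice in that both are backed by an underlying array that holds all the data.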
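
As a small illustration of this immutability, here is a minimal sketch. The helper name withFirstByte is hypothetical and only used for this example. Changing a byte requires copying the string into a []byte first, and the original string is left untouched:

```go
package main

import "fmt"

// withFirstByte returns a copy of s whose first byte is replaced by c.
// Writing s[0] = c directly would not compile, because strings are read-only.
func withFirstByte(s string, c byte) string {
	if len(s) == 0 {
		return s
	}
	b := []byte(s) // copies the bytes of s
	b[0] = c
	return string(b) // copies again into a new string
}

func main() {
	s := "hello"
	t := withFirstByte(s, 'j')
	fmt.Println(s, t, len(s)) // hello jello 5
}
```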

In the runtime package, a string is represented by the following struct:

type stringStruct struct {
	str unsafe.Pointer
	len int
}

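The reflect package also has a StringHeader struct. Comparing it with reflect.SliceHeader shows that the two are similar.

Because a string is read-only and can never grow in place, it needs no Cap field: its capacity is effectively equal to its length.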

type StringHeader struct {
	Data uintptr
	Len  int
}

type SliceHeader struct {
	Data uintptr
	Len  int
	Cap  int
}
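
The difference between the two headers can be observed from user code. The sketch below (the function name headerWords is made up for this example) measures both headers in machine words, assuming the current layout of string and slice values:

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords reports the size of a string header and a slice header,
// measured in machine words.
func headerWords() (strWords, sliceWords uintptr) {
	var s string
	var b []byte
	word := unsafe.Sizeof(uintptr(0))
	return unsafe.Sizeof(s) / word, unsafe.Sizeof(b) / word
}

func main() {
	s, b := headerWords()
	fmt.Println(s, b) // 2 3
}
```

The string header is one word shorter because it has no Cap field.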

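Concatenation

Go supports concatenating strings with +. Note that if the concatenated result is longer than 32 bytes, memory has to be allocated for it; otherwise, when the caller supplies a temporary buffer, the result is stored in that buffer.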

const tmpStringBufSize = 32
// Array type used to hold the result on the stack
type tmpBuf [tmpStringBufSize]byte

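When several strings are joined with +, concatstring2, concatstring3, concatstring4, or concatstring5 is called depending on the number of operands, but all of them end up calling concatstrings.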

concatstrings

func concatstrings(buf *tmpBuf, a []string) string {
	idx := 0
	l := 0
	count := 0
	// Sum the total length of all strings in the slice
	for i, x := range a {
		n := len(x)
		if n == 0 {
			continue
		}
		if l+n < l {
			throw("string concatenation too long")
		}
		l += n
		count++
		idx = i
	}
	if count == 0 {
		return ""
	}

	if count == 1 && (buf != nil || !stringDataOnStack(a[idx])) {
		return a[idx]
	}
	// Allocate space to hold the result
	s, b := rawstringtmp(buf, l)
	for _, x := range a {
		// Copy the string into b
		copy(b, x)
		b = b[len(x):]
	}
	return s
}
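
The two-pass structure of concatstrings (measure first, then copy into a single allocation) can be modeled in ordinary Go. This is only a simplified sketch: concatSketch is a hypothetical name, it has no stack buffer, and unlike the runtime it pays for one extra copy in string(buf).

```go
package main

import "fmt"

// concatSketch is a simplified, user-space model of the two-pass approach:
// first sum the lengths, then copy every part into a single allocation.
func concatSketch(parts ...string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	if total == 0 {
		return ""
	}
	buf := make([]byte, total)
	rest := buf
	for _, p := range parts {
		copy(rest, p)        // copy the part into the remaining space
		rest = rest[len(p):] // advance past what was just written
	}
	return string(buf)
}

func main() {
	fmt.Println(concatSketch("go", "", "lang")) // golang
}
```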

rawstringtmp

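This function allocates space to hold the concatenated result. If the buffer is not nil and the result is at most 32 bytes long, the result is stored directly in the buffer.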

func rawstringtmp(buf *tmpBuf, l int) (s string, b []byte) {
	// Store the concatenated result in the stack buffer
	if buf != nil && l <= len(buf) {
		b = buf[:l]
		s = slicebytetostringtmp(&b[0], len(b))
	} else {
		// Call mallocgc (via rawstring) to allocate memory for the result
		s, b = rawstring(l)
	}
	return
}

Type conversion

Strings frequently need to be converted to and from byte slices. There are two directions:

string to []byte
[]byte to string

var s1 string = "hello"
var b1 []byte = []byte(s1)
var s2 string = string(b1)

slicebytetostring

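A result of at most 32 bytes is placed on the stack (in the tmpBuf buffer, when the caller supplies one): the returned string refers to tmpBuf rather than to heap memory. Conversely, if the byte slice is larger than 32 bytes, memory must be allocated.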

func slicebytetostring(buf *tmpBuf, ptr *byte, n int) (str string) {
	if n == 0 {
		return ""
	}
  
	...

	var p unsafe.Pointer
	// Byte slices of at most 32 bytes go into the temporary buffer
	if buf != nil && n <= len(buf) {
		p = unsafe.Pointer(buf)
	} else {
		// Otherwise allocate on the heap
		p = mallocgc(uintptr(n), nil, false)
	}
	// Fill in the stringStruct fields
	stringStructOf(&str).str = p
	stringStructOf(&str).len = n
	// Copy all bytes from the original []byte into the new memory
	memmove(p, unsafe.Pointer(ptr), uintptr(n))
	return
}

The stringStructOf function used above is shown below; it reinterprets a *string as a *stringStruct.

func stringStructOf(sp *string) *stringStruct {
	return (*stringStruct)(unsafe.Pointer(sp))
}
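
The same reinterpretation trick works outside the runtime. The sketch below uses a hypothetical strHeader type that mirrors stringStruct; this layout is an assumption that holds for current Go implementations but is not guaranteed by the language specification. It also shows that a substring shares the backing array of the original string:

```go
package main

import (
	"fmt"
	"unsafe"
)

// strHeader mirrors the layout of runtime.stringStruct (assumed layout).
type strHeader struct {
	data unsafe.Pointer
	len  int
}

// headerOf views a string variable through the strHeader struct.
func headerOf(sp *string) *strHeader {
	return (*strHeader)(unsafe.Pointer(sp))
}

func main() {
	s := "hello, world"
	sub := s[7:]
	hs, hsub := headerOf(&s), headerOf(&sub)
	fmt.Println(hs.len, hsub.len) // 12 5
	// The substring points 7 bytes into the same backing array.
	fmt.Println(uintptr(hsub.data)-uintptr(hs.data) == 7) // true
}
```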

stringtoslicebyte

This function converts a string to a []byte. It likewise stores the result in different places depending on the string's length.

func stringtoslicebyte(buf *tmpBuf, s string) []byte {
	var b []byte
	// Store the result in the temporary buffer
	if buf != nil && len(s) <= len(buf) {
		*buf = tmpBuf{}
		b = buf[:len(s)]
	} else {
		// Store it on the heap
		b = rawbyteslice(len(s))
	}
	copy(b, s)
	return b
}

rawbyteslice

This function allocates memory for a byte slice on the heap.

func rawbyteslice(size int) (b []byte) {
	// Round the size up to the allocator's size class
	cap := roundupsize(uintptr(size))
	// Allocate memory
	p := mallocgc(cap, nil, false)
	if cap != uintptr(size) {
		memclrNoHeapPointers(add(p, uintptr(size)), cap-uintptr(size))
	}
	*(*slice)(unsafe.Pointer(&b)) = slice{p, size, int(cap)}
	return
}

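Note the call to roundupsize above. It rounds the requested size up to the size class the allocator would use anyway, so the slack becomes usable capacity instead of being wasted.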
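
To make the rounding concrete, here is a rough model of it. Everything in this sketch is an assumption made for illustration: roundUpSketch is a hypothetical function, and the sizeClasses table is a hand-copied excerpt of the small-object size classes, not the runtime's full table.

```go
package main

import "fmt"

// sizeClasses lists the first few small-object size classes in bytes.
// It is an excerpt assumed for illustration only.
var sizeClasses = []int{8, 16, 24, 32, 48, 64, 80, 96, 112, 128}

// roundUpSketch returns the smallest listed class that fits size,
// or size itself when it is larger than every listed class.
func roundUpSketch(size int) int {
	for _, c := range sizeClasses {
		if size <= c {
			return c
		}
	}
	return size
}

func main() {
	for _, n := range []int{8, 33, 49, 100} {
		fmt.Println(n, roundUpSketch(n))
	}
}
```

Under this table a 33-byte request lands in the 48-byte class.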

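For example, the conversion from string to []byte below shows the effect of roundupsize (the exact capacity depends on the Go version and its size classes).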

import (
	"fmt"
	"strings"
)

func main() {
	// Make sure s1 is allocated on the heap
	s1 := strings.Repeat("x", 33)
	b1 := []byte(s1)
	fmt.Println(len(s1))
	fmt.Println(len(b1), cap(b1))
}

// Output
33
33 48

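string is one of the simpler data structures in Go, but concatenation and type conversion carry a performance cost. In performance-critical code, keep the number of conversions to a minimum; where a conversion is unavoidable, unsafe.Pointer and reflect.StringHeader can be used, as covered later.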

Benchmark comparison

Below I convert a 32-byte and a 33-byte []byte to string to compare the performance of the two cases.

import (
	"bytes"
	"testing"
)

func BenchmarkStringSlice(t *testing.B) {
	var b1 []byte = bytes.Repeat([]byte{'a'}, 32)
	for i := 0; i < t.N; i++ {
		var s = string(b1)
		_ = s
	}
}

func BenchmarkStringSlice2(t *testing.B) {
	var b1 []byte = bytes.Repeat([]byte{'a'}, 33)
	for i := 0; i < t.N; i++ {
		var s = string(b1)
		_ = s
	}
}

goos: linux
goarch: amd64
pkg: test/test
cpu: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
BenchmarkStringSlice-2          242390659                5.019 ns/op
BenchmarkStringSlice2-2         37449567                30.84 ns/op

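This shows that byte slices/strings of at most 32 bytes are placed on the stack (in the buffer), while those larger than 32 bytes are allocated on the heap. Because heap allocation involves the memory allocator and the GC, the 33-byte case is about six times slower here (30.84 ns/op versus 5.019 ns/op).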
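
Timing aside, the number of heap allocations can be measured directly with testing.AllocsPerRun. The sketch below uses a hypothetical helper, allocsForConversion. It prints whatever it measures rather than assuming a result, because the outcome depends on the compiler version and on escape analysis.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// allocsForConversion reports the average number of heap allocations
// made by converting an n-byte slice to a string that does not escape.
func allocsForConversion(n int) float64 {
	b := bytes.Repeat([]byte{'a'}, n)
	total := 0
	allocs := testing.AllocsPerRun(1000, func() {
		s := string(b)
		total += len(s) // use s so the conversion is not dead code
	})
	_ = total
	return allocs
}

func main() {
	fmt.Println("32 bytes:", allocsForConversion(32))
	fmt.Println("33 bytes:", allocsForConversion(33))
}
```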

Converting between byte slices and strings with unsafe

In Go, the unsafe package can be used for zero-copy conversion between string and byte slice, but misuse causes problems.

func BytesToString2(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func StringToBytes2(s string) []byte {
	return *(*[]byte)(unsafe.Pointer(&s))
}

The problem with StringToBytes2 above is that StringHeader has two fields while SliceHeader has three: the extra Cap field, which holds the slice's capacity.

type StringHeader struct {
	Data uintptr
	Len  int
}

type SliceHeader struct {
	Data uintptr
	Len  int
	Cap  int
}

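As a result, the cast *(*[]byte)(unsafe.Pointer(&s)) never assigns the Cap field of the SliceHeader: Cap is read from whatever memory follows the string header, so its value is arbitrary and usually very large.

So with a direct cast, cap() may not return the slice's real capacity. Converting a byte slice to a string this way is largely unproblematic; it is the string-to-byte-slice direction that needs care.

There is another way to convert a string to a byte slice that also carries a safety risk: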

func StringToBytes(s string) []byte {
	v := (*reflect.StringHeader)(unsafe.Pointer(&s))
	bs := reflect.SliceHeader{
		Data: v.Data,
		Len:  v.Len,
		Cap:  v.Len,
	}
	return *(*[]byte)(unsafe.Pointer(&bs))
}

Warning: https://pkg.go.dev/reflect#SliceHeader

SliceHeader is the runtime representation of a slice. It cannot be used safely or portably and its representation may change in a later release. Moreover, the Data field is not sufficient to guarantee the data it references will not be garbage collected, so programs must keep a separate, correctly typed pointer to the underlying data.

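Note that Data here is a uintptr integer. Copying StringHeader.Data by value into a SliceHeader means the GC does not see it as a reference, so the memory it points to may later be moved or reclaimed, leaving the copied uintptr as an invalid pointer.

A correct version:

The following code comes from bytesconv.go in the fasthttp library: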
https://github.com/valyala/fasthttp/blob/7a5afddf5b805a022f8e81281c772c11600da2f4/bytesconv.go#L336

func b2s(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func s2b(s string) (b []byte) {
	bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
	sh := *(*reflect.StringHeader)(unsafe.Pointer(&s))
	bh.Data = sh.Data
	bh.Len = sh.Len
	bh.Cap = sh.Len
	return b
}
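
On Go 1.20 or later, the same zero-copy conversions can be written without reflect.StringHeader and reflect.SliceHeader, using unsafe.String, unsafe.StringData, unsafe.Slice and unsafe.SliceData. The following is a minimal sketch, not the fasthttp implementation; the function names bytesToString and stringToBytes are chosen for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b as a string without copying.
// The caller must not modify b afterwards.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// stringToBytes reinterprets s as a []byte without copying.
// The returned slice must never be written to.
func stringToBytes(s string) []byte {
	if len(s) == 0 {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	b := stringToBytes("hello")
	fmt.Println(len(b), cap(b))                 // 5 5
	fmt.Println(bytesToString([]byte("world"))) // world
}
```

Because unsafe.Slice sets both the length and the capacity, cap() returns a well-defined value here. The result still shares memory with the string, so writing to it remains invalid.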

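Summary

Strings in Go are immutable and, like slices, store all their data in an underlying array. When converting between string and byte slice, the result is placed on the stack or on the heap depending on whether its length exceeds 32 bytes, but either way the data must be copied, which can hurt performance. The unsafe package can be used for zero-copy conversion instead; in the end it all comes down to pointer manipulation.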

Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!