March 2024
Go to: Baseline | Solutions | Table of results | Final comments
I saw the One Billion Row Challenge a couple of weeks ago, and it thoroughly nerd-sniped me, so I went to Go solve it.
I’m late to the party, as the original competition was in January. It was also in Java. I’m not particularly interested in Java, but I’ve been interested in optimising Go code for a while.
This challenge was particularly simple: process a text file of weather station names and temperatures, and for each weather station, print out the minimum, mean, and maximum. There are a few other constraints to make it simpler, though I ignored the Java-specific ones.
Here are a few lines of example input:
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John’s;15.2
Cracow;12.6
…
The only catch: the input file has one billion rows (lines). That’s about 13GB of data. I’ve already figured out that disk I/O is no longer the bottleneck – it’s usually memory allocations and parsing that slow things down in a program like this.
This article describes the nine solutions I wrote in Go, each faster than the previous. The first, a simple and idiomatic solution, runs in 1 minute 45 seconds on my machine, while the last one runs in about 4 seconds. As I go, I’ll show how I used Go’s profiler to see where the time was being spent.
The run-down of solutions is as follows, slowest to fastest:
r1: simple and idiomatic
r2: map with pointer values
r3: parse temperatures by hand
r4: fixed point integers
r5: avoid bytes.Cut
r6: avoid bufio.Scanner
r7: custom hash table
r8: parallelise r1
r9: parallelise r7
I wanted each of the solutions to be portable Go using only the standard library: no assembly, no unsafe, and no memory-mapped files. And 4 seconds, or 3.2GB/s, was fast enough for me. For comparison, the fastest, heavily-optimised Java solution runs in just under a second on my machine – not bad!
There are several other Go solutions out there already, and at least one nice write-up. Mine is faster than some solutions, but slightly slower than the fastest one. However, I didn’t look at any of these before writing mine – I wanted my solutions to be independent.
If you just care about the numbers, skip down to the table of results.
Baseline
Here are a few different baseline measurements to set the stage. First, how long does it take to just read 13GB of data, using cat:
$ time cat measurements.txt>/dev/null
0m1.052s
Note that that’s a best-of-five measurement, so I’m allowing the file to be cached. Who knows whether Linux will allow all 13GB to be kept in disk cache, though presumably it does, because the first time it took closer to 6 seconds.
For comparison, to actually do something with the file is significantly slower: wc takes almost a minute:
$ time wc measurements.txt
1000000000 1179173106 13795293380 measurements.txt
0m55.710s
For a simple solution to the actual problem, I’d probably start with AWK. This solution uses Gawk, because sorting the output is easier with its asorti function. I’m using the -b option to use “characters as bytes” mode, which makes things a bit faster:
$ time gawk -b -f 1brc.awk measurements.txt>measurements.out
7m35.567s
I’m sure I can beat 7 minutes with even a simple Go solution, so let’s start there.
I’m going to start by optimising the sequential, single-core version (solutions 1-7), and then parallelise it (solutions 8 and 9). All my results are done using Go 1.21.5 on a linux/amd64 laptop with a fast SSD drive and 32GB of RAM.
Many of my solutions, and most of the fastest solutions, assume valid input. For example, that the temperatures have exactly one decimal digit. Several of my solutions will cause a runtime panic, or produce incorrect output, if the input isn’t valid.
Solution 1: simple and idiomatic Go
I wanted my first version to be simple, straight-forward code using the tools in the Go standard library: bufio.Scanner to read the lines, strings.Cut to split on the ‘;’, strconv.ParseFloat to parse the temperatures, and an ordinary Go map to accumulate the results.
I’ll include this first solution in full (but after that, show only the interesting bits):
func r1(inputPath string, output io.Writer) error {
type stats struct {
min, max, sum float64
count int64
}
f, err :=os.Open(inputPath)
if err !=nil {
return err
}
defer f.Close()
stationStats :=make(map[string]stats)
scanner :=bufio.NewScanner(f)
for scanner.Scan() {
line :=scanner.Text()
station, tempStr, hasSemi :=strings.Cut(line, “;”)
if !hasSemi {
continue
}
temp, err :=strconv.ParseFloat(tempStr, 64)
if err !=nil {
return err
}
s, ok :=stationStats[station]
if !ok {
s.min=temp
s.max=temp
s.sum=temp
s.count=1
} else {
s.min=min(s.min, temp)
s.max=max(s.max, temp)
s.sum +=temp
s.count++
}
stationStats[station]=s
}
stations :=make([]string, 0, len(stationStats))
for station :=range stationStats {
stations=append(stations, station)
}
sort.Strings(stations)
fmt.Fprint(output, “{“)
for i, station :=range stations {
if i> 0 {
fmt.Fprint(output, “, “)
}
s :=stationStats[station]
mean :=s.sum / float64(s.count)
fmt.Fprintf(output, “%s=%.1f/%.1f/%.1f”, station, s.min, mean, s.max)
}
fmt.Fprint(output, “}n”)
return nil
}
This basic solution processes the one billion rows in 1 minute and 45 seconds. A definite improvement over 7 minutes for the AWK solution.
Solution 2: map with pointer values
I’d learned doing my count-words program that we’re doing a bunch more hashing than we need to. For each line, we’re hashing the string twice: once when we try to fetch the value from the map, and once when we update the map.
To avoid that, we can use a map[string]*stats (pointer values) and update the pointed-to struct, instead of a map[string]stats and updating the hash table itself.
However, I first wanted to confirm that using the Go profiler. It only takes a few lines to add CPU profiling to a Go program.
$ ./go-1brc -cpuprofile=cpu.prof -revision=1 measurements-10000000.txt>measurements-10000000.out
Processed 131.6MB in 965.888929ms
$ go tool pprof -http=: cpu.prof
…
Those commands produced the following profile of solution 1, run over a cut-down 10 million row input file:
Map operations are taking a full 30% of the time: 12.24% for assign and 17.35% for lookup. We should be able to get rid of most of the map assign time by using a pointer value.
As a side note, this profile image also shows us where all the rest of the time is going:
Scanning lines with Scanner.Scan
Finding the ‘;’ with strings.Cut
Parsing the temperature with strconv.ParseFloat
Calling Scanner.Text, which allocates a string for the line
In any case, my second solution is just a small tweak to the map operations:
stationStats :=make(map[string]*stats)
scanner :=bufio.NewScanner(f)
for scanner.Scan() {
// …
s :=stationStats[station]
if s==nil {
stationStats[station]=&stats{
min: temp,
max: temp,
sum: temp,
count: 1,
}
} else {
s.min=min(s.min, temp)
s.max=max(s.max, temp)
s.sum +=temp
s.count++
}
}
In the common case where the station exists in the map, we now only do one map operation, s :=stationStats[station], so that hashing the station name and accessing the hash table only has to be done once. If it’s in the map already – the common case in one billion rows – we update the existing pointed-to struct.
It doesn’t help a ton, but it is something: using pointer values in the map takes our time down from 1 minute 45 seconds to 1 minute 31 seconds.
Solution 3: avoid strconv.ParseFloat
My third solution is where things get a bit more hard-core: parse the temperature using custom code instead of strconv.ParseFloat. The standard library function handles a ton of edge cases we don’t need to support for the simple temperatures our input has: 2 or 3 digits in the format 1.2 or 34.5 (and some with a minus sign in front).
Also, strconv.ParseFloat took a string argument, and now that we’re no longer calling that, we can get away with using the byte slice directly from Scanner.Bytes, instead of allocating and copying a string with Scanner.Text.
Now we parse the temperature this way:
negative :=false
index :=0
if tempBytes[index]==’-‘ {
index++
negative=true
}
temp :=float64(tempBytes[index] – ‘0’) // parse first digit
index++
if tempBytes[index] !=’.’ {
temp=temp*10 + float64(tempBytes[index]-‘0’) // parse optional second digit
index++
}
index++ // skip ‘.’
temp +=float64(tempBytes[index]-‘0’) / 10 // parse decimal digit
if negative {
temp=-temp
}
Not pretty, but not exactly rocket science either. This gets our time down from to 1 minute 31 seconds to under a minute: 55.8 seconds.
Solution 4: fixed point integers
Back in the olden days, floating point instructions were a lot slower than integer ones. These days, they’re only a little slower, but it’s probably worth avoiding them if we can.
For this problem, each temperature has a single decimal digit, so it’s easy to use fixed point integers to represent them. For example, we can represent 34.5 as the integer 345. Then at the end, just before we print out the results, we convert them back to floats.
So my fourth solution is basically the same as solution 3, but with the stats struct field as follows:
type stats struct {
min, max, count int32
sum int64
}
Then, when printing out the results, we need to divide by 10:
mean :=float64(s.sum) / float64(s.count) / 10
fmt.Fprintf(output, “%s=%.1f/%.1f/%.1f”,
station, float64(s.min)/10, mean, float64(s.max)/10)
I’ve used 32-bit integers for minimum and maximum temperatures, as the highest we’ll probably have is about 500 (50 degrees Celsius). We could use int16, but from previous experience, modern 64-bit CPUs are slightly slower when handling 16-bit integers than 32-bit ones. In my tests just now it didn’t seem to make a measurable difference, but I opted for 32-bit anyway.
Using integers cut the time down from 55.8 seconds to 51.0 seconds, a small win.
Solution 5: avoid bytes.Cut
To make solution 5 I recorded another profile (of solution 4):
Okay, now it’s getting harder. The map operations dominate, and moving to a custom hash table will be a bit tricky. So will getting rid of the bufio.Scanner. Let’s procrastinate and get rid of the bytes.Cut.
I thought of a simple way to save time. If we look at a line, for example:
It’s going to be faster to parse the temperature from the end and find the ‘;’ there than to scan through the whole station name to look for the ‘;’. This rather ugly code does just that:
end :=len(line)
tenths :=int32(line[end-1] – ‘0’)
ones :=int32(line[end-3] – ‘0’) // line[end-2] is ‘.’
var temp int32
var semicolon int
if line[end-4]==’;’ { // positive N.N temperature
temp=ones*10 + tenths
semicolon=end – 4
} else if line[end-4]==’-‘ { // negative -N.N temperature
temp=-(ones*10 + tenths)
semicolon=end – 5
} else {
tens :=int32(line[end-4] – ‘0’)
if line[end-5]==’;’ { // positive NN.N temperature
temp=tens*100 + ones*10 + tenths
semicolon=end – 5
} else { // negative -NN.N temperature
temp=-(tens*100 + ones*10 + tenths)
semicolon=end – 6
}
}
station :=line[:semicolon]
Avoiding bytes.Cut cut the time down from 51.0 seconds to 46.0 seconds, another small win.
Solution 6: avoid bufio.Scanner
Now we’re going to try to get rid of the bufio.Scanner. If you think about it, to find the end of each line, the scanner has to look through all the bytes to find the newline character. Then we process many of the bytes again to parse the temperature and find the ‘;’. So let’s try to combine those steps and throw bufio.Scanner out the window.
In solution 6 we allocate a 1MB buffer to read the file in large chunks, look for the last newline in the chunk to ensure we’re not chopping a line in half, and then process each chunk. That looks like this:
buf :=make([]byte, 1024*1024)
readStart :=0
for {
n, err :=f.Read(buf[readStart:])
if err !=nil && err !=io.EOF {
return err
}
if readStart+n==0 {
break
}
chunk :=buf[:readStart+n]
newline :=bytes.LastIndexByte(chunk, ‘n’)
if newline=len(items) {
hashIndex=0
}
}
}
readStart=copy(buf, remaining)
}
The payoff for all this code is large: the custom hash table cuts down the time from 41.3 seconds to 25.8s.
Solution 8: process chunks in parallel
In solution 8 I wanted to add some parallelism. However, I thought I’d go back to the simple and idiomatic code from my first solution, which uses bufio.Scanner and strconv.ParseFloat, and parallelise that. Then we can see whether optimising or parallelising gives better results – and in solution 9 we’ll do both.
It’s straight-forward to parallelise a map-reduce problem like this: split the file into similar-sized chunks (one for each CPU core), fire up a thread (in Go, a goroutine) to process each chunk, and then merge the results at the end.
Here’s what that looks like at a high level:
// Determine non-overlapping parts for file split (each part has offset and size).
parts, err :=splitFile(inputPath, maxGoroutines)
if err !=nil {
return err
}
// Start a goroutine to process each part, returning results on a channel.
resultsCh :=make(chan map[string]r8Stats)
for _, part :=range parts {
go r8ProcessPart(inputPath, part.offset, part.size, resultsCh)
}
// Wait for the results to come back in and aggregate them.
totals :=make(map[string]r8Stats)
for i :=0; i
>>> Read full article>>>
Copyright for syndicated content belongs to the linked Source : Hacker News – https://benhoyt.com/writings/go-1brc/