Rust coding challenge wc

Prologue
Challenge
Design
Parsing commandline arguments
Reading file from a valid file path
Tracking counts
What is missing ?
Where will my program fail definitely ?
Improving on wc
Conclusion

Prologue

Since I have some time for side projects, I decided to build Linux command line tools in Rust. I found codingchallenges.fyi, which lists tools to build as step-by-step challenges that you can test along the way; some challenges also provide tests.

TL;DR: All the links to the code and my brain dump as I was developing the tool

Challenge

We need to build a clone of wc in any programming language we see fit. For this one it’s essential to use a general-purpose language with access to OS features, so that it can read files.

wc - print newline, word, and byte counts for each file

Challenges I faced while building this tool with Rust

To std::env::args or not?
How to read the file efficiently?
Buffer size decision - TL;DR: I simply went with the default, but it’s worth thinking about if you are serious about building a wc-like utility, even with Rust
Understanding the beauty of UTF-8 design
Where will my program definitely fail?
What could be a decent improvement on wc for multilingual support?

Design

Let’s understand what we are trying to build before we dive into coding part.

wc should run as a commandline program, you probably need a compiled binary
Flags that need to be supported as per the challenge
- -l - count lines in the file
- -c - count the bytes in the file
- -w - count the number of words in the file
- -m - count the number of multi-byte characters in the file
Input should be either in the form of file path or stdin
- File path example:
```
ccwc -l test.txt
```
- stdin example:
```
cat test.txt | ccwc -l
```

So at a high level, the program needs to:

parse and map flags into program state,
read the file path,
read the file from the given file path (or stdin),
track counts / stats for the input

Parsing commandline arguments

Since we want a binary that works on the command line, we need to read command-line arguments.

The good old way of doing this is std::env::args, which returns an iterator that yields the arguments passed to the program. The first argument is conventionally the program name itself (generally skipped, since it can be unreliable in certain cases).

use std::env;

// Print each argument on newline
// skips 1st argument which is the program name itself
for argument in env::args().skip(1) {
  println!("{argument}");
}

Challenge 1: To std::env::args or not ?

std::env::args works fine as long as every argument is valid Unicode, but there is a catch. Since we are working with file paths, it’s possible to be handed a file name that is not valid UTF-8. std::env::args converts each process argument into a String, and Rust strings are guaranteed to be valid UTF-8, so when an argument is not valid Unicode the iterator panics and the program crashes.

Why do OS / file systems allow non UTF-8 filenames ?

For philosophical and historical reasons, every OS has its own quirks around file naming conventions:

Linux - The kernel is encoding-agnostic. It treats a filename as a sequence of raw bytes, with only two strict rules: it must not contain a forward slash (/) and it must not contain a null byte (\0).
Windows - File names are sequences of 16-bit units, normally interpreted as UTF-16, because that is what the kernel works with.
macOS - Has had first-class support for UTF-8 since OS X in 2001, and started supporting Unicode (UTF-16) as early as 1998 in Mac OS 8.5. So macOS has had Unicode file names for two decades or more, and that support carries over into APFS.

Long story short, it comes down to how each OS was built. Rust’s answer is OsString - a type that represents owned, mutable platform-native strings, but is cheaply inter-convertible with Rust strings.

So we use std::env::args_os to make sure we deal with the filenames properly.

use std::env::args_os;

for arg in args_os().skip(1) {
  // Here arg is an OsString
}
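To make the difference concrete, here is a small Unix-only sketch. The byte sequence and the helper names `invalid_utf8_name` and `lossy_display` are made up for illustration; the point is only that an OsString can hold a name that a String cannot.

```rust
use std::ffi::{OsStr, OsString};
use std::os::unix::ffi::OsStringExt; // Unix-only extension trait

/// Builds a file name containing 0x80, a lone continuation byte.
/// Linux accepts this as a file name, but it is not valid UTF-8.
/// (Hypothetical example data.)
fn invalid_utf8_name() -> OsString {
    OsString::from_vec(vec![b'f', b'o', 0x80, b'o'])
}

/// Hypothetical helper: a printable form of any argument.
/// Invalid sequences are replaced with U+FFFD instead of panicking.
fn lossy_display(arg: &OsStr) -> String {
    arg.to_string_lossy().into_owned()
}
```

Calling `to_str()` on such a name returns None rather than panicking, which is why the flag check can safely fall through and treat the argument as a file name.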

Now that we have a decent way to read command-line arguments, let’s work with it. Each arg is an OsString, not a Rust string, but it has a to_str method that returns Some(&str) when the contents are valid UTF-8.

// `if let ... && ...` is a let chain; it needs Rust 1.88+ and the 2024 edition
if let Some(s) = arg.to_str() && s.starts_with("-") {
  // flags start with - and we want to collect those
}

In order to collect all the flags (either individual or as a combination), we need some sort of struct to track that

struct Flags {
    lines: bool,
    bytes: bool,
    words: bool,
    multibyte_char_count: bool,
}

impl Flags {
    fn is_any_set(&self) -> bool {
        self.lines || self.bytes || self.words || self.multibyte_char_count
    }

    fn apply_defaults(&mut self) {
        if !self.is_any_set() {
            self.lines = true;
            self.bytes = true;
            self.words = true;
        }
    }
}

We create a simple struct. A struct plays roughly the role a class plays for objects in JavaScript: it defines the shape of the data. impl is where we implement functionality for a type, so here we are defining methods on the type Flags:

is_any_set -> checks whether any flag was set by the arguments passed to the program
apply_defaults -> if no flag is set, turns on the default flags (lines, bytes, and words)

    let mut flags: Flags = Flags {
        lines: false,
        bytes: false,
        words: false,
        multibyte_char_count: false,
    };
    for arg in args_os().skip(1) {
        if let Some(s) = arg.to_str()
            && s.starts_with("-")
        {
            for flag in s.chars().skip(1) {
                match flag {
                    'l' => flags.lines = true,
                    'w' => flags.words = true,
                    'c' => flags.bytes = true,
                    'm' => flags.multibyte_char_count = true,
                    unknown => {
                        eprintln!("Error: unknown flag '{}'", unknown);
                        std::process::exit(1);
                    }
                }
            }
        } else {
            filename = Some(arg);
        }
    }

    // default flags are applied in case no flag is passed in
    flags.apply_defaults();

The loops record the flags, and they correctly handle:

individual flags (-l -c -w -m)
combinations of flags (-lcw -m)
the same defaults as wc (-lcw)

For filename I wrap the arg in Some(), since a filename may or may not be passed in, given that we also accept input from stdin.

let mut filename: Option<OsString> = None;

So I define the filename variable as such, Option indicates either there will be some value or None.

filename will be none in this case:

cat test.txt | ccwc -l

and this is a valid use-case
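If you want to test the parsing without spawning a process, one option is to pull it into a function that takes any iterator of OsString. This is a sketch under my own assumptions, not the code from the repository: `parse_args` is a hypothetical name, the field `chars` stands in for `multibyte_char_count`, and errors are returned as a String instead of calling `std::process::exit`.

```rust
use std::ffi::OsString;

#[derive(Debug, Default)]
struct Flags {
    lines: bool,
    bytes: bool,
    words: bool,
    chars: bool, // stands in for multibyte_char_count
}

/// Hypothetical, testable variant of the argument loop.
fn parse_args<I>(args: I) -> Result<(Flags, Option<OsString>), String>
where
    I: IntoIterator<Item = OsString>,
{
    let mut flags = Flags::default();
    let mut filename = None;
    for arg in args {
        // Copy the flag letters out first so `arg` is free to be moved below.
        let letters: Option<String> = arg
            .to_str()
            .filter(|s| s.starts_with('-'))
            .map(|s| s[1..].to_owned());
        match letters {
            Some(letters) => {
                for letter in letters.chars() {
                    match letter {
                        'l' => flags.lines = true,
                        'w' => flags.words = true,
                        'c' => flags.bytes = true,
                        'm' => flags.chars = true,
                        other => return Err(format!("unknown flag '{other}'")),
                    }
                }
            }
            // Anything that is not a flag (including non-UTF-8 names) is the file.
            None => filename = Some(arg),
        }
    }
    // Same defaults as wc: -lcw when no flag was given.
    if !(flags.lines || flags.bytes || flags.words || flags.chars) {
        flags.lines = true;
        flags.bytes = true;
        flags.words = true;
    }
    Ok((flags, filename))
}
```

In the real binary this would be called as `parse_args(std::env::args_os().skip(1))`, and the caller decides what to print and which exit code to use on Err.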

Reading file from a valid file path

There are multiple ways to read a file from a given file path.

Unbuffered read -> Read from the file one byte at a time

pub fn read_unbuffered_one_char() -> io::Result<u64> {
    let mut file = File::open(FILENAME)?;
    let len = file.metadata().expect("Failed to get metadata").len() as usize;
    let mut v:Vec<u8> = Vec::new();
    v.resize(len,0u8);
    for index in 0..len {
        file.read_exact(&mut v[index..(index+1)])?;
    }
    let s = String::from_utf8(v).expect("file is not UTF-8 ?");
    let mut total = 0u64;
    for line in s.lines() {
        total += get_count_from_line(line);
    }
    Ok(total)
}

This is the worst-case scenario: the contents of the file are read one byte at a time, and every read_exact call on a File goes straight to the OS.

Buffered, allocation of a new string

pub fn read_buffered_allocate_string_every_time() -> io::Result<u64>{
    let file = File::open(FILENAME)?;
    let reader = BufReader::new(file);
    let mut total = 0u64;
    for line in reader.lines() {
        let s = line?;
        total += get_count_from_line(&s);
    }
    Ok(total)
}

We use the BufReader struct to wrap the file and read it in buffer-sized chunks; the default chunk size is 8 KiB.

In case anyone is wondering how I got that number:

#[stable(feature = "rust1", since = "1.0.0")]
pub fn new(inner: R) -> BufReader<R> {
    BufReader::with_capacity(DEFAULT_BUF_SIZE, inner)
}

// Bare metal platforms usually have very small amounts of RAM
// (in the order of hundreds of KB)
pub const DEFAULT_BUF_SIZE: usize = if cfg!(target_os = "espidf") { 512 } else { 8 * 1024 };

Just keep following the definitions of BufReader::new and DEFAULT_BUF_SIZE until you land at this code. DEFAULT_BUF_SIZE is chosen at compile time with a target OS check: it is 512 bytes on espidf (the target used by ESP chips, which have very little RAM) and 8 KiB everywhere else.

A buffered read is a whole lot better than reading a byte at a time, since the number of syscalls drops drastically.
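A rough way to see that difference without a profiler is to count how often the underlying reader gets asked for data. In this sketch `CountingReader` is a hypothetical wrapper, and the source is an in-memory slice, so each counted call only stands in for what would be a read syscall on a real File.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical wrapper that counts calls to `read` on the inner reader.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Reads one byte per call, like the unbuffered version above.
fn calls_unbuffered(data: &[u8]) -> io::Result<usize> {
    let mut reader = CountingReader { inner: data, calls: 0 };
    let mut byte = [0u8; 1];
    while reader.read(&mut byte)? == 1 {}
    Ok(reader.calls)
}

/// Drains the same data through a BufReader with the default capacity.
fn calls_buffered(data: &[u8]) -> io::Result<usize> {
    let mut reader = BufReader::new(CountingReader { inner: data, calls: 0 });
    loop {
        let len = reader.fill_buf()?.len();
        if len == 0 {
            break;
        }
        reader.consume(len);
    }
    Ok(reader.get_ref().calls)
}
```

For 20,000 bytes the unbuffered version makes one call per byte plus a final call that reports end of input, while the buffered one needs only a handful of calls.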

Buffered, reusing the string buffer

pub fn read_buffered_reuse_string() -> io::Result<u64> {
    let file = File::open(FILENAME)?;
    let mut reader = BufReader::new(file);
    let mut string = String::new();
    let mut total = 0u64;
    while reader.read_line(&mut string)? > 0 {
        total += get_count_from_line(&string);
        string.clear();
    }
    Ok(total)
}

This is similar to the previous function. The difference is that we allocate one String and pass it to reader.read_line(). This is much faster than the previous code, since we avoid creating a new String in each iteration.

Reading the whole string from disk into a giant buffer

pub fn read_buffer_whole_string_into_memory() -> io::Result<u64> {
  let mut file = File::open(FILENAME)?;
  let mut s = String::new();
  file.read_to_string(&mut s)?;
  let mut total = 0u64;
  for line in s.lines() {
    total += get_count_from_line(line);
  }
  Ok(total)
}

This is the extreme version of buffering: we allocate one giant-ass buffer and read the whole file into it at once. There is a potential downside. Buffers live in RAM, and Rust puts no limit on the buffer size, so with this approach it’s entirely possible to ask for a buffer larger than the available memory, at which point your program will eventually crash with a message like

Out of memory

fatal runtime error: memory allocation failed

This is the best approach when you know ahead of time what the file size is and that it fits comfortably in RAM. Reading an 8MB file this way is fine. If you have 16GB of RAM and want to read a 40GB file, it’s a bad idea, since it needs at least 40GB of memory to hold the contents.
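One way to act on that is to check the size before slurping. This is only a sketch: `read_whole_if_small` and the `max_bytes` budget are hypothetical names of mine, and the size reported by the metadata is only meaningful for regular files (not for pipes or stdin).

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical helper: read the whole file only when it is at most
/// `max_bytes` long; otherwise return None so the caller can fall back
/// to a buffered, chunk-by-chunk read.
fn read_whole_if_small(path: &Path, max_bytes: u64) -> io::Result<Option<String>> {
    let len = fs::metadata(path)?.len();
    if len > max_bytes {
        return Ok(None);
    }
    // Reserve the space up front; the size is known and within budget.
    let mut contents = String::with_capacity(len as usize);
    File::open(path)?.read_to_string(&mut contents)?;
    Ok(Some(contents))
}
```

Note that read_to_string still fails with an error if the file is not valid UTF-8, which is one more reason the byte-oriented approach below is a better fit for wc.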

Challenge 2: How to read optimally for this challenge ?

I specifically want to count lines, words, bytes, and multi-byte characters (based on encoding),

Lines (-l) - are separated by ‘\n’ or ‘\r\n’ ( we should be good with \n)
Words (-w) - are separated by ascii whitespace ( we can utilize a method is_ascii_whitespace for this)
Bytes (-c) - 8 bits (1s and 0s) make a byte,
Multi bytes characters (-m) - Need to find the boundaries for the character based on encoding, although I am focusing on UTF-8 encoding (there could be other encodings..)

What I want is to quickly look at a byte and tell whether it fits the criteria above. I honestly don’t need a String or a Vec for that. BufReader implements a trait called BufRead, which defines these methods:

pub trait BufRead: Read {
    // Required methods
    fn fill_buf(&mut self) -> Result<&[u8]>;
    fn consume(&mut self, amount: usize);

    // Provided methods
    fn has_data_left(&mut self) -> Result<bool> { ... }
    fn read_until(&mut self, byte: u8, buf: &mut Vec<u8>) -> Result<usize> { ... }
    fn skip_until(&mut self, byte: u8) -> Result<usize> { ... }
    fn read_line(&mut self, buf: &mut String) -> Result<usize> { ... }
    fn split(self, byte: u8) -> Split<Self>
       where Self: Sized { ... }
    fn lines(self) -> Lines<Self>
       where Self: Sized { ... }
}

I decided to go with fill_buf and consume, since I need access to the bytes of the file, not a string. There is also no copying of data from the buffer into some other data structure like String or Vec, which makes this approach a lot more efficient: what you get is a reference (a slice of the internal buffer).

consume(length) - lets me mark that slice of the buffer as done, so that the next fill_buf gives me access to the next slice. Think of it as a sliding window over a very large array: each window is at most one buffer (8 KiB by default), and once you are done reading one slice you move the window forward, and so on until you reach the end of the file.

Challenge 3: Buffer size -> What should it be ? Just go with default.

The interesting part of this problem is the buffer size: should I keep the default or change it to something else, like 16KB, 32KB, or 64KB? How do I know what size is right for which platform? For this challenge I simply went with the default.
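If you do want to experiment, the capacity is easy to change with BufReader::with_capacity. The sketch below uses two hypothetical helpers, `buffered` and `largest_chunk`, to show that the slice handed out by fill_buf never exceeds the capacity you picked; which capacity is actually fastest is something you would have to measure on your own platform.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical helper: wrap any reader with an explicit buffer capacity.
fn buffered<R: Read>(inner: R, capacity: usize) -> BufReader<R> {
    BufReader::with_capacity(capacity, inner)
}

/// Hypothetical helper: drain the reader and report the largest slice
/// that fill_buf handed out along the way.
fn largest_chunk<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut largest = 0;
    loop {
        let len = reader.fill_buf()?.len();
        if len == 0 {
            return Ok(largest);
        }
        largest = largest.max(len);
        reader.consume(len);
    }
}
```

With a real file you would call something like `buffered(File::open(path)?, 64 * 1024)` and time the whole run for a few candidate sizes.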

Tracking counts

Now that we have access to bytes we can add code to get all required counts,

#[derive(Debug)]
struct Stats {
    lines: usize,
    bytes: usize,
    words: usize,
    multibyte_char_count: usize,
}

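// assumes: use std::io::{BufRead, Result};
fn count_stats<R: BufRead>(mut reader: R, flags: &Flags) -> Result<Stats> {
    // initialize the stats
    let mut stats = Stats {
        lines: 0,
        bytes: 0,
        multibyte_char_count: 0,
        words: 0,
    };

    // ... the reading and counting loop from the following sections goes here ...

    Ok(stats)
}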

I am taking in 2 arguments: a mutable reader and a reference to the flags.

<R: BufRead> -> is a trait bound. R is a generic type parameter, and the bound says that whatever type is passed in as the reader must implement the BufRead trait.

Other ways to define the same function signature

fn count_stats<R>(mut reader: R, flags: &Flags) -> Result<Stats> where R: BufRead {}
fn count_stats(mut reader: impl BufRead, flags: &Flags) -> Result<Stats> {}

The return type is Result, since reader.fill_buf can fail and I want to propagate such errors to the top level, which is the idiomatic way in Rust.

Let’s assume that I am writing all the code within this function, and adding on top of every change

loop {
    // get access to a slice of the internal buffer
    let buffer = reader.fill_buf()?;
    let len = buffer.len();
    // once the reader reaches EOF, break the loop
    if buffer.is_empty() {
        break;
    }

    for &b in buffer {
        // do some counting
    }

    // mark the whole slice as processed so the next fill_buf moves on
    reader.consume(len);
}

We have set up the logic for reading from the buffer, and the exit condition:

reader.fill_buf -> returns a Result<&[u8]> -> slice of bytes of an internal buffer
buffer.is_empty() -> checks if the buffer is empty this is true once we have reached EOF
reader.consume(len) -> marks the bytes as read and will allow to return bytes that have not been marked as read.

The counting part begins now,

    for &b in buffer {
        if b == b'\n' && flags.lines {
            stats.lines += 1;
        }

        if flags.words {
            if b.is_ascii_whitespace() {
                in_word = false;
            } else if !in_word {
                in_word = true;
                stats.words += 1;
            }
        }

        if flags.multibyte_char_count {
            if is_utf8 {
                if (b & 0xC0) != 0x80 {
                    stats.multibyte_char_count += 1;
                }
            } else {
                stats.multibyte_char_count += 1;
            }
        }
    }

Counting lines

We check whether the given byte is b'\n' (the newline character), and if it is we increment the count.

Counting words

Counting words needs some state to track whether we are inside a word or outside one. A word spans many bytes and can also straddle two buffer chunks, so in_word is declared before the outer loop and carried from one byte (and one chunk) to the next. A new word starts at the first non-whitespace byte seen after whitespace.

It’s best to rely on Rust’s built-in functions here, for example the one below, since it’s entirely possible for me to miss some of the cases, simply because I don’t know better!

#[inline]
pub const fn is_ascii_whitespace(&self) -> bool {
    matches!(*self, b'\t' | b'\n' | b'\x0C' | b'\r' | b' ')
}

if b.is_ascii_whitespace() {
    in_word = false;
} else if !in_word {
    in_word = true;
    stats.words+=1;
}

Counting bytes

One of the easiest stats to count: we just keep adding the length of each buffer to the total,
```
let len = buffer.len();
if flags.bytes {
    stats.bytes += len;
}
```
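Putting the three counters so far together, a minimal self-contained version could look like the sketch below. It is not the full count_stats from my code: the `Counts` struct and `count` function are simplified, hypothetical names, there are no flags, and -m is left out because it is covered next.

```rust
use std::io::{self, BufRead};

#[derive(Debug, Default, PartialEq)]
struct Counts {
    lines: usize,
    words: usize,
    bytes: usize,
}

/// Counts lines, words, and bytes using only fill_buf and consume.
fn count<R: BufRead>(mut reader: R) -> io::Result<Counts> {
    let mut counts = Counts::default();
    // Lives outside the loop so a word split across two chunks counts once.
    let mut in_word = false;
    loop {
        let buffer = reader.fill_buf()?;
        let len = buffer.len();
        if len == 0 {
            break; // EOF
        }
        counts.bytes += len;
        for &b in buffer {
            if b == b'\n' {
                counts.lines += 1;
            }
            if b.is_ascii_whitespace() {
                in_word = false;
            } else if !in_word {
                in_word = true;
                counts.words += 1;
            }
        }
        reader.consume(len);
    }
    Ok(counts)
}
```

A handy way to exercise the chunk-boundary case is to wrap the input in `BufReader::with_capacity(4, ...)`, which forces fill_buf to hand out 4 bytes at a time, so most words get split across chunks.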
Counting multi-byte characters

This largely depends on the encoding in the system locale. It was surprising for me to find out that a file is just a bunch of bytes, and that how those bytes are interpreted as characters is determined by the locale. That’s a historical mess we inherit; nonetheless, let’s work with this enlightening information.

So the logic for checking for multi-byte characters is this:

1. Get the locale from the environment variables
2. If the encoding supports multi-byte characters, we add logic to handle that as per the encoding standards
3. Else, we report number of bytes as multi-byte character count (same output as -c)

Step 1. Get the system locale
```
fn is_local_utf8() -> bool {
    let vars = ["LC_ALL", "LC_CTYPE", "LANG"];
    for var in vars {
        if let Ok(val) = env::var(var) {
            let val_lower = val.to_lowercase();
            if val_lower.contains("utf-8") || val_lower.contains("utf8") {
                return true;
            }
        }
    }
    false
}
```
We look at 3 environment variables, LC_ALL, LC_CTYPE, and LANG, in that order, to find out whether the encoding is UTF-8. There are other multi-byte character encodings, but I am scoping this problem to UTF-8 only: supporting all kinds of encodings would get tedious and slow, and the logic for each encoding could be wildly different and really hard to get right. Rust by default only deals with UTF-8 strings; that’s not a compromise, that’s a sane default. Sometimes you need sane constraints to have a decent program: you can’t do everything, and neither should your program.
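One subtlety worth noting: the function above returns true if any of the three variables mentions UTF-8, whereas in POSIX locale handling the first variable that is set and non-empty wins (LC_ALL over LC_CTYPE over LANG). A stricter variant might look like the sketch below. The name `is_utf8_locale` and the `lookup` closure are my own assumptions, introduced so the logic can be tested without touching the real environment.

```rust
/// Hypothetical stricter variant: only the first variable that is set and
/// non-empty decides, in the order LC_ALL, LC_CTYPE, LANG.
/// `lookup` abstracts over the environment so tests can fake it.
fn is_utf8_locale<F>(lookup: F) -> bool
where
    F: Fn(&str) -> Option<String>,
{
    for var in ["LC_ALL", "LC_CTYPE", "LANG"] {
        if let Some(value) = lookup(var) {
            if value.is_empty() {
                continue; // an empty value is treated as unset
            }
            let lower = value.to_lowercase();
            return lower.contains("utf-8") || lower.contains("utf8");
        }
    }
    false
}
```

In the program itself the call would be `is_utf8_locale(|name| std::env::var(name).ok())`. With this variant, LC_ALL=C combined with LANG=en_US.UTF-8 is treated as not UTF-8.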

Step 2. Count multi-byte characters

A multi-byte character is a character that spans multiple bytes. In UTF-8 a character is 1 to 4 bytes long, so while walking over the bytes we need to know where each character starts and what constitutes a single character.

The UTF-8 design is pretty decent compared to other multi-byte character encodings: it has a clear scheme for marking the boundaries between characters.

There are start bytes and continuation bytes:
```
x stands for a payload bit (0 or 1)

1 byte : 0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
```
Start bytes - The first byte of a character. It starts either with 0 (a 1-byte character) or with 110, 1110, or 11110 (the number of leading 1s tells you how many bytes long the character is).
Continuation bytes - All continuation bytes start with 10.
```
// this check is outside the loop
let is_utf8 = is_local_utf8();

// inside the outer loop, within the for loop over bytes
if flags.multibyte_char_count {
    if is_utf8 {
        if (b & 0xC0) != 0x80 {
            stats.multibyte_char_count += 1;
        }
    } else {
        stats.multibyte_char_count += 1;
    }
}
```
The most crucial check is

(b & 0xC0) != 0x80

What does this even mean?

b -> a byte; when the encoding is UTF-8 it is either a start byte or a continuation byte, as discussed above

0xC0 -> 11000000 (a mask that keeps only the top two bits)

b & 11000000 -> a bitwise AND

0x80 -> 10000000 (the top two bits of every continuation byte)

We know 1 & 0 -> 0 and 1 & 1 -> 1, so when we apply the mask:
```
   11101011 -> if b is a start byte
 & 11000000
 -----------
   11000000 -> we get 0xC0
 
```
```
   10110101 -> if b is a continuation byte
 & 11000000
 -----------
   10000000 -> 0x80
 
```
So this is all good, but why don’t I check for a start byte directly with

(b & 0xC0) == 0xC0

Because of the single-byte case: an ASCII character is a single byte in UTF-8, and its top two bits are 00 or 01, never 11.
```
   01110101 -> a single byte character in UTF-8
 & 11000000
 -----------
   01000000 -> its not 0x80
 
```
So instead I check that the byte is not a continuation byte, which handles all the cases:

(b & 0xC0) != 0x80
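The same check can be pulled out into a tiny function and compared against what Rust’s own `chars()` reports. `is_char_start` and `count_utf8_chars` are hypothetical names for this sketch, and it assumes the input is valid UTF-8, just like the main loop does.

```rust
/// True for every byte that begins a character: ASCII bytes and
/// multi-byte start bytes. False only for continuation bytes (10xxxxxx).
fn is_char_start(b: u8) -> bool {
    (b & 0xC0) != 0x80
}

/// Counts characters in (assumed valid) UTF-8 by counting start bytes.
fn count_utf8_chars(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| is_char_start(b)).count()
}
```

For example, the string "\u{0928}\u{0940}" (a Devanagari consonant followed by a vowel sign) is 6 bytes long, but only 2 of those bytes are start bytes, so the count is 2, the same as `chars().count()`.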

What is missing ?

This code passes all the steps mentioned in the challenge, so it’s good enough for the challenge, but there are some gaps when I compare it with wc and real-world expectations:

Support for multiple files: GNU wc accepts multiple filenames and prints a total at the end. My code handles one file. Supporting more would need a loop over the OsString args after the flags, per-file stats, and aggregation.
Maximum line length (-L): is included in GNU wc
Performance optimizations: This is a big one, for example SIMD for counting or mmap for very large files
Tests: Unit tests and integration tests are missing
Help and version flags: Although not strictly required, these are great quality-of-life improvements
Error handling resilience: Error handling is present but does not feel complete; this could be improved easily
Proper exit codes: I have sprinkled std::process::exit(1) calls in some places, and left them out in other places where they are required
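For the multiple-files gap, the aggregation part is the easy bit. A possible sketch, using a trimmed-down, hypothetical `Stats` struct and a `total` helper that are not in my code yet:

```rust
use std::ops::AddAssign;

/// Trimmed-down stats for illustration (the -m counter is omitted).
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Stats {
    lines: usize,
    words: usize,
    bytes: usize,
}

impl AddAssign for Stats {
    fn add_assign(&mut self, other: Stats) {
        self.lines += other.lines;
        self.words += other.words;
        self.bytes += other.bytes;
    }
}

/// Hypothetical helper: the "total" row wc prints after the per-file rows.
fn total(per_file: &[Stats]) -> Stats {
    let mut sum = Stats::default();
    for stats in per_file {
        sum += *stats;
    }
    sum
}
```

The harder part is the surrounding loop: collecting every non-flag argument into a Vec<OsString>, opening each file, and deciding what to do when one of them fails.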

Where will my program fail definitely ?

So this program is far from perfect; in fact it will fail in certain cases:

Invalid UTF-8 with -m: In a UTF-8 locale the program assumes the input is valid UTF-8 and never validates it, so feeding it invalid sequences can lead to incorrect counts
Very large files: fill_buf + consume keeps memory use flat, but really large files, say 100GB, could still cause issues; for example, the counters are usize, which is only 32 bits wide on a 32-bit target
Non-ASCII whitespace for words: is_ascii_whitespace() is fast, but it only knows space, tab, newline, form feed, and carriage return. It does not treat vertical tab as whitespace, and some users expect Unicode whitespace to separate words too.
Performance with small files: The buffered approach has some overhead; for many tiny files it might be slower than a simpler read_to_string

Improving on `wc`

Modern text is rarely pure ASCII. For example, let’s consider Marathi or Hindi.

The script has the following structure:

Vowels - Just like A,E,I,O,U (अ, आ, इ, ई, उ, ऊ, ए, ऐ, ॲ, ओ, औ, ऑ, अं,अः)
Vowel signs (Matras) - These are signs that can be added to consonants to change their sound and meaning (ा,). For example, क (k-uh) is a consonant; add a kana and it becomes का (kaa)
Consonants - Marathi has 36 consonants, often grouped by how they are pronounced: using your throat, palate, or teeth. For example, the consonants pronounced with your throat are these (क ख ग घ ङ)
Conjuncts - You can connect two consonants! What this means is that one consonant is broken and connected with a whole one!

This is complex to represent with ASCII alone. Now if we run wc on a consonant with a vowel sign in Marathi,

> echo -n "नी" | wc -m
 2 // expected 1

The actual output is 2 but I would expect 1: linguistically it’s a single character, but from an encoding perspective it’s two code points (1 consonant + 1 vowel sign). Counting grapheme clusters, i.e. what humans perceive as a single character, could be an interesting improvement for wc, and it could be done with the unicode-segmentation crate.
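To get a feel for the gap between code points and perceived characters without pulling in a crate, here is a deliberately crude, standard-library-only approximation. It is not real grapheme segmentation: `is_devanagari_mark` is a hypothetical helper with hand-picked Devanagari ranges (my assumption of which code points are dependent signs), it ignores every other script, and it does not merge conjuncts. Proper segmentation follows the Unicode rules in UAX #29, which is what the unicode-segmentation crate is for.

```rust
/// Crude assumption: Devanagari code points that attach to the previous
/// letter (vowel signs, virama, nukta, anusvara and similar marks).
fn is_devanagari_mark(c: char) -> bool {
    matches!(
        c,
        '\u{0900}'..='\u{0903}'
            | '\u{093A}'..='\u{093C}'
            | '\u{093E}'..='\u{094F}'
            | '\u{0951}'..='\u{0957}'
            | '\u{0962}'..='\u{0963}'
    )
}

/// Approximate "what a reader sees" by not counting attached marks.
/// This is an illustration only, not UAX #29 segmentation.
fn approx_visible_chars(s: &str) -> usize {
    s.chars().filter(|&c| !is_devanagari_mark(c)).count()
}
```

With this, "\u{0928}\u{0940}" (the नी from the example above) has 2 code points but an approximate visible count of 1, which is the number a reader would expect.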

Conclusion

Overall, a great exercise for understanding buffered reads, handling bytes, and encodings, for seeing that a file is nothing but a bag of bytes, and for appreciating the challenges of dealing with various encodings. It also made me realize that learning Rust makes me dive deeper into system fundamentals than I otherwise would have.