Rust coding challenge wc
Published at 10-05-2026
Table of Contents
- Prologue
- Challenge
- Design
- Parsing commandline arguments
- Reading file from a valid file path
- Tracking counts
- What is missing ?
- Where will my program fail definitely ?
- Improving on wc
- Conclusion
Prologue
Since I have some time to build projects on the side, I decided to build Linux command-line tools in Rust. I found the amazing codingchallenges.fyi, which provides a list of tools to build as challenges. Each challenge is broken into multiple steps that you can test along the way, and for certain challenges tests are provided too.
TL;DR: All the links to the code and my brain dump as I was developing the tool
Challenge
We need to build a clone of wc in any programming language we see fit. For this one it's essential to work with a general-purpose programming language that has access to OS features, such as reading files.
wc - print newline, word, and byte counts for each file
Challenges I faced while building this tool with Rust
- To std::env::args or not ?
- How to read the file efficiently and optimally ?
- Buffer size decision - TL;DR: I simply went with the default, but it's worth thinking about if you are seriously considering building a wc-like utility, even with Rust
- Understanding the beauty of UTF-8 design
- Where will my program definitely fail ?
- What could be a decent improvement on wc for multi-lingual support ?
Design
Let’s understand what we are trying to build before we dive into the coding part.
- wc should run as a command-line program, so you probably need a compiled binary
- Flags that need to be supported as per the challenge
- -l - count lines in the file
- -c - count the bytes in the file
- -w - count the number of words in the file
- -m - count the number of multi-byte characters in the file
- Input should be either a file path or stdin
- File path example: ccwc -l test.txt
- stdin example: cat test.txt | ccwc -l
So at a high level, the program needs to:
- parse the flags and map them into program state,
- read the file path,
- read the file at the given path (or stdin when no path is given),
- track counts / stats for the file
Parsing commandline arguments
Since we want a binary that works on the command line, we need to read command-line arguments. The good old way of doing this is std::env::args, which returns an iterator that yields the arguments passed to the program. The first argument is the program name itself (generally skipped, since it can be unreliable in certain cases).
use std::env;

// Print each argument on a new line.
// Skips the 1st argument, which is the program name itself.
for argument in env::args().skip(1) {
    println!("{argument}");
}
Challenge 1: To std::env::args or not ?
std::env::args works fine with Unicode (UTF-8) arguments, and your Rust program will run just fine, but there is a catch. Since we are working with file paths, it's possible to be handed a file name that is not valid UTF-8, and that will break the program: std::env::args tries to convert each argument of the process into a String, and in Rust strings are guaranteed to be valid UTF-8. If you pass in a file name that is invalid UTF-8, the program panics.
Why do OS / file systems allow non UTF-8 filenames ?
For philosophical and historical reasons, every OS has some quirks around file naming conventions:
- Linux - The kernel is encoding-agnostic. It treats a filename as a sequence of raw bytes with only two strict rules: it must not contain a forward slash (/) and it must not contain a null byte (\0).
- Windows - Windows will always support UTF-16, since its kernel works with UTF-16.
- macOS - Has had first-class support for UTF-8 in OS X since 2001, and started supporting Unicode (UTF-16) as early as 1998 in System 8.5. So macOS has had Unicode for more than two decades, and this has now matured into the APFS file system.
Long story short, it comes down to how each OS was built. Rust's answer to this is OsString: a type that represents owned, mutable platform-native strings, but is cheaply inter-convertible with Rust strings.
So we use std::env::args_os to make sure we deal with filenames properly.
use std::env::args_os;

for arg in args_os().skip(1) {
    // Here arg is an OsString
}
Now that we have found a decent way to read command-line arguments, let's work with it. We get arg as an OsString. It's not a Rust string, but it has a method, to_str, that converts it to a Rust string slice when the contents are valid UTF-8.
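A minimal sketch of how that conversion behaves, assuming a Unix-like target (the `std::os::unix` extension trait used to build the invalid name does not exist on Windows). The helper names `describe_arg` and `invalid_name` are made up for illustration and are not part of the tool.

```rust
use std::ffi::{OsStr, OsString};
use std::os::unix::ffi::OsStringExt; // Unix-only: build an OsString from raw bytes

/// Hypothetical helper: report whether an argument is usable as a Rust &str.
fn describe_arg(arg: &OsStr) -> String {
    match arg.to_str() {
        // Valid UTF-8: to_str() hands back a borrowed &str.
        Some(s) => format!("valid UTF-8: {s}"),
        // Not valid UTF-8: to_str() returns None instead of panicking.
        None => format!("not UTF-8, lossy view: {}", arg.to_string_lossy()),
    }
}

/// Hypothetical helper: a file name containing the byte 0xFF,
/// which never appears in valid UTF-8.
fn invalid_name() -> OsString {
    OsString::from_vec(vec![b'f', b'o', 0xFF, b'o'])
}
```

Calling `describe_arg(&invalid_name())` takes the `None` branch, where `to_string_lossy` substitutes U+FFFD for the bad byte. In the argument loop, that same `to_str` check is combined with a `starts_with` test to pick out the flags: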
if let Some(s) = arg.to_str() && s.starts_with("-") {
    // flags start with - and we want to collect those
}
In order to collect all the flags (either individually or as a combination), we need some sort of struct to track them.
struct Flags {
    lines: bool,
    bytes: bool,
    words: bool,
    multibyte_char_count: bool,
}

impl Flags {
    fn is_any_set(&self) -> bool {
        self.lines || self.bytes || self.words || self.multibyte_char_count
    }

    fn apply_defaults(&mut self) {
        if !self.is_any_set() {
            self.lines = true;
            self.bytes = true;
            self.words = true;
        }
    }
}
We create a simple struct. Structs play roughly the role that classes play for objects in JavaScript. impl means implementation of functionality for a type, so here we are defining methods on the type Flags:
- is_any_set -> checks whether any flag was set by the arguments passed to the program
- apply_defaults -> if no flag is set, sets the default flags
let mut flags: Flags = Flags {
    lines: false,
    bytes: false,
    words: false,
    multibyte_char_count: false,
};
for arg in args_os().skip(1) {
    if let Some(s) = arg.to_str()
        && s.starts_with("-")
    {
        for flag in s.chars().skip(1) {
            match flag {
                'l' => flags.lines = true,
                'w' => flags.words = true,
                'c' => flags.bytes = true,
                'm' => flags.multibyte_char_count = true,
                unknown => {
                    eprintln!("Error: unknown flag '{}'", unknown);
                    std::process::exit(1);
                }
            }
        }
    } else {
        filename = Some(arg);
    }
}
// default flags are applied in case no flag is passed in
flags.apply_defaults();
This records the flags with the two loops. It correctly handles:
- individual flags (-l -c -w -m)
- combinations of flags (-lcw -m)
- applying the same defaults as wc (-lcw) when no flag is given
For filename I wrap the arg in Some(), since a filename may or may not be passed in, given that we also accept input from stdin.
let mut filename: Option<OsString> = None;
So I define the filename variable as such. Option indicates that there will either be some value or None.
filename will be None in this case, which is a valid use case:
cat test.txt | ccwc -l
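The parsing loop above can also be sketched as a standalone function, which makes it easier to test without spawning a process. This is a variant under assumptions, not the code used in the tool: `parse_args` is a hypothetical name, it takes any iterator of `OsString` values instead of calling `args_os` directly, and it returns an error instead of calling `std::process::exit`.

```rust
use std::ffi::OsString;

#[derive(Debug, Default, PartialEq)]
struct Flags {
    lines: bool,
    bytes: bool,
    words: bool,
    multibyte_char_count: bool,
}

/// Hypothetical helper: same flag logic as the loop above, but the caller
/// decides what to do with an unknown flag.
fn parse_args<I>(args: I) -> Result<(Flags, Option<OsString>), String>
where
    I: IntoIterator<Item = OsString>,
{
    let mut flags = Flags::default();
    let mut filename = None;

    for arg in args {
        match arg.to_str() {
            // UTF-8 argument starting with '-': every following char is a flag.
            Some(s) if s.starts_with('-') => {
                for flag in s.chars().skip(1) {
                    match flag {
                        'l' => flags.lines = true,
                        'w' => flags.words = true,
                        'c' => flags.bytes = true,
                        'm' => flags.multibyte_char_count = true,
                        unknown => return Err(format!("unknown flag '{unknown}'")),
                    }
                }
            }
            // Anything else, including non-UTF-8 names, is treated as the file.
            _ => filename = Some(arg),
        }
    }

    // No flag given: report lines, words and bytes, like wc does.
    if !(flags.lines || flags.bytes || flags.words || flags.multibyte_char_count) {
        flags.lines = true;
        flags.bytes = true;
        flags.words = true;
    }
    Ok((flags, filename))
}
```

In the real program this would be called as `parse_args(std::env::args_os().skip(1))`, with the `Err` case printed to stderr and turned into a non-zero exit code at the top level.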
Reading file from a valid file path
There are multiple ways to read a file from a given file path:
- Unbuffered read -> read from a file one byte at a time
pub fn read_unbuffered_one_char() -> io::Result<u64> {
    let mut file = File::open(FILENAME)?;
    let len = file.metadata()?.len() as usize;
    let mut v: Vec<u8> = vec![0u8; len];
    // One read call per byte: this is what makes it slow.
    for index in 0..len {
        file.read_exact(&mut v[index..(index + 1)])?;
    }
    let s = String::from_utf8(v).expect("file is not UTF-8 ?");
    let mut total = 0u64;
    for line in s.lines() {
        total += get_count_from_line(line);
    }
    Ok(total)
}
This is the worst-case scenario, where I read the contents of a file one byte at a time.
- Buffered, allocation of a new string
pub fn read_buffered_allocate_string_every_time() -> io::Result<u64> {
    let file = File::open(FILENAME)?;
    let reader = BufReader::new(file);
    let mut total = 0u64;
    // lines() allocates a fresh String for every line.
    for line in reader.lines() {
        let s = line?;
        total += get_count_from_line(&s);
    }
    Ok(total)
}
We are using the BufReader struct to wrap the file and read it in buffer-sized chunks. The default size of the buffer is 8 KiB.
In case anyone is wondering how I got that:
#[stable(feature = "rust1", since = "1.0.0")]
pub fn new(inner: R) -> BufReader<R> {
    BufReader::with_capacity(DEFAULT_BUF_SIZE, inner)
}

// Bare metal platforms usually have very small amounts of RAM
// (in the order of hundreds of KB)
pub const DEFAULT_BUF_SIZE: usize = if cfg!(target_os = "espidf") { 512 } else { 8 * 1024 };
Just keep following the definitions of BufReader::new and DEFAULT_BUF_SIZE until you land at this code. DEFAULT_BUF_SIZE has a compile-time check on the target OS: the smaller 512-byte buffer is used only for espidf, the target used by ESP chips, and every other target gets 8 KiB.
Using a buffered read is a whole lot better than reading a byte at a time, since the number of system calls is reduced drastically.
- Buffered, reusing the string buffer
pub fn read_buffered_reuse_string() -> io::Result<u64> {
    let file = File::open(FILENAME)?;
    let mut reader = BufReader::new(file);
    let mut string = String::new();
    let mut total = 0u64;
    // read_line appends to the same String; propagate I/O errors with ?
    while reader.read_line(&mut string)? > 0 {
        total += get_count_from_line(&string);
        string.clear();
    }
    Ok(total)
}
This is similar to the previous function. The difference is that we allocate one String and pass it to reader.read_line(). This is much faster than the previous code, since we avoid creating a new string in each iteration.
- Reading the whole string from disk into a giant buffer
pub fn read_buffer_whole_string_into_memory() -> io::Result<u64> {
    let mut file = File::open(FILENAME)?;
    let mut s = String::new();
    file.read_to_string(&mut s)?;
    let mut total = 0u64;
    for line in s.lines() {
        total += get_count_from_line(line);
    }
    Ok(total)
}
This is an extreme version of buffering: we allocate one giant buffer and read the whole file into it all at once. There is a potential downside to this approach. Buffers are allocated in RAM, and Rust puts no limit on the buffer size, so it is entirely possible to ask for a buffer larger than the available RAM, at which point your program will eventually crash with a message like
Out of memory fatal runtime error: memory allocation failed
This is the best approach if you know ahead of time what the file size is and that it is within the limits of RAM. Let's say you want to read a file of 8MB: this is fine. If you have 16GB of RAM and want to read a file of 40GB, it's a bad idea, since it will need more than 40GB of memory to read it.
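To make the system-call argument for the buffered variants above concrete, here is a small sketch that counts how often the underlying reader is asked for data. It is an illustration under assumptions: `CountingReader`, `calls_unbuffered` and `calls_buffered` are hypothetical names, and an in-memory `Cursor` stands in for a real file, so each counted `read` call only represents what would be a system call on a `File`.

```rust
use std::io::{self, BufReader, Cursor, Read};

/// Hypothetical wrapper that counts how often `read` is called on the inner
/// reader. On a real file, each of these calls would be a system call.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Read `data` one byte at a time, straight from the counting reader.
fn calls_unbuffered(data: &[u8]) -> io::Result<usize> {
    let mut reader = CountingReader { inner: Cursor::new(data), calls: 0 };
    let mut byte = [0u8; 1];
    while reader.read(&mut byte)? > 0 {}
    Ok(reader.calls)
}

/// Read `data` one byte at a time again, but through a default BufReader.
fn calls_buffered(data: &[u8]) -> io::Result<usize> {
    let counting = CountingReader { inner: Cursor::new(data), calls: 0 };
    let mut reader = BufReader::new(counting);
    let mut byte = [0u8; 1];
    while reader.read(&mut byte)? > 0 {}
    // get_ref() gives access to the wrapped reader and its call counter.
    Ok(reader.get_ref().calls)
}
```

For 20,000 bytes of input, the unbuffered loop calls the inner reader once per byte plus one final call that reports EOF, while the buffered loop only goes to the inner reader when its 8 KiB buffer runs empty, which is a handful of calls.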
Challenge 2: How to read optimally for this challenge ?
I specifically want to count lines, words, bytes, and multi-byte characters (based on encoding):
- Lines (-l) - are separated by '\n' or '\r\n' (we should be good with \n)
- Words (-w) - are separated by ASCII whitespace (we can use the method is_ascii_whitespace for this)
- Bytes (-c) - 8 bits (1s and 0s) make a byte
- Multi-byte characters (-m) - need to find the character boundaries based on the encoding. I am focusing on UTF-8, although there could be other encodings.
What I want is to quickly look at a byte and tell whether it fits the above criteria. I honestly don't need a String or a Vec for that. BufReader implements a trait called BufRead, which defines these methods:
pub trait BufRead: Read {
    // Required methods
    fn fill_buf(&mut self) -> Result<&[u8]>;
    fn consume(&mut self, amount: usize);

    // Provided methods
    fn has_data_left(&mut self) -> Result<bool> { ... }
    fn read_until(&mut self, byte: u8, buf: &mut Vec<u8>) -> Result<usize> { ... }
    fn skip_until(&mut self, byte: u8) -> Result<usize> { ... }
    fn read_line(&mut self, buf: &mut String) -> Result<usize> { ... }
    fn split(self, byte: u8) -> Split<Self>
        where Self: Sized { ... }
    fn lines(self) -> Lines<Self>
        where Self: Sized { ... }
}
I decided to go with fill_buf and consume, since I need access to the bytes of the file and not a string. There is also no copying of data from the buffer into some data structure like String or Vec, which makes this approach a lot more efficient: what you get is a reference (a slice of the internal buffer).
consume(length) lets me mark that slice of the buffer as done, so that the next fill_buf gives me access to the next slice. Consider this a sliding window over a very large array: each window is at most the buffer size (8 KiB), and once you are done reading one slice you move the window forward, and so on until you reach the end of the file.
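The sliding window is easier to see with a deliberately tiny buffer. The sketch below is an illustration only: `window_sizes` and `demo` are hypothetical helpers, the 4-byte capacity is chosen to make the windows visible, and an in-memory `Cursor` stands in for the file.

```rust
use std::io::{self, BufRead, BufReader, Cursor};

/// Hypothetical helper: record the length of every slice that `fill_buf`
/// hands out before the reader reaches EOF.
fn window_sizes<R: BufRead>(mut reader: R) -> io::Result<Vec<usize>> {
    let mut sizes = Vec::new();
    loop {
        let window = reader.fill_buf()?;
        if window.is_empty() {
            break; // an empty slice means EOF
        }
        let len = window.len();
        sizes.push(len);
        // Mark the whole window as read so the next fill_buf moves forward.
        reader.consume(len);
    }
    Ok(sizes)
}

/// Ten bytes of input behind a 4-byte buffer.
fn demo() -> io::Result<Vec<usize>> {
    let data: &[u8] = b"0123456789";
    window_sizes(BufReader::with_capacity(4, Cursor::new(data)))
}
```

Here `demo()` yields windows of 4, 4 and 2 bytes. With the default `BufReader::new`, the same loop would see windows of up to 8 KiB each.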
Challenge 3: Buffer size -> What should it be ? Just go with default.
The interesting part of this problem is the buffer size. Should I keep the default or change it to something else, like maybe 16KB, 32KB or 64KB ? How do I know which size would be right for which platform ?
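One way to approach that question is to measure. The sketch below is a rough harness under assumptions, not a proper benchmark: `count_newlines` and `time_capacities` are hypothetical names, the candidate sizes are the ones mentioned above, and the timings it returns depend entirely on the machine and the input, so no numbers are claimed here.

```rust
use std::io::{self, BufRead, BufReader, Read};
use std::time::{Duration, Instant};

/// Count newlines using a BufReader with an explicit buffer capacity.
fn count_newlines<R: Read>(source: R, capacity: usize) -> io::Result<usize> {
    let mut reader = BufReader::with_capacity(capacity, source);
    let mut lines = 0;
    loop {
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            break;
        }
        lines += chunk.iter().filter(|&&b| b == b'\n').count();
        let len = chunk.len();
        reader.consume(len);
    }
    Ok(lines)
}

/// Hypothetical harness: run the same input through a few candidate sizes
/// and return (capacity, line count, elapsed time) for each.
fn time_capacities(data: &[u8]) -> io::Result<Vec<(usize, usize, Duration)>> {
    let mut results = Vec::new();
    for capacity in [8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024] {
        let start = Instant::now();
        let lines = count_newlines(data, capacity)?;
        results.push((capacity, lines, start.elapsed()));
    }
    Ok(results)
}
```

The line count must come out the same for every capacity; only the elapsed time is allowed to differ. For a real measurement the input should be a large file on the target platform rather than an in-memory slice, since the point of the buffer is to reduce trips to the OS.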
Tracking counts
Now that we have access to bytes we can add code to get all required counts,
#[derive(Debug)]
struct Stats {
    lines: usize,
    bytes: usize,
    words: usize,
    multibyte_char_count: usize,
}

fn count_stats<R: BufRead>(mut reader: R, flags: &Flags) -> Result<Stats> {
    // initialize the stats
    let mut stats = Stats {
        lines: 0,
        bytes: 0,
        multibyte_char_count: 0,
        words: 0,
    };
}
I am taking in 2 arguments: a mutable reader and a reference to flags.
<R: BufRead> is a trait bound: it introduces a type parameter R and requires the reader's type to implement the BufRead trait. Other ways to define the same function signature:
fn count_stats<R>(mut reader: R, flags: &Flags) -> Result<Stats> where R: BufRead {}
fn count_stats(mut reader: impl BufRead, flags: &Flags) -> Result<Stats> {}
The return type is Result since reader.fill_buf can fail and I want to propagate such errors to the top level, which is the idiomatic way in Rust.
Let’s assume that I am writing all the code within this function, building on top of every change.
loop {
    // get access to a slice of the buffer
    let buffer = reader.fill_buf()?;
    let len = buffer.len();
    // once the reader reaches EOF, break the loop
    if buffer.is_empty() {
        break;
    }
    for &b in buffer {
        // do some counting
    }
    reader.consume(len);
}
We have set up the logic for reading from the buffer and the exit condition:
- reader.fill_buf -> returns a Result<&[u8]>, a slice of bytes from the internal buffer
- buffer.is_empty() -> checks if the buffer is empty, which is true once we have reached EOF
- reader.consume(len) -> marks the bytes as read, so that the next fill_buf returns bytes that have not been read yet
The counting part begins now,
// in_word starts as false and, like is_utf8, is declared outside the read loop
for &b in buffer {
    if b == b'\n' && flags.lines {
        stats.lines += 1;
    }
    if flags.words {
        if b.is_ascii_whitespace() {
            in_word = false;
        } else if !in_word {
            in_word = true;
            stats.words += 1;
        }
    }
    if flags.multibyte_char_count {
        if is_utf8 {
            if (b & 0xC0) != 0x80 {
                stats.multibyte_char_count += 1;
            }
        } else {
            stats.multibyte_char_count += 1;
        }
    }
}
Counting lines
We check if the given byte matches b'\n' (the newline character), and if it does we increment the count.
Counting words
Counting words needs some tracking of whether we are inside a word or outside one: a word starts at the first non-whitespace byte after whitespace, and words can be separated by spaces as well as line breaks.
It's best to rely on Rust's built-in functions in this case, for example the one below, since it's entirely possible for me to miss some of the cases simply because I don't know better !!
#[inline]
pub const fn is_ascii_whitespace(&self) -> bool {
    matches!(*self, b'\t' | b'\n' | b'\x0C' | b'\r' | b' ')
}
if b.is_ascii_whitespace() {
    in_word = false;
} else if !in_word {
    in_word = true;
    stats.words += 1;
}
Counting bytes
One of the easiest stats to count: we just keep adding the len of the buffer to the stats.
let len = buffer.len();
if flags.bytes {
    stats.bytes += len;
}
Counting multi-byte characters
This largely depends on the encoding of the system locale. It was surprising for me to find out that a file is just a bunch of bytes, and that how those bytes are to be interpreted is determined by the locale. That's a historical mess we inherit; nonetheless, let's work with this enlightening information.
So the logic for counting multi-byte characters is this:
- Get the locale from the environment variables
- If the encoding supports multi-byte characters, we add logic to handle that as per the encoding standards
- Else, we report number of bytes as multi-byte character count (same output as -c)
Step 1. Get the system locale
fn is_local_utf8() -> bool {
    let vars = ["LC_ALL", "LC_CTYPE", "LANG"];
    for var in vars {
        if let Ok(val) = env::var(var) {
            let val_lower = val.to_lowercase();
            if val_lower.contains("utf-8") || val_lower.contains("utf8") {
                return true;
            }
        }
    }
    false
}
We look at 3 env variables, LC_ALL, LC_CTYPE and LANG, in that order, to find out whether the encoding is UTF-8. While there are other multi-byte character encodings, I am scoping this problem to UTF-8 only, because supporting all kinds of encodings would get tedious and slow, and the logic for each encoding could be wildly different and really hard to get right. Rust by default only deals with UTF-8 strings; that's not a compromise, that's a sane default. Sometimes you need sane constraints to have a decent program: you can't do everything, and neither should your program.
Step 2. Count multi-byte characters
A multi-byte character is a character that spans multiple bytes; in UTF-8 a character is 1 to 4 bytes long. So while counting bytes we need to know how long the current character is, and what constitutes a single character.
The UTF-8 design is pretty decent compared to other multi-byte character encodings: it has a clear scheme for demarcating characters.
There are start bytes and continuation bytes:
// assume that each x can be 0 or 1
1 byte:  0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
- Start bytes - The first byte of the character. It either starts with 0 (for a 1-byte character) or with a run of 1s followed by a 0 (the number of 1s says how many bytes long the character is)
- Continuation bytes - All the continuation bytes start with 10
// this check is outside the loop
let is_utf8 = is_local_utf8();
// within the loop and inside the for loop
if flags.multibyte_char_count {
    if is_utf8 {
        if (b & 0xC0) != 0x80 {
            stats.multibyte_char_count += 1;
        }
    } else {
        stats.multibyte_char_count += 1;
    }
}
Most crucial check
(b & 0xC0) != 0x80
What does this even mean ?
b -> represents a byte; if the encoding is UTF-8, it is either a start byte or a continuation byte as discussed above
0xC0 -> 11000000
b & 11000000 -> this is a bitwise & operation
0x80 -> 10000000 (the pattern of a continuation byte)
We know 1 & 0 -> 0 and 1 & 1 -> 1, so when we do
  11101011 -> if b is a start byte
& 11000000
-----------
  11000000 -> we get 0xC0
  10110101 -> if b is a continuation byte
& 11000000
-----------
  10000000 -> we get 0x80
So this is all good, but why don't I check
(b & 0xC0) == 0xC0
There is the case of a single-byte character in UTF-8 (where a character is represented by a single byte):
  01110101 -> a single-byte character in UTF-8
& 11000000
-----------
  01000000 -> it's neither 0xC0 nor 0x80
So I can use this one check to handle all the cases, i.e.
(b & 0xC0) != 0x80
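The check can be tried out in isolation. This is a minimal sketch, assuming the input is valid UTF-8 (nothing is validated); the function names `is_continuation`, `count_utf8_chars` and `count_in_chunks` are made up for illustration. The last helper counts the same bytes in fixed-size chunks, the way a buffered reader sees them, to show that the per-byte check does not care where a chunk boundary falls.

```rust
/// True for bytes of the form 10xxxxxx, which continue a multi-byte character.
fn is_continuation(b: u8) -> bool {
    (b & 0xC0) == 0x80
}

/// Count UTF-8 characters by counting every byte that is not a continuation
/// byte. Assumes valid UTF-8; invalid sequences are not detected.
fn count_utf8_chars(bytes: &[u8]) -> usize {
    bytes.iter().filter(|&&b| !is_continuation(b)).count()
}

/// Count the same input chunk by chunk. `chunk_size` must be non-zero.
/// A character split across two chunks is still counted exactly once,
/// because only its start byte is counted.
fn count_in_chunks(bytes: &[u8], chunk_size: usize) -> usize {
    bytes.chunks(chunk_size).map(count_utf8_chars).sum()
}
```

For valid UTF-8 text the result agrees with `text.chars().count()`, whatever chunk size is used.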
What is missing ?
This code passes all the steps mentioned in the challenge, so it's good for the challenge, but there are some gaps I see when I compare it with wc and real-world expectations:
- Support for multiple files: GNU wc accepts multiple filenames and prints a total at the end. My code handles one file. This would need a loop over the OsString args after the flags, per-file stats, and aggregation.
- Maximum line length (-L): is included in modern wc
- Performance optimizations: This is a big one, SIMD for counting or mmap for very large files
- Tests: Missing unit tests and integration tests
- Help and version flags: Although not strictly required, these are great quality-of-life improvements
- Error handling resilience: Error handling is present but does not feel quite complete; this could be improved on easily
- Proper exit codes: I have sprinkled a lot of std::process::exit(1) calls in there, and left them out in some places where they are required
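For the first gap, the aggregation part could look roughly like the sketch below. It is not implemented in the tool: the `Stats` struct here mirrors the one shown earlier but adds derives that the original does not have, and `grand_total` is a hypothetical helper.

```rust
use std::ops::AddAssign;

#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Stats {
    lines: usize,
    bytes: usize,
    words: usize,
    multibyte_char_count: usize,
}

/// Adding two Stats adds each counter, which is all a "total" line needs.
impl AddAssign for Stats {
    fn add_assign(&mut self, other: Stats) {
        self.lines += other.lines;
        self.bytes += other.bytes;
        self.words += other.words;
        self.multibyte_char_count += other.multibyte_char_count;
    }
}

/// Hypothetical helper: fold the per-file stats into one grand total.
fn grand_total(per_file: &[Stats]) -> Stats {
    let mut total = Stats::default();
    for stats in per_file {
        total += *stats;
    }
    total
}
```

The per-file loop would call the counting function once per filename, print each result, push it into a `Vec<Stats>`, and print `grand_total` at the end when more than one file was given.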
Where will my program fail definitely ?
So this program is far from perfect; in fact it will fail in certain cases:
- Invalid UTF-8 for -m: feeding it invalid UTF-8 sequences while it assumes UTF-8 mode could lead to incorrect counts
- Very large files: fill_buf + consume works fine and is efficient, but really large files, say 100GB, could lead to some issues
- Non-ASCII whitespace for words: is_ascii_whitespace() is fast and fine for ASCII text, but some users expect Unicode whitespace to separate words
- Performance with small files: The buffered approach has some overhead; for many tiny files it might be slower than a simpler read_to_string
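The whitespace case is easy to reproduce. The sketch below compares the byte-level approach used in this post with the standard library's `split_whitespace`, which splits on Unicode whitespace. Both function names are hypothetical, and the Unicode variant needs a `&str`, so it only applies when the input is valid UTF-8.

```rust
/// Word count as done in this post: only ASCII whitespace separates words.
fn count_words_ascii(bytes: &[u8]) -> usize {
    let mut in_word = false;
    let mut words = 0;
    for &b in bytes {
        if b.is_ascii_whitespace() {
            in_word = false;
        } else if !in_word {
            in_word = true;
            words += 1;
        }
    }
    words
}

/// Alternative: let the standard library split on Unicode whitespace.
fn count_words_unicode(text: &str) -> usize {
    text.split_whitespace().count()
}
```

For a string whose only separator is U+3000 (ideographic space), such as `"こんにちは\u{3000}世界"`, the ASCII version sees one word and the Unicode version sees two. For plain `"hello world"` both agree.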
Improving on wc
Modern text is rarely pure ASCII. For example, let's consider Marathi or Hindi.
The script has the following structure:
- Vowels - Just like A,E,I,O,U (अ, आ, इ, ई, उ, ऊ, ए, ऐ, ॲ, ओ, औ, ऑ, अं,अः)
- Vowel signs (matras) - These are signs that can be added to consonants to change their sound and meaning (ा,). For example, क (k-uh) is a consonant; add a kana and it becomes का (kaa)
- Consonants - Marathi has 36 consonants, often grouped by how they are pronounced: using your throat, palate, teeth. For example, the consonants pronounced with your throat are these (क ख ग घ ङ)
- Conjuncts - You can connect two consonants ! What it means is that one consonant is broken and connected with a whole one !
None of this can be represented with ASCII alone. Now if we run wc on a consonant with a vowel sign in Marathi,
> echo -n "नी" | wc -m
2 // expected 1
The actual output is 2 but it should be 1, since linguistically it's a single character, while from the encoding's perspective it's two code points (1 consonant + 1 vowel sign). I think counting grapheme clusters, i.e. what humans perceive as a single character, could be an interesting improvement for wc. This could be done with the unicode-segmentation crate.
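The gap between bytes, code points and perceived characters can be inspected with the standard library alone. This is a small sketch: `describe` is a hypothetical helper, and the string is written with escapes for न (U+0928) and ी (U+0940). The standard library has no grapheme-cluster API, so the third number, the one a human would expect, is not computed here.

```rust
/// Hypothetical helper: return (byte count, code point count, code points)
/// for a piece of text.
fn describe(text: &str) -> (usize, usize, Vec<String>) {
    let code_points = text
        .chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect();
    // len() is the UTF-8 byte length; chars() iterates over code points.
    (text.len(), text.chars().count(), code_points)
}
```

`describe("\u{0928}\u{0940}")` reports 6 bytes (what -c counts) and 2 code points (what -m counts), U+0928 and U+0940. Getting to the single perceived character needs grapheme segmentation on top of that, which is where a crate such as unicode-segmentation would come in.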
Conclusion
Overall a great exercise to understand the concepts of buffered reads, handling bytes and encodings, that a file is nothing but a bag of bytes, and to appreciate the challenges of dealing with various encodings. It also made me realize that learning Rust makes me dive deeper into system fundamentals than I otherwise would have.