{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0a687b21",
   "metadata": {},
   "source": [
    "(test)=\n",
    "\n",
    "# Test your code\n",
    "\n",
    "```{epigraph}\n",
    "Most scientists who write software constantly test their code. That is, if you are a scientist writing software, I am sure that you have tried to see how well your code works by running every new function you write, examining the inputs and the outputs of the function, to see if the code runs properly (without error), and to see whether the results make sense. Automated code testing takes this informal practice, makes it formal, and automates it, so that you can make sure that your code does what it is supposed to do, even as you go about making changes around it.\n",
    "\n",
    "---[Ariel Rokem](https://github.com/uwescience/shablona)\n",
    "```\n",
    "\n",
    "Automated testing is one of the most powerful techniques that professional programmers use to make code robust. Having never used testing until I went to industry, it changed the way I write code for the better.\n",
    "\n",
    "## Testing to maintain your sanity\n",
    "\n",
    "When you run an experiment and the results of the analysis don't make sense, you will go through a process of eliminating one potential cause after the other. You will investigate several hypotheses, including:\n",
    "\n",
    "- the data is bad\n",
    "- you're loading the data incorrectly\n",
    "- your model is incorrectly implemented\n",
    "- your model is inappropriate for the data\n",
    "- the statistical test you used is inappropriate for the data distribution\n",
    "\n",
    "Testing can help you maintain your sanity by decreasing the surface of things that might be wrong with your experiment. Good code yells loudly when something goes wrong. Imagine that you had an experimental setup that alerted you when you had a ground loop, or that would sound off when you use the wrong reagent, or that would text you when it's about to overheat: how many hours or days would you save?\n",
    "\n",
    "## Unit testing by example\n",
    "\n",
    "Unit testing is the practice of testing a _unit_ of code, typically a single function. The easiest way to understand what that means is to illustrate it with a specific example. The Fibonacci sequence is defined as:\n",
    "\n",
    "$$F(x) \\equiv F(x-1) + F(x-2)$$\n",
    "$$F(0) \\equiv 0 $$\n",
    "$$F(1) \\equiv 1 $$\n",
    "\n",
    "[The first few items in the Fibonacci sequence](https://oeis.org/A000045) are:\n",
    "\n",
    "$$F = 0, 1, 1, 2, 3, 5, 8, 13, 21, \\ldots$$\n",
    "\n",
    "Let's write up a naive implementation of this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "23a3e448",
   "metadata": {},
   "outputs": [],
   "source": [
    "def fib(x):\n",
    "    if x <= 2:\n",
    "        return 1\n",
    "    else:\n",
    "        return fib(x - 1) + fib(x - 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d63a0df0",
   "metadata": {},
   "source": [
    "Let's say that a colleague brings you this code and asks you to check that the code they've written up works. How would check whether this code works?\n",
    "\n",
    "````{dropdown} Spoilers\n",
    "You could run this code on the command line with different inputs and check that the code works as expected. For instance, you expect that:\n",
    "\n",
    "```pycon\n",
    ">>> fib(0)\n",
    "0\n",
    ">>> fib(1)\n",
    "1\n",
    ">>> fib(2)\n",
    "1\n",
    ">>> fib(6)\n",
    "8\n",
    ">>> fib(40)\n",
    "102334155\n",
    "```\n",
    "\n",
    "You could also run the code with bad inputs, to check whether the code returns meaningful errors. For example, the sequence is undefined for negative numbers or non-integers.\n",
    "````\n",
    "\n",
    "Informal testing can be done in an interactive computing environment, like the `ipython` REPL or a jupyter notebook. Run the code, check the output, repeat until the code works right---it's a workflow you've probably used as well.\n",
    "\n",
    "### Lightweight formal tests with `assert`\n",
    "\n",
    "One issue with informal tests is that they often have a short shelf life. Once the code is written and informal testing is over, you don't have a record of that testing. You might even discard the tests you wrote in jupyter! We can make our tests stick with `assert`.\n",
    "\n",
    "`assert` is a special statement in Python that throws an error whenever the statement is false. For instance,\n",
    "\n",
    "```\n",
    ">>> assert 1 == 0\n",
    "Traceback (most recent call last):\n",
    "  File \"<stdin>\", line 1, in <module>\n",
    "AssertionError\n",
    "```\n",
    "\n",
    "Notice that there are no parentheses between `assert` and the statement. `assert` is great for inline tests, for example checking whether the shape or a matrix is as expected after permuting its indices.\n",
    "\n",
    "We can also assemble multiple assert operations to create a lightweight test suite. You can hide your asserts behind an `__name__ == '__main__'` statement, so that they will only run when you directly run a file. Let's write some tests in `fib.py`:\n",
    "\n",
    "```\n",
    "def fib(x):\n",
    "    if x <= 2:\n",
    "        return 1\n",
    "    else:\n",
    "        return fib(x - 1) + fib(x - 2)\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    assert fib(0) == 0\n",
    "    assert fib(1) == 1\n",
    "    assert fib(2) == 1\n",
    "    assert fib(6) == 8\n",
    "    assert fib(40) == 102334155\n",
    "    print(\"Tests passed\")\n",
    "```\n",
    "\n",
    "Now we can run the tests from the command line:\n",
    "\n",
    "```console\n",
    "$ python fib.py\n",
    "Traceback (most recent call last):\n",
    "  File \"fib.py\", line 8, in <module>\n",
    "    assert fib(0) == 0\n",
    "AssertionError\n",
    "```\n",
    "\n",
    "We see our test suite fail immediately for `fib(0)`. We can fix up the boundary conditions of the code, and run the code again. We repeat this process until all our tests pass. Let's look at the fixed up code:\n",
    "\n",
    "```\n",
    "def fib(x):\n",
    "    if x == 0:\n",
    "        return 0\n",
    "    if x == 1:\n",
    "        return 1\n",
    "    else:\n",
    "        return fib(x - 1) + fib(x - 2)\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    assert fib(0) == 0\n",
    "    assert fib(1) == 1\n",
    "    assert fib(2) == 1\n",
    "    assert fib(6) == 8\n",
    "    assert fib(40) == 102334155\n",
    "    print(\"Tests passed\")\n",
    "```\n",
    "\n",
    "While the first few tests pass, the last one hangs for a long time. What's going on here?\n",
    "\n",
    "### Refactoring with confidence with tests\n",
    "\n",
    "Our `fib(N)` function hangs for a large value of `N` because it spawns a lot of repeated computation. `fib(N)` calls both `fib(N-1)` and `fib(N-2)`. In turn, `fib(N-1)` calls `fib` twice, and so on and so forth. Therefore, the time complexity of this function scales exponentially as $O(2^N)$: it's very slow.\n",
    "\n",
    "We can re-implement this function so that it keeps a record of previously computed values. One straightforward way of doing this is with a global cache. **We keep our previously implemented tests**, and rewrite the function:\n",
    "\n",
    "```\n",
    "cache = {}\n",
    "def fib(x):\n",
    "    global cache\n",
    "    if x in cache:\n",
    "        return cache[x]\n",
    "    if x == 0:\n",
    "        return 0\n",
    "    elif x == 1:\n",
    "        return 1\n",
    "    else:\n",
    "        val = fib(x - 1) + fib(x - 2)\n",
    "        cache[x] = val\n",
    "        return val\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    assert fib(0) == 0\n",
    "    assert fib(1) == 1\n",
    "    assert fib(2) == 1\n",
    "    assert fib(6) == 8\n",
    "    assert fib(40) == 102334155\n",
    "    print(\"Tests passed\")\n",
    "```\n",
    "\n",
    "Running this new and improved script, we see:\n",
    "\n",
    "```console\n",
    "$ python fib.py\n",
    "Tests passed\n",
    "```\n",
    "\n",
    "Hurray! We can be confident that our code works as expected. What if we want to refactor our code so that it doesn't use globals? Not a problem, we keep the tests around, and we rewrite the code to use an inner function:\n",
    "\n",
    "```\n",
    "def fib(x):\n",
    "    cache = {}\n",
    "    def fib_inner(x):\n",
    "        nonlocal cache\n",
    "        if x in cache:\n",
    "            return cache[x]\n",
    "        if x == 0:\n",
    "            return 0\n",
    "        elif x == 1:\n",
    "            return 1\n",
    "        else:\n",
    "            val = fib_inner(x - 1) + fib_inner(x - 2)\n",
    "            cache[x] = val\n",
    "            return val\n",
    "    return fib_inner(x)\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    assert fib(0) == 0\n",
    "    assert fib(1) == 1\n",
    "    assert fib(2) == 1\n",
    "    assert fib(6) == 8\n",
    "    assert fib(40) == 102334155\n",
    "    print(\"Tests passed\")\n",
    "```\n",
    "\n",
    "Running the module again, our tests still pass! Testing helps us refactor with confidence because we can immediately tell whether we've introduced new bugs in our code.\n",
    "\n",
    "### Testing pure functions\n",
    "\n",
    "With pure functions, such as `fib`, we can readily come up with ways to test whether the code works or not. We can check:\n",
    "\n",
    "- _Correctness for typical inputs_, e.g. $F(5) = 5$\n",
    "- _Edge cases_, e.g. $F(0) = 0$\n",
    "- _Errors_ with bad input, e.g. $F(-1)$ $\\rightarrow$ _error_\n",
    "- _Functional goals are achieved_, e.g. that the function works for large numbers\n",
    "\n",
    "Pure functions don't require elaborate setups to test properly, and indeed they have some of the highest _bang for your buck_ when it comes to testing. If in your current workflow, you would have manually checked whether a procedure yielded reasonable results, write a test for it.\n",
    "\n",
    "```{tip}\n",
    "If something caused a bug, write a test for it. 70% of bugs are old bugs that keep reappearing.\n",
    "```\n",
    "\n",
    "### Testing with a test suite\n",
    "\n",
    "Testing with `assert` hidden behind `__name__ == '__main__'` works great for small-scale testing. However, once you have a lot of tests, it starts to make sense to group them into a _test suite_ and run them with a _test runner_. There are two main frameworks to run unit tests in Python, `pytest` and `unittest`. `pytest` is the more popular of the two, so I'll cover that here.\n",
    "\n",
    "To install pytest on your system, first run:\n",
    "\n",
    "```python\n",
    "pip install -U pytest\n",
    "```\n",
    "\n",
    "Writing a test suite for pytest is a matter of taking our previous unit tests and putting them in a separate file, wrapping them in functions which start with `test_`. In `tests/test_fib.py`, we write:\n",
    "\n",
    "```\n",
    "from src.fib import fib\n",
    "import pytest\n",
    "\n",
    "def test_typical():\n",
    "    assert fib(1) == 1\n",
    "    assert fib(2) == 1\n",
    "    assert fib(6) == 8\n",
    "    assert fib(40) == 102334155\n",
    "\n",
    "def test_edge_case():\n",
    "    assert fib(0) == 0\n",
    "\n",
    "def test_raises():\n",
    "    with pytest.raises(NotImplementedError):\n",
    "        fib(-1)\n",
    "\n",
    "    with pytest.raises(NotImplementedError):\n",
    "        fib(1.5)\n",
    "```\n",
    "\n",
    "Notice that pytest primarily relies on the `assert` statement to do the heavy lifting. `pytest` also offers extra functionality to deal with special test cases. `pytest.raises` creates a context manager to verify that a function raises an expected exception.\n",
    "\n",
    "Running the `pytest` utility from the command line, we find:\n",
    "\n",
    "```console\n",
    "$ pytest test_fib.py\n",
    "...\n",
    "    def fib_inner(x):\n",
    "        nonlocal cache\n",
    "        if x in cache:\n",
    "            return cache[x]\n",
    ">       if x == 0:\n",
    "E       RecursionError: maximum recursion depth exceeded in comparison\n",
    "\n",
    "../src/fib.py:7: RecursionError\n",
    "============================= short test summary info ==========================\n",
    "FAILED test_fib.py::test_raises - RecursionError: maximum recursion depth exceed\n",
    "=========================== 1 failed, 2 passed in 1.18s ========================\n",
    "```\n",
    "\n",
    "Notice how informative the output of pytest is compared to our homegrown test suite. `pytest` informs us that two of our tests passed---`test_typical` and `test_edge_case`---while the last one failed. Calling our `fib` function with a negative argument or a non-integer argument will make the function call itself recursively with negative numbers - it never stops! Hence, Python eventually will generate a `RecursionError`. However, our tests are expecting a `NotImplementedError` instead! Our test correctly detected that the code has this odd behavior. We can fix it up like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ad49a4ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "def fib(x):\n",
    "    if x % 1 != 0 or x < 0:\n",
    "        raise NotImplementedError('fib only defined on non-negative integers.')\n",
    "    cache = {}\n",
    "    def fib_inner(x):\n",
    "        nonlocal cache\n",
    "        if x in cache:\n",
    "            return cache[x]\n",
    "        if x == 0:\n",
    "            return 0\n",
    "        elif x == 1:\n",
    "            return 1\n",
    "        else:\n",
    "            val = fib_inner(x - 1) + fib_inner(x - 2)\n",
    "            cache[x] = val\n",
    "            return val\n",
    "    return fib_inner(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "824fd28e",
   "metadata": {},
   "source": [
    "Now we can run tests again.\n",
    "\n",
    "```console\n",
    "$ pytest test_fib.py\n",
    "=============================== test session starts ============================\n",
    "platform linux -- Python 3.8.8, pytest-6.2.4, py-1.10.0, pluggy-0.13.1\n",
    "rootdir: /home/pmin/Documents/codebook\n",
    "plugins: anyio-3.1.0\n",
    "collected 3 items\n",
    "\n",
    "test_fib.py ...                                                           [100%]\n",
    "\n",
    "================================ 3 passed in 0.02s =============================\n",
    "```\n",
    "\n",
    "They pass!\n",
    "\n",
    "## Testing non-pure functions and classes\n",
    "\n",
    "I claimed earlier that _pure functions_ are the easiest to test. Let's see what we need to do to test non-pure functions. For a _nondeterministic_ function, you can usually give the random seed or random variables needed by the function as arguments, turning the nondeterministic function into a deterministic one. For a _stateful_ function, we need to additionally test that:\n",
    "\n",
    "- _Postconditions are met_, that is, the internal state of the function or object is changed in the expected way by the code\n",
    "\n",
    "Classes are stateful, so we'll need to inspect their state after calling methods on them to make sure they work as expected. For example, consider this Chronometer class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4d16e2af",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "class Chronometer:\n",
    "    def start(self):\n",
    "        self.t0 = time.time()\n",
    "\n",
    "    def stop(self):\n",
    "        return time.time() - self.t0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "192a2b53",
   "metadata": {},
   "source": [
    "We might want to check that the `t0` variable is indeed set by the `start` method.\n",
    "\n",
    "For a function with _I/O side effects_, we'll need to do a little extra work to verify that it works. We might need to create mock files to check whether inputs are read properly and outputs are as expected. `io.StringIO` and the `tempfile` module can help you create these mock objects. For instance, suppose we have a function `file_to_upper` that takes in an input and an output filename, and turns every letter into an uppercase:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9645df5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "def file_to_upper(in_file, out_file):\n",
    "    fout = open(out_file, 'w')\n",
    "    with open(in_file, 'r') as f:\n",
    "        for line in f:\n",
    "            fout.write(line.upper())\n",
    "    fout.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc5d7267",
   "metadata": {},
   "source": [
    "Writing a test for this is a little tortured:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "68a9daa9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "import os\n",
    "\n",
    "def test_upper():\n",
    "    in_file = tempfile.NamedTemporaryFile(delete=False, mode='w')\n",
    "    out_file = tempfile.NamedTemporaryFile(delete=False)\n",
    "    out_file.close()\n",
    "    in_file.write(\"test123\\nthetest\")\n",
    "    in_file.close()\n",
    "    file_to_upper(in_file.name, out_file.name)\n",
    "    with open(out_file.name, 'r') as f:\n",
    "        data = f.read()\n",
    "        assert data == \"TEST123\\nTHETEST\"\n",
    "    os.unlink(in_file.name)\n",
    "    os.unlink(out_file.name)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae8f1705",
   "metadata": {},
   "source": [
    "With remote calls and persistent storage, testing can rapidly become quite complex.\n",
    "\n",
    "## A hierarchy of tests\n",
    "\n",
    "We've been focused so far on _unit tests_. However, there are many different kinds of tests that people use.\n",
    "\n",
    "- _Static tests_: your editor parses and runs your code as you write it to figure out if it will crash\n",
    "- _Inline asserts_: test whether intermediate computations are as expected\n",
    "- _Unit tests_: test whether one function or unit of code works as expected\n",
    "- _Docstring tests_: unit tests embedded in docstrings\n",
    "- _Integration tests_: test whether multiple functions work correctly together\n",
    "- _Smoke tests_: test whether a large piece of code crashes at an intermediate stage\n",
    "- _Regression tests_: tests whether your code is producing the same outputs that it used to in previous versions\n",
    "- _End-to-end tests_: literally a robot clicking buttons to figure out if your application works as expected\n",
    "\n",
    "The point is not to overwhelm you with the possibilities, but to give you a glossary of testing so you know what to look for when you're ready to dig deeper.\n",
    "\n",
    "## Write lots of tiny unit tests\n",
    "\n",
    "My proposal to you is modest:\n",
    "\n",
    "1. Isolate numeric code.\n",
    "2. Make numeric functions pure if practical.\n",
    "3. Write tests for the numeric code\n",
    "4. Write tests for the critical IO code\n",
    "\n",
    "You're going to get a lot of bang for your buck by writing unit tests - inline asserts and regression tests are also high payoff-to-effort. Aim for each unit test to run in 1 ms. The faster each test runs, the better for your working memory. More than 5 seconds and you'll be tempted to check your phone.\n",
    "\n",
    "What do you think is the ideal ratio of test code to real code?\n",
    "\n",
    "```{dropdown} Spoilers\n",
    "There's no ideal number per say, but 1:1 to 3:1 is a commonly quoted range for library code. For one-off code, you can usually get away with less test coverage. For more down-to-earth applications, 80% test coverage is a common target. [You can use the `Coverage.py` package to figure out your test coverage](https://coverage.readthedocs.io/en/coverage-5.3.1/).\n",
    "```\n",
    "\n",
    "## Now you're playing with power\n",
    "\n",
    "Testing is the key to refactor with confidence. Let's say that your code looks ugly, and you feel like it's time to refactor.\n",
    "\n",
    "1. Lock in the current behavior of your code with regression tests\n",
    "1. Check that the tests pass\n",
    "1. Rewrite the code to be tidy\n",
    "1. Correct the code\n",
    "1. Iterate until tests pass again\n",
    "\n",
    "You can call `pytest` with a specific filename to run one test suite. For a larger refactor, you can run all the tests in the current directory with:\n",
    "\n",
    "```\n",
    "$ pytest .\n",
    "```\n",
    "\n",
    "If you want, you can even integrate this workflow into github by running tests every time you push a commit! This is what's called _continuous integration_. It's probably overkill for a small-scale project, but know that it exists.\n",
    "\n",
    "## Discussion\n",
    "\n",
    "Writing tests is not part of common scientific practice yet, but I think it deserves a higher place in scientific programming education.\n",
    "\n",
    "Testing allows you to decrease the uncertainty surface of your code. With the right tests, you can convince yourself that parts of your code are _correct_, and that allows you to concentrate your debugging efforts. Keeping that uncertainty out of your head saves your working memory, and debugging will be faster and more efficient. At the same time, code with tests is less stressful to refactor, so you will be able to continuously improve your code so that it doesn't slide towards an unmanageable mess of spaghetti.\n",
    "\n",
    "Testing is not an all-or-none proposition: you can start writing lightweight inline tests in your code today. Find a commented out `print` statement in your code. Can you figure out how to replace it with an `assert`?\n",
    "\n",
    "```{admonition} 5-minute exercise\n",
    "Find a commented out `print` statement in your code and transform it into an `assert`.\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "exports": [
   {
    "format": "tex",
    "logo": false,
    "output": "exports/testing.tex",
    "template": "../templates/plain_latex_book_chapter"
   }
  ],
  "jupytext": {
   "formats": "md:myst",
   "text_representation": {
    "extension": ".md",
    "format_name": "myst"
   }
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "source_map": [
   17,
   57,
   63,
   290,
   308,
   335,
   344,
   350,
   357,
   361,
   377
  ],
  "title": "Testing your code"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}