Python: Structuring your projects#
This chapter examines what makes a good project structure, i.e., the decisions you make concerning how your project best meets its objective. In practical terms, structure means writing clean code with clear logic and explicit dependencies, and organizing the files and folders sensibly in the filesystem.
This section looks closely at Python’s modules and import system, as they are the central elements for enforcing structure in your project. We then discuss various perspectives on building code that one can extend and test reliably.
Overall, there are four major elements that a repository needs to contain:
the source code,
some data (external files, sample data, etc.),
the various tests (unit tests) to ensure the code runs as expected,
and the documentation.
Source code#
In any language, a project contains several source code files organized into logical units or modules. These could be single files or a collection of them in different directories.
Thanks to how Python handles imports and modules, it is relatively easy to structure a Python project. There is only a limited set of constraints on how to structure a module in Python. Therefore, you can focus on the pure architectural task of crafting the different parts of your project and their interactions. Python modules are one of the main abstraction layers, and a natural one. Abstraction layers allow separating code into components, holding related data and functionality together.
For example, one component can handle reading and writing the data from/to a remote database, while another deals with the complex training of your neural network model. The most natural way to separate these two is to group all I/O functionality in one file and the training operations in another. The training file may then import the I/O file through import ...
or from ... import A, B, C
.
As soon as you use import
statements, you use modules. These can be either built-in modules (e.g., os
, sys
, math
), third-party modules you have installed in your environment (e.g., numpy
, astropy
, pandas
), or your project’s internal modules.
In Python, a module is a file or a folder containing Python definitions and statements. In the case of a file, the filename is the module name with the suffix .py
.
Tip
The Python style guide recommends keeping module names short and lowercase, and avoiding special symbols like the dot (.) or question mark (?).
Avoid a file name like my.super.module.py
which will interfere with the way Python looks for modules!
The import module mechanism#
Aside from some naming restrictions, Python does not require anything special for a Python file to be a module, but one needs to understand the import
mechanism to avoid some pitfalls. For a directory, there are some constraints that we will detail later.
The import mymodule
statement looks for a mymodule.py
file in the same directory as the caller. If it does not exist there, the Python interpreter will search the directories listed in the Python path (sys.path). Finally, it will raise an ImportError
exception if it cannot be located.
import mysupermodule # ModuleNotFoundError if py >= 3.6 (ImportError otherwise)
Note
Python provides built-in exceptions with fine-grained details.
During the import
step, Python will tell you whether there is an issue with the installed package (ImportError
) or whether the module is nowhere to be found (ModuleNotFoundError
). This distinction allows your code to handle the two cases differently.
Also note that ModuleNotFoundError
is derived from ImportError
.
When Python locates the file, the interpreter will execute it in an isolated scope (namespace). This execution includes any top-level statement (e.g. other imports) and function and class definitions. These are stored in the module’s dictionary.
import math
math.__dict__
Then, the module’s variables, functions, and classes will be available to the caller through the module’s namespace. (e.g., math.cos
)
Note
Within a module, the module’s name (as a string) is available as the value of the global variable __name__.
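This is what enables the common “main guard” pattern (a minimal sketch):

def main():
    print("running as a script")

if __name__ == "__main__":  # True only when the file is executed directly, not imported
    main()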
Warning
Avoid from A import *
. This is considered bad practice.
Using import *
makes the code hard to read and the dependencies less compartmentalized.
Let’s create a file called hello.py
in which we code the usual helloworld
function.
%%file hello.py
def helloworld():
print("Hello, world!")
Writing hello.py
If our current working directory contains this file, we can import hello
, and Python will automatically find it.
import hello
hello.helloworld()
Hello, world!
The Python Code Style section emphasizes that readability is one of the main features of Python, i.e., avoiding useless boilerplate text and clutter. Being able to tell immediately where a class or function comes from dramatically improves the readability and understandability of a project.
However, we do not want to work with many loose Python files or manually set the Python path, and we sometimes need a more complex organization.
Packages in Python#
Python provides a straightforward packaging system that extends the module mechanism to a directory.
Any directory with an __init__.py
file is considered a Python package. The package’s modules (Python files) obey the import rules mentioned before. Only the __init__.py
file has a particular behavior, as it gathers all package-wide definitions (e.g., special values, documentation): the Python interpreter always imports (executes) it with the package.
To import a file hello.py
in the directory examples/
, you need
import examples.hello
The interpreter will look for examples/__init__.py
and first execute its content. Then it will look for examples/hello.py
and execute its top-level statements. After these operations, any variable, function, or class is available in the examples.hello
namespace.
Complex projects may have sub-packages and sub-sub-packages in a deep directory structure.
As soon as __init__.py
exists in a sub-directory of your package, Python will see it as a subpackage:
packagename
├── __init__.py
├── foo.py
├── hello.py
| ...
└─── subpackage/
├── __init__.py
├── submodule.py
└─── ...
Importing a single item from a sub-sub-package will require executing all __init__.py
files down to the module.
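Based on the tree above, the following imports would all be valid (a sketch; each executes the relevant __init__.py files first):

import packagename.foo
from packagename import hello
from packagename.subpackage import submodule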
Leaving an __init__.py file (almost) empty is considered standard and even good practice if the package’s modules and sub-packages do not need to share any code.
Lastly, Python provides a convenient syntax for importing deeply nested packages: import very.deep.module as mod
. The as
keyword allows you to use aliases (here mod
) instead of the long chain of package names.
Structure of a Repository: code, documentation, tests#
Just as with code style, it is essential to have an organized repository structure. In this section, a repository means the root folder of your project.
Tip
Create a Virtual Environment
To keep your project’s dependencies isolated and reproducible, it’s a good idea to create a virtual environment.
You can use the venv
module built into Python; the environment will use the Python version of the interpreter you invoke, and you choose the environment’s name and location.
In a nutshell, virtual environments allow you to:
Maintain dependencies isolated. This avoids situations where different projects require different package versions and you have to globally uninstall/reinstall packages every time you switch projects.
Share your dependencies with other people.
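A minimal sketch of creating and using one (the environment name .venv is an arbitrary choice):

python -m venv .venv        # create the environment in the .venv directory
source .venv/bin/activate   # activate it (Linux/macOS)
pip install numpy           # packages now install into this environment only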
The source code#
Projects can be complex, with various sub-packages and other dependencies.
Good practice currently recommends storing your code in a src/packagename
directory.
With this practice you isolate all source code from other project files.
In astronomy, a code is most commonly a series of steps to analyze some data. Often, these codes become copy-paste templates, with duplicated pieces where variables may be tweaked locally by hand. These codes are hard to maintain and almost impossible for others to reuse. Notebooks, which have recently become very popular, somewhat shift these difficulties by mixing code and analysis as bookkeepers.
A good practice consists in at least three aspects:
A specific task should be a function (which could be split into smaller functions) that does a particular job and is called by other parts of the code.
Data files should be separate from functionalities.
Project input parameters should be separate from functionality and propagated as function arguments.
If one follows these three objectives, the code becomes much more straightforward to follow, maintain, and reuse. When tasks repeat across projects, this is a good indication that the associated functions would be best stored in a library or package.
Finally, a well-organized project often reduces to a single Python script calling a small number of functions given some input parameters, for instance (see the sketch after this list):
parse command line arguments
read datafile
call library to process data given arguments from user
produce output results as new files or plots
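A minimal sketch of such a driver script, with hypothetical module and function names:

import argparse
from mypackage import io, processing  # hypothetical library modules

def main():
    # 1. parse command line arguments
    parser = argparse.ArgumentParser(description="Process a data file")
    parser.add_argument("datafile")
    parser.add_argument("--threshold", type=float, default=1.0)
    args = parser.parse_args()
    # 2. read the data file
    data = io.read(args.datafile)
    # 3. call the library to process the data given the user's arguments
    result = processing.run(data, threshold=args.threshold)
    # 4. produce output results as new files or plots
    io.write(result, "output.csv")

if __name__ == "__main__":
    main()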
Good practice#
There are common best practices in coding in Python, in particular the PEP 8 style guide for Python code, which includes guidelines for formatting, naming conventions, and code structure. To make a few important practices explicit:
Use descriptive names for variables, functions, and classes that accurately convey their purpose. Avoid using a
or var1
and prefer mass_function_index
or layer1_opacity
.
Write modular code that is easy to read, test, and maintain. Use functions and classes to encapsulate code and abstract away implementation details. It is easier to debug code that fits on your screen. As soon as your code exceeds about 100 lines, you may want to write sub-functions.
Write clear and concise docstrings that explain what your code does and how to use it. Do not forget to make explicit the units of your input variables and their types.
Handle exceptions gracefully and provide informative error messages when something goes wrong. Avoid letting your code crash when you know it could happen at a particular stage. Instead, use
try-except
statements and provide helpful error messages (see the sketch after this list).
Write unit tests to verify the correctness of your code and catch regressions. It is very easy to introduce a typo by mistake; regression tests and unit tests prevent such errors from going undetected.
Avoid using global variables. They can have complex side effects and make your code harder to follow and test.
Optimize your code for readability rather than performance, unless performance is critical. It is easier to optimize code that works than to debug unreadable optimized code.
Do not hesitate to refactor your code to improve its readability, maintainability, and extensibility.
Use version control (such as
git
) to track changes to your code and collaborate with others.
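As a sketch of the exception-handling advice above, with a hypothetical function and file name:

def load_spectrum(filename):
    """Read a spectrum from a text file."""
    try:
        with open(filename) as f:
            return f.readlines()
    except FileNotFoundError:
        # fail with an informative, actionable message instead of a raw crash
        raise FileNotFoundError(
            f"Spectrum file '{filename}' not found; check the path or download the data first."
        )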
Warning
Some signs of a poorly structured project
Circular dependencies: if you have classes Star and Planet in two different modules (or files) but you need to import Planet to answer
star.get_planetary_system()
and similarly import Star to answer planet.which_star(), then you have a circular dependency. In this case you will have to resort to fragile hacks, such as using import statements inside your methods or functions.
Hidden coupling: if changes to a given class always break tests on a different class (e.g., changes in Planet break many tests on Star), this means the Star class relies heavily on knowing the details of Planet.
Heavy usage of global state or context: it is very tempting to avoid an explicit list of arguments to some functions and to instead rely on global variables that can be, and are, modified on the fly by different agents. This practice is very common in notebooks, for instance. It is a bad practice because you need to scrutinize all accesses to these global variables to understand which part changes what.
Spaghetti code: multiple pages of nested if clauses and for loops with a lot of copy-pasted procedural code and no proper segmentation are known as spaghetti code. Python’s meaningful indentation makes it very hard to maintain this kind of code.
Ravioli code: it consists of hundreds of similar little pieces of logic, often classes or objects, without proper structure. If you can never remember whether you have to use
List
,Tuple
,np.array
,pd.DataFrame
for the task at hand, then you might be swimming in ravioli code.
README
file#
The first piece of documentation a user may encounter is the README
file. The README
is a simple text file (.txt
), but most commonly for software projects, it is in a Markdown (.md
) or reStructuredText (.rst
) file, which lives in the root directory of the project.
A minimal Python README
file should include the following content:
Project title and a description
Installation instructions
Usage instructions or a link to examples and documentation
Credits and Acknowledgments
In addition, it is good to have the following:
Contributing Guidelines, which tell others how they can contribute to your project (these could also be in a separate file
CONTRIBUTING.md
).
Troubleshooting and support, which explains how you provide support (if any) for your code. This can also be handled directly by your repository hosting platform, e.g., GitHub issues.
Changelog, which lists the changes over time. It can be generated automatically by the version control system from the commit history and optionally stored in a separate file
CHANGELOG
.
Sometimes you may want to indicate the roadmap (or to-do list).
The
README
should also contain any links and related resources: e.g., documentation pages, related packages, papers to cite, etc.
Source Code Documentation#
This section covers the basics of how to create documentation.
Documenting your Python source code is important for making it more understandable, maintainable, and reusable. There are several different ways to document code, and one should use them all.
Tip
Code documentation primarily aims at documenting the design and purpose, not the internal mechanics.
It makes the interfaces and the reasoning explicit, not the implementations.
Documentation is like an academic paper: it contains relevant information and references (sometimes figures).
It is good to refactor code so it explains itself better: rather than 1,000 lines of complex code that need a long paragraph of explanation, it is better to reorganize them into smaller functions or elements with one-line explanations. This may not always be possible.
Consider the documentation as part of your software and keep it up to date. This increases the chances that you or anyone else will update the documentation with any code update.
Docstrings#
Docstrings are triple-quoted strings that appear as the first statement in a module, function, or class definition. Use them to document your functions, classes, and modules. They should provide a concise description of what the function, class, or module does, its inputs and outputs, and any other relevant information (exception, side effect, references, etc.).
The standard for docstrings in Python is defined in PEP 257 and PEP 287. The former defines the conventions for docstrings in Python, while the latter proposes reStructuredText as the markup format for docstrings.
The format of docstrings should follow the following conventions (in that order):
The first line should be a brief summary of what the function, module, or class does.
The second line should be blank.
The following lines should provide more detailed information about the function, module, or class.
the arguments: if the function, module, or class takes any arguments, they should be listed in the docstring.
the returned values: if the function, module, or class returns a value, the return value should be described in the docstring.
the raised exceptions: if the function, module, or class raises an exception, it should be described in the docstring.
Any additional notes, references, or related functions can be provided in the docstring.
When writing docstrings, it is important to keep in mind what other users need to know (or yourself, in six months’ time).
In practice, docstrings can be formatted in several ways depending on personal preference and specific use cases. Here are some common formatting styles, illustrated with a function that adds two numbers:
This style is used in scientific computing and data analysis projects, particularly those that use the NumPy library. It includes a one-line summary, a blank line, a detailed description, a “Parameters:” section listing the function’s arguments, and a “Returns:” section describing the function’s return value.
def add_numbers(a, b):
"""
Return the sum of two numbers.
Parameters
----------
a : int
The first number.
b : int
The second number.
Returns
-------
int
The sum of a and b.
Raises
------
ValueError: If a or b is not an integer.
"""
if not isinstance(a, int) or not isinstance(b, int):
raise ValueError("Both inputs must be integers.")
return a + b
This style was popularized by Google and is widely used in many Python projects. It includes an “Args:” section listing the function’s arguments and a “Returns:” section describing the function’s return value.
def add_numbers(a, b):
"""
Return the sum of two numbers.
Args:
a (int): The first number.
b (int): The second number.
Returns:
int: The sum of a and b.
Raises:
ValueError: if a or b is not an integer.
"""
if not isinstance(a, int) or not isinstance(b, int):
raise ValueError("Both inputs must be integers.")
return a + b
Sphinx is a popular documentation tool for Python and other programming languages. It uses a specific style for docstrings, which is similar to the others, but offers a richer content with examples and notes.
def add_numbers(a, b):
"""
Return the sum of two numbers.
:param a: The first number.
:type a: int
:param b: The second number.
:type b: int
:return: The sum of a and b.
:rtype: int
:raises ValueError: If a or b is not an integer.
Example::
>>> add_numbers(2, 3)
5
"""
if not isinstance(a, int) or not isinstance(b, int):
raise ValueError("Both inputs must be integers.")
return a + b
Sphinx can handle the other two styles via additional extensions (e.g., sphinx.ext.napoleon).
Important
Python uses reStructuredText formatting for docstrings.
By following the standard for docstrings in Python, you can make your code more readable, maintainable, and easier to use for other developers who may use or modify your code in the future. But additionally, you can generate project documentation automatically, see for example the Matplotlib or Astropy project.
Tip
Many text editors will come with an extension that helps you generate docstrings. For example with VSCode you can use autoDocstring.
Docstrings for the users#
Unlike block comments, docstrings are built into the Python language itself. Python provides powerful introspection capabilities to access docstrings at runtime. Python stores docstrings in the __doc__
attribute of modules, functions, and classes, which a user can access directly or through the help()
function.
import os
help(os.mkdir)
Help on built-in function mkdir in module posix:
mkdir(path, mode=511, *, dir_fd=None)
Create a directory.
If dir_fd is not None, it should be a file descriptor open to a directory,
and path should be relative; path will then be relative to that directory.
dir_fd may not be implemented on your platform.
If it is unavailable, using it will raise a NotImplementedError.
The mode argument is ignored on Windows.
More generally, the Python inspect
module provides powerful tools to look into existing modules, and it can even retrieve and analyze the source code.
import inspect
import csv
print(inspect.getsource(csv)[:300], end='\n\n----\n\n') # it's a string!
print(inspect.getdoc(csv.reader), end='\n\n----\n\n')
print(inspect.isfunction(csv.reader))
"""
csv.py - read/write/investigate CSV files
"""
import re
from _csv import Error, __version__, writer, reader, register_dialect, \
unregister_dialect, get_dialect, list_dialects, \
field_size_limit, \
QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONNUMERIC,
----
csv_reader = reader(iterable [, dialect='excel']
[optional keyword args])
for row in csv_reader:
process(row)
The "iterable" argument can be any object that returns a line
of input for each iteration, such as a file object or a list. The
optional "dialect" parameter is discussed below. The function
also accepts optional keyword arguments which override settings
provided by the dialect.
The returned object is an iterator. Each iteration returns a row
of the CSV file (which can span multiple input lines).
----
False
Regression Testing and unit tests#
Python unit testing is the process of testing individual components, aka units, of a software system in isolation to ensure they function as expected. A unit can be a function, a method, a class, or a module. A unit test can also check the entire end-to-end behavior of a code.
Tests are an absolutely critical element of a good software project. You could write the most beautiful code and yet find it irreparably broken a year later. There are multiple advantages to automating testing, as long as you write the tests in parallel with the development process. It allows you to detect issues early in the development and to change the source code safely, since any change or update to the code may otherwise break its correctness. This also helps in collaborative development projects where multiple authors change the code in various places.
By testing the separate components, aka the smallest logical units of your code, you can make your code robust and pinpoint more easily where an issue occurs.
Here are some key concepts to keep in mind when using pytest for unit testing in Python:
Test Functions are functions that use the pytest syntax to test individual units of code. These functions must start with
test_
and can use assert statements to check that the results match expectations.
Fixtures are functions that provide test data to test functions. They can be used to set up preconditions or prepare test data for the tests to be run.
Test Classes are classes that use the pytest syntax to group together related test functions. They can be used to set up common fixtures or test data for the test functions within the class.
Test Discovery is the mechanism of pytest to automatically discover and run tests in files with names starting with
test_
or in directories named tests
. It can also discover tests in other files or directories specified on the command line.
Test Driven Development (TDD) is the process by which each new functionality of your application is first written as a unit test before the behavior is fully implemented. TDD may seem like developing applications backwards, but it has strong benefits: it ensures that you won’t overlook unit testing for some feature, and it helps you to focus first on the tasks a certain function should achieve, and only afterwards on how to implement them.
Tip
Check your main assumptions with assertions (e.g., input values or types).
Use an off-the-shelf unit testing library (e.g. pytest), they are powerful and robust.
Turn bugs into test cases. Bugs are detected because of unexpected ways of using the code. Turning these into tests makes sure your bugs remain fixed.
Use a symbolic debugger (e.g.
pdb
) to explore and understand the code when chasing a bug.
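For instance (a minimal sketch), the built-in breakpoint() function (Python >= 3.7) drops you into pdb at that exact line, where you can inspect variables interactively:

def buggy_mean(values):
    breakpoint()  # execution pauses here and opens the pdb prompt
    return sum(values) / len(values)  # inspect 'values' at the prompt before this runs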
Using pytest
#
Pytest is a popular Python testing framework that makes it easy to write and run tests. It provides a simple syntax and a range of powerful features that make writing and running tests efficient and effective.
A first example is the following (note that the test below fails on purpose):
%%file test_sample.py
# content of test_sample.py
def inc(x):
""" Increment by 1 """
return x + 1
def test_answer():
assert inc(3) == 5
Writing test_sample.py
!pytest
============================= test session starts ==============================
platform linux -- Python 3.9.19, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/runner/work/astro_ds/astro_ds/astro_ds/chapters/python
collecting ...
collected 1 item
test_sample.py F [100%]
=================================== FAILURES ===================================
_________________________________ test_answer __________________________________
def test_answer():
> assert inc(3) == 5
E assert 4 == 5
E + where 4 = inc(3)
test_sample.py:9: AssertionError
=========================== short test summary info ============================
FAILED test_sample.py::test_answer - assert 4 == 5
+ where 4 = inc(3)
============================== 1 failed in 0.07s ===============================
The [100%]
refers to the overall progress of running all test cases. After it finishes, pytest shows a failure report because inc(3)
does not return 5
.
Test your code against your assumptions to make sure it works as expected. Also test what happens if inputs do not look as expected. Use assert
and raise
exceptions in your main code to finely report on issues. You can then test your code against specific error types:
%%file test_sample.py
import pytest
def f():
""" Dummy function raising SystemExit Exception"""
raise SystemExit(1)
def test_myexception():
with pytest.raises(SystemExit):
f()
!pytest
============================= test session starts ==============================
platform linux -- Python 3.9.19, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/runner/work/astro_ds/astro_ds/astro_ds/chapters/python
collecting ...
collected 1 item
test_sample.py . [100%]
============================== 1 passed in 0.01s ===============================
The above test concludes on passed
, which means the behavior is consistent with what we expected, i.e. the correct Exception was raised.
Grouping tests in classes can be beneficial for the following reasons:
Test organization,
Sharing fixtures for tests only in that particular class,
Applying marks at the class level and having them implicitly apply to all tests.
An example below:
%%file test_sample.py
class TestClassDemoInstance:
value = 0
def test_one(self):
self.value = 1
assert self.value == 1
def test_two(self):
assert self.value == 1
!pytest
============================= test session starts ==============================
platform linux -- Python 3.9.19, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/runner/work/astro_ds/astro_ds/astro_ds/chapters/python
collecting ...
collected 2 items
test_sample.py .
F [100%]
=================================== FAILURES ===================================
________________________ TestClassDemoInstance.test_two ________________________
self = <test_sample.TestClassDemoInstance object at 0x7faae94eb310>
def test_two(self):
> assert self.value == 1
E assert 0 == 1
E + where 0 = <test_sample.TestClassDemoInstance object at 0x7faae94eb310>.value
test_sample.py:9: AssertionError
=========================== short test summary info ============================
FAILED test_sample.py::TestClassDemoInstance::test_two - assert 0 == 1
+ where 0 = <test_sample.TestClassDemoInstance object at 0x7faae94eb310>.value
========================= 1 failed, 1 passed in 0.07s ==========================
Warning
When grouping tests inside classes, each test has a unique instance of that class.
Having each test share the same class instance would be very detrimental to test isolation and would promote poor test practices.
To share values between tests, you can use class attributes.
Comparing the output values of some components is often a critical part of testing. However, beware that machine precision does not guarantee that you always get the same output or the same rounding effects. Especially when comparing float values, one needs to specify the precision of the comparison using pytest.approx
:
%%file test_sample.py
import pytest
def test_one():
assert 2.2 == pytest.approx(2.3)
# fails, default is ± 2.3e-06
def test_two():
assert 2.2 == pytest.approx(2.3, 0.1)
# passes
def test_three():
# also works the other way, in case you were worried:
assert pytest.approx(2.3, 0.1) == 2.2
# passes
!pytest
============================= test session starts ==============================
platform linux -- Python 3.9.19, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/runner/work/astro_ds/astro_ds/astro_ds/chapters/python
collecting ...
collected 3 items
test_sample.py
F.. [100%]
=================================== FAILURES ===================================
___________________________________ test_one ___________________________________
def test_one():
> assert 2.2 == pytest.approx(2.3)
E assert 2.2 == 2.3 ± 2.3e-06
E
E comparison failed
E Obtained: 2.2
E Expected: 2.3 ± 2.3e-06
test_sample.py:4: AssertionError
=========================== short test summary info ============================
FAILED test_sample.py::test_one - assert 2.2 == 2.3 ± 2.3e-06
comparison failed
Obtained: 2.2
Expected: 2.3 ± 2.3e-06
========================= 1 failed, 2 passed in 0.07s ==========================
Sometimes, tests require input data, for example a data sample that one needs to read in. At a basic level, test functions request fixtures by declaring them as arguments.
When pytest goes to run a test, it looks at the parameters in that test function’s signature, and then searches for fixtures that have the same names as those parameters. Once pytest finds them, it runs those fixtures, captures what they returned (if anything), and passes those objects into the test function as arguments.
For example
%%file test_sample.py
import pytest
class Fruit:
""" Simple fruit ingredient """
def __init__(self, name):
self.name = name
self.sliced = False
def slice(self):
self.sliced = True
class FruitSalad:
""" Combines Fruit objects """
def __init__(self, *fruit_bowl):
self.fruit = fruit_bowl
for fruit in self.fruit:
fruit.slice()
@pytest.fixture
def fruit_bowl_sample():
""" Gives some ingredients """
return [Fruit("apple"), Fruit("banana")]
def test_fruit_salad(fruit_bowl_sample):
# Act
fruit_salad = FruitSalad(*fruit_bowl_sample)
# Assert all are sliced
assert all(fruit.sliced for fruit in fruit_salad.fruit)
!pytest
============================= test session starts ==============================
platform linux -- Python 3.9.19, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/runner/work/astro_ds/astro_ds/astro_ds/chapters/python
collecting ...
collected 1 item
test_sample.py . [100%]
============================== 1 passed in 0.01s ===============================
In this example, test_fruit_salad
requests fruit_bowl_sample
. Pytest detects this request, executes the fruit_bowl_sample
fixture function (caching the result), and passes the returned object to test_fruit_salad
as an argument.
This fixture mechanism is very powerful and flexible. It allows us to boil down complex requirements for tests into simpler, more elegant, and organized functions, where each one only needs to declare the things it depends on.
One of the things that makes pytest’s fixture system so powerful is that it gives us the ability to define a generic setup step that can be reused over and over, just like a normal function. Two different tests can request the same fixture and have pytest give each test its own result from that fixture.
This is extremely useful for making sure tests aren’t affected by each other. We can use this system to make sure each test gets its own fresh batch of data and starts from a clean state, so it can provide consistent, repeatable results. A fixture can also be requested more than once during the same test; pytest won’t execute it again but will return the cached result instead.
Tip
Best Practices
Write tests for parts that have the fewest dependencies on external resources first, and work your way up.
Write tests as logically simple as possible.
Each unit test should be independent of all other tests (even if they use similar pieces of the main code).
Unit tests are part of the code, they should have clear names and be well documented (i.e., docstrings).
All methods should have appropriate Python unit tests regardless of their visibility for the user.
Create unit tests that cover exceptions.
Write tests concurrently with the code, and do not “leave tests for later” (otherwise they tend never to be written).
Never compare numerical values without explicitly adding a precision requirement (e.g.,
pytest.approx
)
A word about the LICENSE
#
Aside from the source code, the License is arguably the most critical part of your repository. The full license text and copyright claims should exist in this file.
Software licenses are legally binding agreements between the software provider(s) and the user(s). These licenses define the rights and responsibilities of all parties, including the scope and conditions of (re-)use of the software or its source code, the limitations of liability, and warranties and disclaimers. It is important to understand the terms of a software license before using the software, as failing to comply with the license can have legal consequences.
There are roughly five types of licenses, from the most open to the least:
Public domain. Copyright does not apply to works in the public domain. Anyone can modify and use such software without any restrictions. But actual public-domain code is rare, and the definition varies between jurisdictions. Examples include the CERN httpd in 1993 (discontinued), SQLite, and SHA-3 (Secure Hash Algorithm 3). The Openwall Project maintains a list of several algorithms and their source code in the public domain.
Permissive. Permissive licenses contain minimal restrictions on how the software can be modified or redistributed. They typically only require that the distribution of the software retains the copyright information in a notices file. This category of software license is the most popular open source license type. The best-known examples include the Apache License, the BSD License, and MIT License.
Weak copyleft. The GNU Lesser General Public License (LGPL) is designed to allow linking to open-source libraries with little obligation. If software dynamically links LGPL libraries, the entire work is free to be distributed under any license, even a proprietary one, with minimal requirements. It is a bit more complicated for static linking and/or modifying the libraries. The LGPL makes the copyleft obligations less restrictive (see below). Other examples include the MPL, CDDL, and the Eclipse Public License.
Copyleft. Also known as reciprocal or restrictive licenses. The most well-known and frequently used is the General Public License (GPL) family of licenses. These allow developers to modify the licensed code, incorporate it with proprietary code, and distribute new works based on it. The condition is that any new work keeps the same software license, which sometimes makes the use of libraries under other licenses difficult. The catch is that these licenses require the distribution of the source code along with the new work. Examples include the Linux kernel, the GNU Compiler Collection (GCC), the GNU Bash shell, and the GNU Core Utilities.
Commercial or proprietary. These vary a lot but tend to be the most restrictive, generally used for commercial software where the copyright holder expresses strict conditions. For instance, they forbid the code to be shared, reverse-engineered, modified, redistributed, or sold.
Unlicensed code: code that doesn’t have an explicit license is not de facto in the public domain; it is the extreme opposite. By default, nobody can use software without a license without risking a violation of copyright laws.
An extensive list of licenses can be found on Wikipedia’s Comparison of free and open-source software licenses. If you aren’t sure which license to use for your project, check out choosealicense.com.
For astronomy projects, permissive licensing is the most appropriate. We recommend the BSD-3-Clause license, which allows private and commercial use, modification, and redistribution of the code with the preservation of the copyright holder’s credits, while disclaiming your liability if the code does not work as expected. It is a permissive license similar to the BSD 2-Clause License, but with a third clause prohibiting others from using the copyright holder’s name or its contributors to promote derived products without written consent.
Of course, you can publish code without a license, but this would prevent many people from potentially using or contributing to your code.
Note
It is good practice to use the MANIFEST.in
file to ensure that your license always remains with your work after installation.
Warning
Do not invent your own license.
Licenses have tremendous legal details and complex wording. The LICENSE
file usually corresponds to a simplified summary that links to the more complete definitions.
MANIFEST.in
file#
The MANIFEST.in
file is a plain-text file that specifies additional files to include in a source distribution or installation: for instance, ensuring that your license propagates with the code, or that information about the authors/contributors remains. The MANIFEST.in
file also specifies which files should be excluded.
Typical content of a Python project’s MANIFEST.in
file:
include README.md LICENSE
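A slightly fuller sketch (the data path is hypothetical) using other common directives:

include README.md LICENSE CITATION.cff
recursive-include src/example_package/data *.txt *.fits
global-exclude *.pyc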
More information in the Python documentation 4.1 Specifying the files to distribute.
Make your code pip
installable (packaging)#
Making your Python code pip installable means that anyone can run pip install .
from the source directory or pip install <packagename>
if published on PyPI to install your code.
Adapting a project to be installable is a relatively straightforward process. We will come back to PyPI later (see the section below).
The project root directory should contain the setup information, such as the project’s name, version, license, and list of dependencies.
Let’s consider a package project with a LICENSE
, README.md
, and a small example code as follows:
packaging_tutorial/
├── LICENSE
├── README.md
└─── src/
└── example_package/
├── __init__.py
└── example.py
The __init__.py
is empty but required to import the directory as a package. The example.py
is a dummy module, for example:
def add_one(number):
return number + 1
Note
There are various ways to package your project in python (using only setup.py
, or setup.cfg
, or pyproject.toml
).
Starting with PEP 621 (June 2020), the Python community selected pyproject.toml
as a
standard way of specifying project metadata. Setuptools has adopted this standard and will use the information contained in this file as an input in the build process.
There are still cases where we need to have setup.py
, particularly when we need to compile code for the package.
However, setup.cfg
tends to be a thing of the past, a previous step in Python’s packaging evolution. We will therefore focus on pyproject.toml.
pyproject.toml
#
pyproject.toml
tells frontend build tools like pip
which backend tool to use and its procedure to create distribution packages for your project. You can choose from a number of backends; this tutorial uses setuptools by default, but it will work identically with Hatchling, Flit, and others that support the [project] table for metadata.
Note
Some build backends are part of larger tools that provide a command-line interface with additional features like project initialization and version management, as well as building, uploading, and installing packages.
Select a build-system backend#
Let’s create a pyproject.toml
and set up a build-system
(TOML) table (identified by the
[table-header]
syntax):
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[build-system]
requires = ["flit_core>=3.4"]
build-backend = "flit_core.buildapi"
The requires
statement lists packages that are needed during the build phase of your package. You don’t need to install them; build frontends like pip
will install them automatically in a temporary, isolated virtual environment for use during the build process. These are not your package’s runtime dependencies (e.g., NumPy).
The build-backend
is the name of the Python object that the frontends will use to perform the build task.
You can read details on the Python online documentation and the setuptools tutorial.
Important
If you need compatibility with legacy builds or with versions of tools that don’t support certain packaging standards, a simple setup.py
script can be added to your project to create a compatibility layer:
from setuptools import setup
setup()
Define the Project metadata#
Similar to the README
file, the Python packaging machinery needs to know about the project. You need at the very least:
name
, which is the distribution name of your package.
version
, which sets the package version following PEP 440, i.e., commonly a string with three segments x.y.z, where x corresponds to the major version, y to minor increments, and z to patch or intermediate release versions (for instance a git tag).
authors
, which identifies the authors of the package; you specify a name and an email for each author. You can also list maintainers
in the same format.
description
, which is a short, one-sentence summary of the package.
readme
, to set a path to a file containing a detailed description of the package (the README file).
requires-python
, to give the versions of Python supported by your project. Installers like pip
can handle looking back through older versions of packages until they find one that matches the Python version.
classifiers
, which gives the index and pip some additional metadata about your package. You should always include at least which version(s) of Python are supported, the license, and which operating systems. PyPI provides a detailed list of classifiers.
dependencies
, which lists the packages your project requires to function.
[project]
name = "example_package"
version = "0.0.1"
authors = [
{name="Example Author", email="john.doe@whereiam.com" },
]
description = "A small example package"
readme = "README.md"
requires-python = ">=3.7"
keywords = ["one", "two"]
license = {text = "BSD-3-Clause"}
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: BSD License",
"Operating System :: OS Independent",
]
dependencies = [
"requests",
'importlib-metadata; python_version<"3.8"',
]
Optionally, you can provide keywords
, which can improve the discoverability of your package, and urls
pointing to the code repository and documentation. Packaging.python.org provides a complete list of project metadata fields you can use in your pyproject.toml
file.
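For instance, a hypothetical urls table could look like:

[project.urls]
Homepage = "https://github.com/username/example_package"
Documentation = "https://example-package.readthedocs.io"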
Define the code source information#
The setuptools backend will attempt to auto-discover your project, which works for simple projects, but it is recommended to set a few pieces of information explicitly, as shown below:
[tool.setuptools.packages.find]
where = ["src"] # list of folders that contain the packages (["."] by default)
include = ["example_package*"] # package names should match these glob patterns (["*"] by default)
exclude = ["example_package.tests*"] # exclude packages matching these glob patterns (empty by default)
namespaces = true # scan PEP 420 namespace packages (true by default; set to false to disable)
Note that we can use glob patterns in the configuration as in the example above. This means that if you specify exclude = ["tests"]
, modules like tests.example.test1
will remain part of the distribution (unless you use tests*
).
Dynamic Metadata#
Sometimes it can be cumbersome to track metadata changes when specified at multiple places. The most common piece is the version
field of your package.
Setuptools provides mechanisms to set some fields as dynamic values, which are computed during the build by either setuptools itself or the plugins installed via build-system.requires
(e.g. setuptools-scm
is capable of deriving the current project version directly from the git version control system).
Currently the following fields can be listed as dynamic: version
, classifiers
, description
, entry-points
, scripts
, gui-scripts
, and readme
. You need to create a [tool.setuptools.dynamic]
table. For example:
# ...
[project]
name = "example_package"
dynamic = ["version", "readme"]
# ...
[tool.setuptools.dynamic]
version = {attr = "example_package.VERSION"}
readme = {file = ["README.md", "CONTRIBUTING.md"]}
Optional dependencies#
It is very common for Python packages to provide a set of required dependencies and some optional dependencies that depend on specific goals. For instance, you may need specific dependencies to run your test suite or to build the documentation.
[project]
#...
dependencies = [
"requests",
'importlib-metadata; python_version<"3.8"',
'numpy',
]
[project.optional-dependencies]
testing = [
"pytest",
"pytest-doctestplus",
"flake8",
"codecov",
"pytest-cov"]
docs = [
"sphinx",
"sphinx-automodapi",
"numpydoc"]
In the example above, pip install .[docs]
or pip install example_package[docs]
will result in installing the base dependencies (including numpy
) plus the additional packages listed in the docs
list, but not those in testing
.
Command line scripts#
It can happen that a Python project provides a command-line script (e.g. glueviz
, tqdm
). Setuptools can install those scripts in the correct locations. All it needs is a definition of what the scripts are, for instance:
[project.scripts]
my-script = "example_package.module:function"
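The target of such an entry point is a plain Python function taking no required arguments; a minimal sketch of example_package/module.py matching the declaration above:

def function():
    """Entry point executed when the user runs 'my-script' on the command line."""
    print("Hello from example_package!")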
Cython or compiled submodules: setup.py
#
This is the case where an explicit setup.py
is needed: it is currently the only option to declare extension modules. pyproject.toml
and PEP 518 never aimed to replace it; they complement each other.
First, you need to update the building requirements in your project configuration as follows (for Cython):
[build-system]
requires = ["setuptools>=45",
"setuptools-scm>=6.2",
# Cython extensions
"wheel",
"extension-helpers",
"Cython"
]
build-backend = "setuptools.build_meta"
And then assuming you have a Cython source code in src/example_package/include/hello_world.pyx
, the corresponding setup.py
becomes:
from setuptools import setup, Extension  # setuptools provides Extension (distutils is deprecated)
def get_extensions():
return [
Extension(name='example_package.compiled_hello_world',
sources=['src/example_package/include/hello_world.pyx'])
]
setup(ext_modules=get_extensions())
Where do I store my data?#
You should distinguish between application data and the data your code needs to run properly. Only the latter should be part of your software repository.
Software data comprises configuration files, libraries, and other packages. However, keep in mind that version control does not usually work well (by default) with binary data.
Packaged code should generally avoid fixed data files. Before you include data files (i.e., non-code files) in your projects, consider the following alternatives:
Is there an (authoritative) copy of this data elsewhere on the internet?
Is this data accessible via an API or a client library?
Do I need this whole file, or will a small piece of it do?
Can I create an effective test with simulated data?
If you need minimal test data, maybe you can generate it or obtain it through an API.
Eventually, it happens that you really need to include data with your package. Good practice recommends storing large datasets outside the software repository, for instance on ftp
or http
services, such as Zenodo.
We will see below how to copy data with our code, when necessary, during the installation steps. Not all files are necessary at the installation stage; your code could transfer them at runtime when needed.
Note
Zenodo is a free public platform that can store up to 50 GB of data per project, keeps detailed metadata (e.g., authors, projects, versions), and even provides automatically registered DOIs.
Copy files at runtime#
For resources which are available on the internet (e.g. archives, databases, observatory websites), you can use client libraries and either have your users manually download the files or make your code download them.
Avoid system-dependent commands, such as wget
, curl
, etc.
Instead, use the built-in (system-independent) Python urllib
module or packages built on it.
A quick example of urllib
-based download code
import urllib.request

url = "https://example.com/data/catalog.csv"  # hypothetical source URL
urllib.request.urlretrieve(url, "catalog.csv")
Specify the data to copy at code installation#
If you need to include data files with your code, these should be stored in a data
(or a similarly explicit name) directory inside your package source code tree: e.g. src/package/data
.
By default, pyproject.toml
assumes that you will want to include data (if any) with the distribution of your package. However, you might want to be specific about which files to include (maybe not all files). You need to specify in the pyproject.toml
where/what data files need to be distributed:
[tool.setuptools.package-data]
# Examples
# "*" = ["*.txt"]
example_package = ["*.txt", "*.rst"]
example_package.data = ["*.txt", "*.rst"] # for subfolder of example_package
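At runtime, the packaged data can then be accessed independently of where the package was installed, for instance with importlib.resources (a sketch assuming the hypothetical example_package layout with a data/table.txt file):

from importlib import resources

# locate a data file shipped inside the installed package (Python >= 3.9)
data_file = resources.files("example_package") / "data" / "table.txt"
content = data_file.read_text()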
Collaborative development and specific files#
If your project is part of a collaborative development effort (team, public), we strongly encourage you to provide a few more guidelines in your repository.
Collaborative development is the process by which teams of developers work together on a common project. It is very similar to writing a paper with your colleagues, where authors may contribute to the manuscript to various degrees. Collaboration encourages open communication, which leads to the development of innovative solutions and the generation of new research.
Collaborative development allows for increased efficiency, knowledge sharing, and problem solving. It also allows teams to quickly identify potential problems and create solutions that are practical and effective. Finally, it helps to foster a sense of shared responsibility and trust between team members, which is essential for the successful completion of projects.
This development process extends significantly when the project is publicly open source. If you publish your code on GitHub, for instance, anyone can contribute to the project and bring their expertise. Whether or not your project is public, we detail below some specific guidelines and files that help collaborative project development.
CITATION.cff
#
Like any scientific paper, software represents work that must be credited properly. Journals provide guidelines for citing software (e.g., the AAS guidelines).
This requires a proper reference to your project. As part of your README
file, you must indicate how users should refer to your project. One advanced manner is to provide a CITATION.cff
file, which is heavily supported by GitHub. This is a YAML (.yml
) file containing basic elements such as the project title, released version, URL, and authors (see the file format and field details).
An example of CITATION.cff
file below:
cff-version: 1.2.0
title: simple-website
message: If you use this software, please cite it using the metadata from this file.
date-released: 2022-04-20
url: 'https://github.com/mfouesneau/simple-website'
version: 0.1
type: software
authors:
- family-names: Fouesneau
given-names: Morgan
orcid: 'https://orcid.org/0000-0001-9256-5516'
Note
You can also use online generators if you do not want to manually define this file.
For example cffinit
.
CODE_OF_CONDUCT.md
#
As soon as you work in a team, you must have some implicit or explicit agreement on how individuals should behave in the group and respect each other. It is the same for collaborative code development.
You do not have to have a lengthy code of conduct, but be explicit. It is recommended to give concrete examples of what you find positive or negative behavior.
A common practice is to write a CODE_OF_CONDUCT.md
or CODE_OF_CONDUCT.rst
file at the root of your repository. You can check CODE_OF_CONDUCT.md
in this repository for an example.
Tip
Provide explicit consequences of negative behavior and how to report them.
CONTRIBUTING.md
#
You can provide explicit guidelines on how collaborators can contribute to your project. For instance, you may not want them to edit your code directly but provide patches only, or maybe you want a specific review process.
The recommended good practice is to detail the various aspects of contributions in a CONTRIBUTING.md
or CONTRIBUTING.rst
file at the root of your project. You can see our file for an example.
SECURITY.md
(optional)#
For some projects, it may be important to have secure applications. Finding vulnerabilities may be part of a normal contribution. It is, however, recommended to be very explicit about how you expect feedback on security. GitHub now recommends a SECURITY.md
or SECURITY.rst
file at the root of your project.
Tip
The security file can be as simple as “please open an issue for any identified vulnerability”.
Sometimes you do not want these reports to be public, and you may prefer a private communication channel.
Publish your project on PyPI#
PyPI (Python Package Index) is a central repository that hosts open-source Python packages and allows developers to easily download, install, and use them in their own projects.
PyPI is managed by the Python Software Foundation and provides a standardized interface for package management. Developers can upload their Python packages to PyPI, and users can search for packages based on keywords or categories, download them, and install them with tools like pip
.
PyPI has become a critical component of the Python ecosystem, as it provides an easy way for developers to share their code and collaborate with others. It has also made it easier for users to discover and use third-party packages in their own projects, reducing the need to reinvent the wheel and enabling faster development cycles.
Publishing a Python package on PyPI allows other developers to easily install and use your package in their own projects.
Here are the steps to publish a Python project on PyPI:
Create a source distribution of your Python project (in your project root directory):
python setup.py sdist
This will create a
.tar.gz
file in a dist/
directory.
Create a wheel distribution of your Python project:
python setup.py bdist_wheel
This will create a
.whl
file in the dist/
directory.
Upload your distribution with
twine
:
You may first need to install
twine
pip install twine
Upload your distribution files to PyPI using the twine command:
twine upload dist/*
This command will prompt you to enter your PyPI username and password. If you don’t have an account, you’ll need to create one on the PyPI website.
Your package is now published on PyPI and can be installed by other users using
pip
.
Note
It may take a few minutes for your package to appear on PyPI and be usable with
pip
after you’ve first published it.
PyPI will make sure the name of your package is unique and that the metadata (
pyproject.toml
) is correct.
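Note that for a pyproject.toml-based project, the modern replacement for the two python setup.py calls above is the build frontend, which produces both the source and wheel distributions in dist/:

pip install build
python -m build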
Compiled documentation with Sphinx#
As noted above, following the docstring standards makes your code more readable and maintainable, and additionally lets you generate project documentation automatically; see for example the Matplotlib or Astropy projects.
One of the standard documentation tools for Python is Sphinx. It offers rich content in addition to extracting the documentation from the source code. You can pip
install Sphinx with
pip install sphinx
The documentation is commonly in a docs/
directory at the root of your project.
Sphinx uses reStructuredText by default, which is just different enough from Markdown to be confusing.
The documentation may have its own set of dependencies (at least sphinx), which should be declared in your pyproject.toml
, for example:
[project.optional-dependencies]
...
docs = [
"sphinx",
"sphinx-automodapi",
"numpydoc" # for using the numpy docstring conventions
]
To install these specific dependencies, you can then use the pip command:
pip install ".[docs]"
Sphinx comes with a sphinx-quickstart
script that sets up a source directory and creates a default conf.py
configuration file with the most useful configuration values from a few questions it asks you. To use this, run:
sphinx-quickstart
sphinx-quickstart
also creates a Makefile
and a make.bat
which make life even easier for you. These can be executed by running make
with the name of the builder. For example
make html
This will build HTML docs in the build directory you chose (default /docs/_build/html
).
Summary of typical Python project structure#
A typical Python project contains the following:
project/
├── docs/
| ├── make.bat
| ├── Makefile
| ├── conf.py
| └── index.rst
├── src/
| ├── packagename/
| | ├── __init__.py
| | └── example.py
| └── necessarydata/
├── tests/
├── CITATION.cff
├── CODE_OF_CONDUCT.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── SECURITY.md
├── pyproject.toml
└── setup.py
filename | Description
---|---
src/packagename/ | The actual project source code
src/necessarydata/ | data files necessary to run the code (limited)
tests/ | contains all the unit tests
docs/ | contains all the documentation
.gitignore | defines rules for your versioning workflow (here git)
LICENSE | defines rules on how your code can be used/modified/distributed
MANIFEST.in | sets files to be distributed with the source distribution (e.g. LICENSE, README)
README.md | at least contains a project name and a description
pyproject.toml | the file format specified by PEP 518, which contains the build system requirements of Python projects
setup.py | (not always) contains some specific configuration of the project that the toml file cannot handle (compiled code, etc.)
CITATION.cff | contains basic elements on how to cite your project
CODE_OF_CONDUCT.md | contains guidelines on how individuals should behave in the group and respect each other
SECURITY.md | contains how feedback on security should be provided
Example of pyproject.toml
:
[build-system]
requires = ["setuptools>=45",
"setuptools-scm>=6.2",
]
build-backend = "setuptools.build_meta"
[project]
name = "my_package"
description = "Python project template for Gitpod and VSCode"
readme = "README.md"
requires-python = ">=3.8"
keywords = ["template", "python", "gitpod", "vscode"]
license = {text = "BSD 3-Clause License"}
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: BSD 3-Clause License",
"Operating System :: OS Independent"
]
dependencies = [
"requests",
'importlib-metadata; python_version<"3.8"',
]
dynamic = ["version"]
[tool.setuptools_scm]
# write_to = "src/my_package/__version__.py"
[tool.setuptools.packages.find]
where = ["src"] # list of folders that contain the packages (["."] by default)
include = ["my_package*"] # package names should match these glob patterns (["*"] by default)
exclude = ["my_package.tests*"] # exclude packages matching these glob patterns (empty by default)
namespaces = true # scan PEP 420 namespace packages (true by default; set to false to disable)
[tool.setuptools.package-data]
# Examples
# "*" = ["*.txt"]
#my_package = ["*.txt", "*.rst"]
my_package.data = ["*.txt", "*.rst"] # for subfolder of my_package
[project.optional-dependencies]
testing = [
"pytest",
"pytest-doctestplus",
"flake8",
"codecov",
"pytest-cov"]
ci = [
"toml",
"yapf"]
docs = [
"sphinx",
"sphinx-automodapi",
"numpydoc"]
!rm -f sample_script.py test_sample.py hello.py
Comments#
Comments should be used sparingly, and only for complex or non-obvious code (which should rarely be non-obvious). Comments in source code are only meant for the developer(s); they should never contain critical functional information that a user might need.
Here’s an example of a comment (for obvious code):