
12 Python Features Every Data Scientist Should Know

Master the essential Python fundamentals

Benedict Neo
bitgrit Data Science Publication
6 min read · May 23, 2023


As a data scientist, you’re no stranger to the power of Python.

From data wrangling to machine learning, Python has become the de facto language for data science. But are you taking advantage of all the features that Python has to offer?

In this article, we’ll take a deep dive into 12 Python features that every data scientist should know.

From comprehensions to data classes, these features will help you write more efficient, readable, and maintainable code.

Get the Code @ Deepnote notebook

1. Comprehensions

Comprehensions in Python are a useful tool for machine learning and data science tasks as they allow for the creation of complex data structures in a concise and readable manner.

List comprehensions can be used to generate lists of data, such as creating a list of squared values from a range of numbers.

Nested list comprehensions can be used to flatten multidimensional arrays, a common preprocessing task in data science.

Dictionary and set comprehensions are useful for creating dictionaries and sets of data, respectively. For example, a dictionary comprehension can be used to create a dictionary of feature names and their corresponding feature importance scores in a machine learning model.

Generator comprehensions are particularly useful for working with large datasets, as they generate values on-the-fly rather than creating a large data structure in memory. This can help to improve performance and reduce memory usage.
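Here’s a quick sketch of what each flavor looks like (the variable and feature names are purely illustrative):

```python
# List comprehension: squared values from a range of numbers
squares = [x ** 2 for x in range(10)]

# Nested list comprehension: flatten a 2D list
matrix = [[1, 2, 3], [4, 5, 6]]
flat = [value for row in matrix for value in row]

# Dictionary comprehension: feature names -> (made-up) importance scores
names = ["age", "income", "tenure"]
scores = [0.42, 0.35, 0.23]
importance = {name: score for name, score in zip(names, scores)}

# Set comprehension: unique word lengths
lengths = {len(word) for word in ["data", "science", "python"]}

# Generator comprehension: values are produced lazily, one at a time
lazy_squares = (x ** 2 for x in range(10_000_000))  # no giant list in memory
print(next(lazy_squares))  # 0
```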

2. Enumerate

enumerate is a built-in function that allows for iterating over a sequence (such as a list or tuple) while keeping track of the index of each element.

This can be useful when working with datasets, as it allows for easily accessing and manipulating individual elements while keeping track of their index position.

Here we use enumerate to iterate over a list of strings and print out the value if the index is an even number.
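A minimal sketch of that:

```python
fruits = ["apple", "banana", "cherry", "date"]

for index, value in enumerate(fruits):
    if index % 2 == 0:  # even index positions: 0, 2, ...
        print(index, value)
```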

3. Zip

zip is a built-in function that allows iterating over multiple sequences (such as lists or tuples) in parallel.

Below we use zip to iterate over two lists x and y simultaneously and perform operations on their corresponding elements.

In this case, it prints out the values of each element in x and y, their sum, and their product.
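Something like this:

```python
x = [1, 2, 3]
y = [4, 5, 6]

for a, b in zip(x, y):
    print(f"x={a}, y={b}, sum={a + b}, product={a * b}")
```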

4. Generators

Generators in Python are a type of iterable that allows for generating a sequence of values on-the-fly, rather than generating all the values at once and storing them in memory.

This makes them useful for working with large datasets that won’t fit in memory, as the data is processed in small chunks or batches rather than all at once.

Below we use a generator function to generate the first n numbers in the Fibonacci sequence. The yield keyword is used to generate each value in the sequence one at a time, rather than generating the entire sequence at once.
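A sketch of such a generator:

```python
def fibonacci(n):
    """Yield the first n Fibonacci numbers, one at a time."""
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

print(list(fibonacci(10)))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```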

5. Lambda functions

lambda is a keyword used to create anonymous functions, which are functions that do not have a name and can be defined in a single line of code.

They are useful for defining custom functions on-the-fly for feature engineering, data preprocessing, or model evaluation.

Below we use lambda to create a simple function for filtering even numbers from a list of numbers.
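For example:

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

evens = list(filter(lambda n: n % 2 == 0, numbers))
print(evens)  # [2, 4, 6, 8, 10]
```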

Here’s another code snippet for using lambda functions with Pandas.
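Something along these lines (the DataFrame and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 250, 40], "quantity": [3, 1, 10]})

# Feature engineering on the fly with a lambda
df["revenue"] = df.apply(lambda row: row["price"] * row["quantity"], axis=1)
print(df)
```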

Speaking of Pandas, here are 40 snippets that’ll come in handy.

6. Map, filter, reduce

map and filter are built-in functions, and reduce lives in the functools module; together they form a classic trio for manipulating and transforming data.

map applies a function to each element of an iterable, filter selects elements from an iterable based on a condition, and reduce applies a function cumulatively to the elements of an iterable to combine them into a single result.

Below we use all of them in a single pipeline, calculating the sum of squares of even numbers.
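One way to write that pipeline:

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]

evens = filter(lambda n: n % 2 == 0, numbers)    # keep the even numbers
squares = map(lambda n: n ** 2, evens)           # square each of them
total = reduce(lambda acc, n: acc + n, squares)  # sum the squares

print(total)  # 2**2 + 4**2 + 6**2 = 56
```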

7. Any and all

any and all are built-in functions that allow for checking if any or all elements in an iterable meet a certain condition.

any and all can be useful for checking if certain conditions are met across a dataset or a subset of a dataset. For example, they can be used to check if any values in a column are missing or if all values in a column are within a certain range.

Below is a simple example of checking for the presence of any even values and all odd values.
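A minimal version of that check:

```python
numbers = [1, 3, 5, 7, 8]

print(any(n % 2 == 0 for n in numbers))  # True: 8 is even
print(all(n % 2 == 1 for n in numbers))  # False: 8 is not odd
```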

8. next

next is used to retrieve the next item from an iterator. An iterator is an object that produces its elements one at a time; you can get one from any iterable (such as a list, tuple, set, or dictionary) by calling iter() on it.

next is commonly used in data science for iterating through an iterator or generator object. It allows the user to retrieve the next item from the iterable and can be useful for handling large datasets or streaming data.

Below, we define a generator random_numbers() that yields random numbers between 0 and 1. We then use the next() function to find the first number in the generator greater than 0.9.
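A sketch of that pattern:

```python
import random

def random_numbers():
    """Yield random floats between 0 and 1, forever."""
    while True:
        yield random.random()

gen = random_numbers()
first_big = next(x for x in gen if x > 0.9)  # pull values until one exceeds 0.9
print(first_big)
```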

9. defaultdict

defaultdict is a subclass of the built-in dict class that allows for providing a default value for missing keys.

defaultdict can be useful for handling missing or incomplete data, such as when working with sparse matrices or feature vectors. It can also be used for counting the frequency of categorical variables.

An example is counting the frequency of items in a list. int is used as the default factory for the defaultdict, which initializes missing keys to 0.
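For instance:

```python
from collections import defaultdict

items = ["apple", "banana", "apple", "cherry", "banana", "apple"]

counts = defaultdict(int)  # missing keys default to int() == 0
for item in items:
    counts[item] += 1

print(dict(counts))  # {'apple': 3, 'banana': 2, 'cherry': 1}
```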

10. partial

partial is a function in the functools module that allows for creating a new function from an existing function with some of its arguments pre-filled.

partial can be useful for creating custom functions or data transformations with specific parameters or arguments pre-filled. This can help to reduce the amount of boilerplate code needed when defining and calling functions.

Here we use partial to create a new function increment from the existing add function with one of its arguments fixed to the value 1.
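A minimal version:

```python
from functools import partial

def add(a, b):
    return a + b

increment = partial(add, 1)  # pre-fill the first argument with 1
print(increment(1))  # 2
```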

Calling increment(1) is essentially calling add(1, 1).

11. lru_cache

lru_cache is a decorator function in the functools module that allows for caching the results of functions with a limited-size cache.

lru_cache can be useful for optimizing computationally expensive functions or model training procedures that may be called with the same arguments multiple times.

Caching can help to speed up the execution of the function and reduce the overall computational cost.

Here’s an example of efficiently computing Fibonacci numbers with a cache (known as memoization in computer science).
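A sketch of that idea:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # cache every result (memoization)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(100))  # 354224848179261915075, computed almost instantly
```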

Speaking of decorators, you’ll find a decorator to time your Python code in our recent article below 👇

12. Dataclasses

The @dataclass decorator automatically generates several special methods for a class, such as __init__, __repr__, and __eq__, based on the defined attributes.

This can help to reduce the amount of boilerplate code needed when defining classes. dataclass objects can represent data points, feature vectors, or model parameters, among other things.

In this example, dataclass is used to define a simple class Person with three attributes: name, age, and city.
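For example (the attribute values are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int
    city: str

p = Person("Ada", 36, "London")
print(p)  # Person(name='Ada', age=36, city='London')
print(p == Person("Ada", 36, "London"))  # True, thanks to the generated __eq__
```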

That’s all for this article.

Did I miss out on any other must-know features? Leave them in the comments below!

Want more?

Check out our guide on data cleaning using Python.

Be sure to follow the bitgrit Data Science Publication to keep updated!

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!

Follow bitgrit below to stay updated on workshops and upcoming competitions!

Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube
