Validate Data With Pydantic
Easily Validate User-Generated Data Using Pydantic
How to us Pydantic to validate Excel data
Using Pydantic to validate Excel data
As a data engineer, I frequently encounter situations where I have built pipelines and other automations based on user-generated data from Excel. Excel’s flexibility allows it to be used by a wide variety of users, but unfortunately, that flexibility leads to invalid data entering the pipeline. Before I discovered Pydantic, I wrote incredibly complex Pandas functions to check and filter data to make sure it was valid.
What is Pydantic?
Pydantic is a Python library that lets you define a data model in a Pythonic way, and use that model to validate data inputs, mainly through using type hints.
Sample data:
Before we get going, let’s examine our sample data; a spreadsheet of RPG characters I created using random name generators:
Column headers in bold indicate required fields, and we can see that some of the required fields are missing. Also, some of our text fields, like Race
, Class
and Gender
should be restricted to specific words instead of being free-text fields.
Starting our data model
Pydantic has a number of starting points for a data model, but ours is pretty simple so we are going to use pydantic.BaseMode
l:
import pydantic
class RpgCharacterModel(pydantic.BaseModel):
DATE: datetime
NAME: str
GENDER: str
RACE: str
CLASS: str
HOME: str
GUILD: str
PAY: int
The syntax is pretty simple. After defining the class and inheriting from our base model, we enter each of our field names and provide a type hint.
While this is a good start, this model has some severe limitations. Right now it will accept any string for a number of fields that we would prefer to be restricted to certain values.
Enums to control string fields
Type hints in Pydantic are more powerful when working with other class types in Python. One of these powerful features is being able to limit string entries by defining Enums and passing the Enum as a type hint.
Now we define those choices as Enums:
import enum
class GenderEnum(enum.Enum):
M = 'M'
F = 'F'
NB = 'NB'
class ClassEnum(enum.Enum):
Druid = 'Druid'
Fighter = 'Fighter'
Warlock = 'Warlock'
Ranger = 'Ranger'
Bard = 'Bard'
Sorcerer = 'Sorcerer'
Paladin = 'Paladin'
Rogue = 'Rogue'
Wizard = 'Wizard'
Monk = 'Monk'
Barbarian = 'Barbarian'
Cleric = 'Cleric'
class RaceEnum(enum.Enum):
Human = 'Human'
Dwarf = 'Dwarf'
Halfling = 'Halfling'
Elf = 'Elf'
Dragonborn = 'Dragonborn'
Tiefling = 'Tiefling'
Half_Orc = 'Half-Orc'
Gnome = 'Gnome'
Half_Elf = 'Half-Elf'
Now, we can use these enums to be much more specific with our data model:
class RpgCharacterModel(pydantic.BaseModel):
DATE: datetime
NAME: str
GENDER: GenderEnum
RACE: RaceEnum
CLASS: ClassEnum
HOME: str
GUILD: str
PAY: int
Our data model is looking better, but there are two more things we need to do to really tighten it up.
- Specify required fields
- Specify a range for pay. Numbers are a common place for data entry mistakes
Pydantic offers an optional function that lets us define even more specificity in our data model.
Field() for even greater specificity.
When defining a field in our data model, we can call the function Field() to specify additional options, including whether or not a field is required, and setting up limitations on numeric inputs.
import pydantic
class RpgCharacterModel(pydantic.BaseModel):
DATE: datetime
NAME: str = pydantic.Field(...)
GENDER: GenderEnum
RACE: RaceEnum = pydantic.Field(...)
CLASS: ClassEnum = pydantic.Field(...)
HOME: str
GUILD: str
PAY: int = pydantic.Field(..., ge=1, le=500)
Passing ...
as the first argument to Field indicates that it is required.
- the
ge
keyword meansGreater than or equal to
- the
le
keyword meansLess than or equal to
Using the model
Now we’ve done all the work to define the model, we need to use it.
Pydantic models expect to receive JSON-like data, so any data we pass to our model for validation must be a dictionary.
For our data validation, we need to do the following:
- Receive a DataFrame as an input
- Convert it to a list of dictionaries (one dictionary per row)
- Run each row through data validation
- Append successfully validated rows to one list
- Append failed rows to another list, with the row number and the error message
import pandas as pd
import pydantic
def validate_df_data(df: pd.DataFrame, model: pydantic.BaseModel, index_offset: int = 2) -> tuple[list, list]:
# Python index starts at 0, excel at 1, and 1 row for the header in Excel
#capturing our good data and our bad data
good_data = []
bad_data = []
df_rows = df.to_dict(orient='records')
for index, row in enumerate(df_rows):
try:
model(**row) #unpacks our dictionary into our keyword arguments
good_data.append(row) #appends valid data to a new list of dictionaries
except pydantic.ValidationError as e:
# Adds all validation error messages associated with the error
# and adds them to the dictionary
row['Errors'] = [error_message['msg'] for error_message in e.errors()]
row['Error_row_num'] = index + index_offset
bad_data.append(row) #appends bad data to a different list of dictionaries
return (good_data, bad_data)
Using this function, we can process the good rows of data, and return the bad rows of data for QA, modification, and resubmission.
Full code
# Standard Library imports
from datetime import datetime
import enum
# 3rd Party package imports
import pandas as pd
import pydantic
# Enums for limiting string data in our model
class GenderEnum(enum.Enum):
M = 'M'
F = 'F'
NB = 'NB'
class ClassEnum(enum.Enum):
Druid = 'Druid'
Fighter = 'Fighter'
Warlock = 'Warlock'
Ranger = 'Ranger'
Bard = 'Bard'
Sorcerer = 'Sorcerer'
Paladin = 'Paladin'
Rogue = 'Rogue'
Wizard = 'Wizard'
Monk = 'Monk'
Barbarian = 'Barbarian'
Cleric = 'Cleric'
class RaceEnum(enum.Enum):
Human = 'Human'
Dwarf = 'Dwarf'
Halfling = 'Halfling'
Elf = 'Elf'
Dragonborn = 'Dragonborn'
Tiefling = 'Tiefling'
Half_Orc = 'Half-Orc'
Gnome = 'Gnome'
Half_Elf = 'Half-Elf'
class RpgCharacterModel(pydantic.BaseModel):
DATE: datetime
NAME: str = pydantic.Field(...)
GENDER: GenderEnum
RACE: RaceEnum = pydantic.Field(...)
CLASS: ClassEnum = pydantic.Field(...)
HOME: str
GUILD: str
PAY: int = pydantic.Field(..., ge=1, le=500)
def validate_df_data(df: pd.DataFrame, model: pydantic.BaseModel, index_offset: int = 2) -> tuple[list, list]:
# Python index starts at 0, excel at 1, and 1 row for the header in Excel
#capturing our good data and our bad data
good_data = []
bad_data = []
df_rows = df.to_dict(orient='records')
for index, row in enumerate(df_rows):
try:
model(**row) #unpacks our dictionary into our keyword arguments
good_data.append(row) #appends valid data to a new list of dictionaries
except pydantic.ValidationError as e:
# Adds all validation error messages associated with the error
# and adds them to the dictionary
row['Errors'] = [error_message['msg'] for error_message in e.errors()]
# Python index starts at 0, excel at 1, and 1 row for the header in excel
row['Error_row_num'] = index + index_offset
bad_data.append(row) #appends bad data to a different list of dictionaries
return (good_data, bad_data)
df = pd.read_excel('sample_dnd_character_data.xlsx')
valid_data, invalid_data = validate_df_data(df, RpgCharacterModel, index_offset=2)
All code plus sample data can be found in my GitHub repo
Conclusion and next steps
Pydantic is exceptionally powerful, and while this was a simple example, it can handle complex nested models. This really allows a lot of granularity with data validation without writing a ton of code.
As an additional benefit, modeling the data really helps you understand it, rather than just taking whatever is thrown at you.
While this tutorial focused on Pandas, you can use Pydantic to validate most forms of data inputs with Python.