I have a thing for Python and pandas when it comes to data analysis. However, my friend Matt Groves is coordinating CSAdvent. So in sticking with the C# theme, I’m seeing what’s out there for data analysis in the C# space. I came across this blog post on the .NET blog that introduced a DataFrame and decided to explore this more. Microsoft.Data.Analysis
is available on the NuGet Gallery: NuGet Gallery | Microsoft.Data.Analysis 0.19.0!
In this adventure, we will explore the December 2021 Public Holidays data source from the Public Holidays Dataset on Azure Open Datasets. We are using Microsoft.Data.Analysis
to work with data in C#, hopefully with similar functionality to pandas with a C# mindset. Let’s see what we can find!
You can follow along by creating a .NET interactive notebook from scratch or check it out within my Graphmas repo.
If you are following along in your own notebook, you will need to bring in the Microsoft.Data.Analysis package with the following command:
#r "nuget: Microsoft.Data.Analysis, 0.19.0"
Disclaimer: I still haven’t found any non-Python examples for the Azure Open Datasets. So I used Python + pandas to import the data and then export it to a CSV. Our Python dataframe is known as hol_df
and may be referred to throughout this post. We are working with the December 2021 Public Holidays export. The code below will bring in the data from the exported holidays file.
using Microsoft.Data.Analysis;
// Load data like pd.read_csv()
var holidayData = DataFrame.LoadCsv("DecemberPublicHolidays.csv",separator:',',header:true);
Exploring the Data
When I first am dropped in a dataset, there are some metrics I like to explore – including row and column counts, descriptions, and info. Let’s look at how to do this with Microsoft.Data.Analysis
. When there are important differences, I will include how it differs from the output in pandas.
Counts
Let’s look at row and column counts.
// How many rows?
holidayData.Rows.Count
107
// How many columns?
holidayData.Columns.Count
7
The one thing I haven’t seen count-wise in Microsoft.Data.Analysis
that I do have in pandas is .shape
. This includes the number of rows and the number of columns. Calling .shape
on my pandas DataFrame returns this result:
(107, 6)
Notice that the pandas DataFrame shows 6 columns. The index column is not counted in its results. You should be aware of that when we look at the column details.
Column Details
When it comes to looking at the imported data, notice how the Columns
property of a DataFrame
includes the index
, type
, and value
. I’m excited to see that it imported the datetimes as a DateTime type.
// What are the columns?
holidayData.Columns
The pandas
equivalent for showing columns is: .columns
.
This is what the output looks like from pandas
:
Index(['countryOrRegion', 'holidayName', 'normalizeHolidayName', 'isPaidTimeOff', 'countryRegionCode', 'date'], dtype='object')
So the Microsoft.Data.Analysis
output gives us more to go on – including data type and sample data.
Information
We can learn about columns and their non-null observations by calling .info()
in pandas
. The output from pandas
looks like this:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 26952 to 27058
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 countryOrRegion 107 non-null object
1 holidayName 107 non-null object
2 normalizeHolidayName 107 non-null object
3 isPaidTimeOff 8 non-null object
4 countryRegionCode 91 non-null object
5 date 107 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 5.9+ KB
Let’s look at the .Info()
from Microsoft.Data.Analysis
.
holidayData.Info()
While the C# equivalent shows the values horizontal rather than vertical, the .Info()
output is close. However, notice that Python has different non-null values. I suspect C# is treating empty values as empty strings rather than nulls. Let’s look at the data.
Peeking at the Data
We can peek at the data in pandas
by using head()
. This is what the first 5 records looks like in Python:
Note: We generated the table by calling hol_df.head().to_html()
. Thanks for that help, pandas
!
How do we do peek at the data in C#? Let’s see if following the pattern makes sense.
holidayData.Head()
Error: (1,13): error CS7036: There is no argument given that corresponds to the required formal parameter 'numberOfRows' of 'DataFrame.Head(int)'
As of this writing there isn’t a default value for .Head()
in Microsoft.Data.Analysis
. Let’s pass in 5.
holidayData.Head(5)
There seems to be generated index values versus row numbers. The columns look fairly similar in values except for the isPaidTimeOff
column – which shows None
in pandas
and is an empty string in the C# output.
Much like there’s .head()
, there’s also .tail()
to look at the last entries. Microsoft.Data.Analysis
has .Tail()
, which also does not have a default value assigned to it. So if you are working with this, make sure to pass in the number of rows to return.
Looking at Values and Counts
In Python, we can call the column off the Data Frame. Can we do that in C#?
holidayData.countryOrRegion
Error: (1,13): error CS1061: 'DataFrame' does not contain a definition for 'countryOrRegion' and no accessible extension method 'countryOrRegion' accepting a first argument of type 'DataFrame' could be found (are you missing a using directive or an assembly reference?)
So dot-notation is out. Let’s try subscripting. In pandas
, that looks like this: hol_df['countryOrRegion']
.
holidayData["countryOrRegion"]
So we can use subscripting with double quotes. Can we get counts for each of the values?
Consider the following Python:
hol_df['countryOrRegion'].value_counts()
This is a sample of the output:
Sweden 7
Norway 5
Wales 4
Australia 4
Finland 4
...
Czech 3
Italy 3
Mexico 3
Denmark 2
Germany 2
Croatia 2
...
Ukraine 1
Belgium 1
Brazil 1
Now let’s look at this with .ValueCounts()
for DataFrameColumn
in Microsoft.Data.Analysis
.
holidayData["countryOrRegion"].ValueCounts()
Python’s results came pre-sorted. Can we sort the C# results? I don’t know if .ValueCounts()
is returning a DataFrame. Let’s try something.
holidayData["countryOrRegion"].ValueCounts().Sort("Counts")
Error: (1,46): error CS1061: 'DataFrame' does not contain a definition for 'Sort' and no accessible extension method 'Sort' accepting a first argument of type 'DataFrame' could be found (are you missing a using directive or an assembly reference?)
This confirms my suspicions that .ValueCounts()
is returning a DataFrame. But… wait a minute. That blog post mentioned a Sort method. This error message makes it sound like it isn’t supported, at least not as it blogged.
Thankfully, Microsoft.Data.Analysis
is open source. I did some digging to find where .Sort()
lives and found it in Microsoft.Data.Analysis/DataFrame.cs on GitHub. Sure enough, there’s no default value on the isAscending
parameter. To make things more confusing, I could use OrderBy()
with just the column name, where the isAscending
is defaulted to true
. Or I can call OrderByDescending()
, where the isAscending
is defaulted to false
. So let’s sort our value counts:
holidayData["countryOrRegion"].ValueCounts().OrderByDescending("Counts")
Conclusion
The more I looked at the GitHub repo, the more I realize that Microsoft.Data.Analysis
is part of the ML.NET data prep story, based on this issue with a question on Microsoft.Data.Analysis package development. So it’s very much a work in process. I’m going to stay tuned to the ML.NET data prep / wrangling / exploration story, as I would love to see a solution in C# similar to the pandas
capabilities in Python.
3 Replies to “Exploring Microsoft.Data.Analysis for C# in a .NET Interactive Notebook”