
Scientific Programming with Python

Part 5: Demographic Data

Previous: Simulating a Magnet

Current: Demographic Data

 


In the natural sciences, tabular data results from all kinds of measurements in physics, chemistry, biology, geology, and environmental science.

In the quantitative social sciences, from economics to sociology and psychology, survey and demographic data provide the background necessary to understand relationships between people, institutions, and geography.

The Pandas package makes it much easier to work with such data in Python, and the statsmodels package provides statistical tools for analyzing your data and building basic models. The Scikit-Learn library provides additional tools for building machine-learning models for predictive purposes.



Working with Tabular Data

The Pandas module provides new data types and specialized features for expressive handling of tabular data, even multidimensional data.


Ames Housing

As an example dataset, let’s consider a collection of data about housing in Ames, Iowa between 2006 and 2010, which you can view at this site: http://jse.amstat.org/v19n3/decock/AmesHousing.txt. Since it’s a text file, you’ll initially see the entire dataset:

Order	PID	MS SubClass	MS Zoning	Lot Frontage	Lot Area	Street	Alley	
Lot Shape	Land Contour	Utilities	Lot Config	Land Slope	Neighborhood
Condition 1	Condition 2	Bldg Type	House Style	Overall Qual	Overall Cond		
Year Built	Year Remod/Add	Roof Style	Roof Matl	Exterior 1st	Exterior 2nd	
Mas Vnr Type	Mas Vnr Area	Exter Qual	Exter Cond	Foundation	Bsmt Qual	
Bsmt Cond	Bsmt Exposure	BsmtFin Type 1	BsmtFin SF 1	BsmtFin Type 2	BsmtFin SF 2	
Bsmt Unf SF	Total Bsmt SF	Heating	Heating QC	Central Air	Electrical	
1st Flr SF	2nd Flr SF	Low Qual Fin SF	Gr Liv Area	Bsmt Full Bath	Bsmt Half Bath	
Full Bath	Half Bath	Bedroom AbvGr	Kitchen AbvGr	Kitchen Qual	TotRms AbvGrd	
Functional	Fireplaces	Fireplace Qu	Garage Type	Garage Yr Blt	Garage Finish	
Garage Cars	Garage Area	Garage Qual	Garage Cond	Paved Drive	Wood Deck SF	
Open Porch SF	Enclosed Porch	3Ssn Porch	Screen Porch	Pool Area	Pool QC	Fence	
Misc Feature	Misc Val	Mo Sold	Yr Sold	Sale Type	Sale Condition	SalePrice
1	0526301100	020	RL	141	31770	Pave	NA	IR1	Lvl	
AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	
6	5	1960	1960	Hip	CompShg	BrkFace	Plywood	Stone
112	TA	TA	CBlock	TA	Gd	Gd	BLQ	639	Unf
0	441	1080	GasA	Fa	Y	SBrkr	1656	0	0	1656	
1	0	1	0	3	1	TA	7	Typ	2	Gd
Attchd	1960	Fin	2	528	TA	TA	P	210	62	0
0	0	0	NA	NA	NA	0	5	2010	WD	Normal	215000
…    …     …

There are 80 characteristics of each house (plus two identifying columns, Order and PID), each separated by the tab character, represented in Python by \t.

There are also 2930 listings, so it’s a good-sized dataset.
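If you'd like to inspect the raw file programmatically, the standard library's urllib module can fetch just the first line, which can then be split on the tab character (a minimal sketch; it requires network access):

import urllib.request

url = 'http://jse.amstat.org/v19n3/decock/AmesHousing.txt'
with urllib.request.urlopen(url) as f:
    # the first line of the file holds the column names
    header = f.readline().decode('utf-8').rstrip()

print(header.split('\t')[:5])    # the first few column names
print(len(header.split('\t')))   # 82: Order, PID, and the 80 characteristics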

You can learn more about it in this project overview: http://jse.amstat.org/v19n3/decock.pdf and in this data description: http://jse.amstat.org/v19n3/decock/DataDocumentation.txt.

Acknowledgment: This example is borrowed from the book Machine Learning with PyTorch and Scikit-Learn.


Pandas

We can pull this information into Python by first loading the Numpy and Pandas packages (the latter depends on the former):

import numpy as np
import pandas as pd

In Spyder, create a new blank document and paste these lines in; then save the document as AmesIowaHousing.py in your Scientific Python folder.

Set this document’s location as the console working directory (using the menu ≡ Options in its upper right corner).

Pandas is a full-fledged data manipulation extension for Python, and is very popular with data scientists. Here’s how it describes itself:

print(pd.__doc__)

⇒              
pandas - a powerful data analysis and manipulation library for Python
=====================================================================

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

Main Features
-------------
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data.
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects.
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets.
  - Intuitive merging and joining data sets.
  - Flexible reshaping and pivoting of data sets.
  - Hierarchical labeling of axes (possible to have multiple labels per tick).
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format.
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, date shifting and lagging.            

The full Pandas documentation is here: https://pandas.pydata.org/pandas-docs/stable/. The Pandas Cheat Sheet, available on the same site, is also really useful!

We saw how Numpy makes it easier to work with tabular and other multidimensional data. Pandas makes it easy to filter such data based on its characteristics, and also makes it easier to work with categorical tabular or multidimensional data, including data with missing values (represented in Python by special values such as None or NaN, described below).

Pandas also provides a number of tools for importing data in standard file formats, including comma-separated values (CSV) and tab-separated values like the Ames Housing data.

Let’s import the housing data above into Pandas, directly from the Web, limiting ourselves to a specific set of columns. Copy these lines into your Python file, and run the file:

columns = ['Neighborhood', 'Overall Cond', 'Gr Liv Area', 'Year Built', 
            'Central Air', 'Total Bsmt SF', 'SalePrice']
housing  = pd.read_csv('http://jse.amstat.org/v19n3/decock/AmesHousing.txt', 
                       sep = '\t', usecols = columns)

By default, the method pandas.read_csv assumes that the first row is a set of column headers. The keyword argument and value header=None can be used to prevent this.

The keyword argument usecols = columns tells it to select just the particular columns in that list. These names are matched against the file’s header text, so they are written as character strings here.

The keyword argument sep = '\t' tells Pandas to recognize a tab character as the column separator instead of the comma.

The keyword argument skiprows = n could be used to ignore the first n rows of a file, where metadata is often written. A combined example follows below.
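As a quick, self-contained illustration of these keyword arguments, here is a made-up miniature file (with io.StringIO standing in for a real file on disk):

import io

raw = 'created 2024\nby ATS\n\n1\t215000\n2\t105000\n'
miniature = pd.read_csv(io.StringIO(raw), sep='\t',  # tab-separated columns
                        header=None,                 # no header row in the data...
                        names=['id', 'price'],       # ...so supply column names
                        skiprows=3)                  # skip the 3 lines of metadata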

Pandas imports such files into a new data type that it defines, the data frame:

type(housing)

⇒
pandas.core.frame.DataFrame

To find out more about your data, simply type its name; Python will print its first few and last few rows and columns, and report its overall size:

housing

⇒
     Neighborhood  Overall Cond  Year Built  ...  Central Air Gr Liv Area  SalePrice
0           NAmes             5        1960  ...            Y        1656     215000
1           NAmes             6        1961  ...            Y         896     105000
2           NAmes             6        1958  ...            Y        1329     172000
3           NAmes             5        1968  ...            Y        2110     244000
4         Gilbert             5        1997  ...            Y        1629     189900
          ...           ...         ...  ...          ...         ...        ...
2925      Mitchel             6        1984  ...            Y        1003     142500
2926      Mitchel             5        1983  ...            Y         902     131000
2927      Mitchel             5        1992  ...            Y         970     132000
2928      Mitchel             5        1974  ...            Y        1389     170000
2929      Mitchel             5        1993  ...            Y        2000     188000

[2930 rows x 7 columns]

The pd.read_csv() function automatically converts recognized numerical values from text to numbers.

To see all of the columns, set this option:

pd.set_option("display.max_columns", 7)

housing
⇒
     Neighborhood  Overall Cond  Year Built  Total Bsmt SF Central Air  \
0           NAmes             5        1960         1080.0           Y   
1           NAmes             6        1961          882.0           Y   
2           NAmes             6        1958         1329.0           Y   
3           NAmes             5        1968         2110.0           Y   
4         Gilbert             5        1997          928.0           Y   
          ...           ...         ...            ...         ...   
2925      Mitchel             6        1984         1003.0           Y   
2926      Mitchel             5        1983          864.0           Y   
2927      Mitchel             5        1992          912.0           Y   
2928      Mitchel             5        1974         1389.0           Y   
2929      Mitchel             5        1993          996.0           Y   

      Gr Liv Area  SalePrice  
0            1656     215000  
1             896     105000  
2            1329     172000  
3            2110     244000  
4            1629     189900  
          ...        ...  
2925         1003     142500  
2926          902     131000  
2927          970     132000  
2928         1389     170000  
2929         2000     188000  

[2930 rows x 7 columns]

Note how the first row of header information is used to label each column.

There are 7 columns of data, five of which are numerical and two of which are qualitative and non-ordinal.

In statistical language, the columns are known as variables, though data scientists instead call them features, probably to distinguish that concept from the computing sense of the word variable. However, this is confusing to geographers, who use the term feature for the rows in a dataset! We can also, more distinctly, call the columns observations.

There are 2930 rows of data (i.e. the houses in the dataset); in statistical language, these are known as samples of each of the observations.

Each row is labeled, by default, with a numerical index, starting with 0 for the first. These don’t actually have to be numbers, though.
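To see this, here is a tiny made-up data frame whose rows are labeled with strings instead; the property loc looks rows up by their label, just as iloc (below) looks them up by their position:

grades = pd.DataFrame({'midterm': [88, 92, 79], 'final': [91, 85, 88]},
                      index=['Alice', 'Bob', 'Carol'])
grades.loc['Bob']

⇒
midterm    92
final      85
Name: Bob, dtype: int64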

Note that the labels and row numbers “frame” the data, hence the name DataFrame.

A DataFrame is a Python object, and it therefore has both properties and methods associated with it.

The descriptive statistics of the numerical columns of a DataFrame can be quickly viewed with the method .describe():

housing.describe()

⇒
       Overall Cond   Year Built  Total Bsmt SF  Gr Liv Area      SalePrice
count   2930.000000  2930.000000    2929.000000  2930.000000    2930.000000
mean       5.563140  1971.356314    1051.614544  1499.690444  180796.060068
std        1.111537    30.245361     440.615067   505.508887   79886.692357
min        1.000000  1872.000000       0.000000   334.000000   12789.000000
25%        5.000000  1954.000000     793.000000  1126.000000  129500.000000
50%        5.000000  1973.000000     990.000000  1442.000000  160000.000000
75%        6.000000  2001.000000    1302.000000  1742.750000  213500.000000
max        9.000000  2010.000000    6110.000000  5642.000000  755000.000000

The value 50%, with half of the data below and half above, is what is known as the median value. For relatively symmetric data, the median will be close to the mean (average) value.
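For example, the median sale price can be computed directly, matching the 50% row above:

housing.SalePrice.median()

⇒
160000.0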

The columns are themselves properties of the dataframe, and can be referenced directly by appending a dot . and the feature name:

housing.SalePrice

⇒
0       215000
1       105000
2       172000
3       244000
4       189900
         ...
2925    142500
2926    131000
2927    132000
2928    170000
2929    188000
Name: SalePrice, Length: 2930, dtype: int64

Because some of the column names have embedded spaces, we need to use the alternative index syntax to reference them, with the label as a character string inside of square brackets:

housing['Gr Liv Area']

⇒
0       1656
1        896
2       1329
3       2110
4       1629
        ...
2925    1003
2926     902
2927     970
2928    1389
2929    2000
Name: Gr Liv Area, Length: 2930, dtype: int64

The individual columns are a one-dimensional data type called a Series, which you can think of as a labeled list.

type(housing.SalePrice)

⇒
pandas.core.series.Series

Both series and dataframes are capable of holding any type of Python data (integers, strings, floating point numbers, Python objects, etc.).

Because the rows are numbered, we can reference them with the property iloc (“integer location”), which is indexed like an array, so it requires square brackets:

housing.iloc[0]

⇒
Neighborhood      NAmes
Overall Cond          5
Year Built         1960
Total Bsmt SF    1080.0
Central Air           Y
Gr Liv Area        1656
SalePrice        215000
Name: 0, dtype: object

But this also means we can use slice notation, e.g. to get the first two rows:

housing.iloc[0:2]

⇒
  Neighborhood  Overall Cond  Year Built  Total Bsmt SF Central Air  \
0        NAmes             5        1960         1080.0           Y   
1        NAmes             6        1961          882.0           Y   

   Gr Liv Area  SalePrice  
0         1656     215000  
1          896     105000  

And to get one particular value, the columns also have a numerical index:

housing.iloc[1,3]

⇒
882.0

The corresponding series is indexable as well, so the same value can be reached via the column:

housing['Total Bsmt SF'][1]

⇒
882.0

To extract an entire column as a series with .iloc[], use a full slice for the rows (as with Numpy):

housing.iloc[:,3]

⇒
0       1080.0
1        882.0
2       1329.0
3       2110.0
4        928.0
         ...
2925    1003.0
2926     864.0
2927     912.0
2928    1389.0
2929     996.0
Name: Total Bsmt SF, Length: 2930, dtype: float64

Manipulating Data

Common data tasks include sorting values:

housing.sort_values(by='SalePrice')

⇒
     Neighborhood  Overall Cond  Year Built  Total Bsmt SF Central Air  \
181       OldTown             2        1923          678.0           N   
1553       IDOTRR             5        1952            0.0           N   
726        IDOTRR             5        1920          720.0           N   
2843      Edwards             3        1922          498.0           N   
2880       IDOTRR             3        1949          480.0           N   
          ...           ...         ...            ...         ...   
44        NridgHt             5        2009         2330.0           Y   
1063      NridgHt             5        2003         2535.0           Y   
2445      NoRidge             5        1995         1930.0           Y   
1760      NoRidge             5        1996         2396.0           Y   
1767      NoRidge             6        1994         2444.0           Y   

      Gr Liv Area  SalePrice  
181           832      12789  
1553          733      13100  
726           720      34900  
2843          498      35000  
2880          480      35311  
          ...        ...  
44           2364     611657  
1063         2470     615000  
2445         3627     625000  
1760         4476     745000  
1767         4316     755000  

[2930 rows x 7 columns]

Individual series behave like Numpy arrays, supporting elementwise arithmetic and comparisons, so they can be used in broadcasting expressions, for example:

cheap = housing.SalePrice < 40000
cheap

⇒
0       False
1       False
2       False
3       False
4       False
         ...
2925    False
2926    False
2927    False
2928    False
2929    False
Name: SalePrice, Length: 2930, dtype: bool

We can then use such logical series (or any logical list of the same length) to filter the data frame by including it inside of square brackets:

housing[cheap]

⇒
     Neighborhood  Overall Cond  Year Built  Total Bsmt SF Central Air  \
181       OldTown             2        1923          678.0           N   
709       OldTown             6        1910          600.0           N   
726        IDOTRR             5        1920          720.0           N   
1553       IDOTRR             5        1952            0.0           N   
1901      BrkSide             3        1946            0.0           N   
2843      Edwards             3        1922          498.0           N   
2880       IDOTRR             3        1949          480.0           N   

      Gr Liv Area  SalePrice  
181           832      12789  
709           968      37900  
726           720      34900  
1553          733      13100  
1901          334      39300  
2843          498      35000  
2880          480      35311
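Logical series can also be combined with the operators & (and), | (or), and ~ (not), with parentheses around each comparison. For example, this expression would select just the pre-1930 houses from the list above:

housing[(housing.SalePrice < 40000) & (housing['Year Built'] < 1930)]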

Another common practice is to summarize data by some characteristic, such as the column Neighborhood, and calculate an associated value, e.g. the mean sale price:

SalePriceByNbhd = housing.SalePrice.groupby(by=housing.Neighborhood).mean()
SalePriceByNbhd

⇒
Neighborhood
Blmngtn    196661.678571
Blueste    143590.000000
BrDale     105608.333333
BrkSide    124756.250000
ClearCr    208662.090909
CollgCr    201803.434457
Crawfor    207550.834951
Edwards    130843.381443
Gilbert    190646.575758
Greens     193531.250000
GrnHill    280000.000000
IDOTRR     103752.903226
Landmrk    137000.000000
MeadowV     95756.486486
Mitchel    162226.631579
NAmes      145097.349887
NPkVill    140710.869565
NWAmes     188406.908397
NoRidge    330319.126761
NridgHt    322018.265060
OldTown    123991.891213
SWISU      135071.937500
Sawyer     136751.152318
SawyerW    184070.184000
Somerst    229707.324176
StoneBr    324229.196078
Timber     246599.541667
Veenker    248314.583333
Name: SalePrice, dtype: float64

The result is a Series, labeled and sorted by the group name.

To sort this by SalePrice:

SalePriceByNbhd.sort_values()

⇒
Neighborhood
MeadowV     95756.486486
IDOTRR     103752.903226
BrDale     105608.333333
OldTown    123991.891213
BrkSide    124756.250000
Edwards    130843.381443
SWISU      135071.937500
Sawyer     136751.152318
Landmrk    137000.000000
NPkVill    140710.869565
Blueste    143590.000000
NAmes      145097.349887
Mitchel    162226.631579
SawyerW    184070.184000
NWAmes     188406.908397
Gilbert    190646.575758
Greens     193531.250000
Blmngtn    196661.678571
CollgCr    201803.434457
Crawfor    207550.834951
ClearCr    208662.090909
Somerst    229707.324176
Timber     246599.541667
Veenker    248314.583333
GrnHill    280000.000000
NridgHt    322018.265060
StoneBr    324229.196078
NoRidge    330319.126761
Name: SalePrice, dtype: float64			
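The related method .agg() can compute several summary statistics per group at once, returning a data frame with one column per statistic (a sketch):

housing.SalePrice.groupby(by=housing.Neighborhood).agg(['mean', 'median', 'count'])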

Data Conversion

For some calculations we may want to convert categorical variables that have an ordinal character into numeric values.

For example, the column Central Air has either yes (Y) or no (N) values, which we might want to convert to 1 and 0, respectively.

One fundamental Python data type that we haven’t discussed yet is the dictionary, which is a set of key-value pairs:

ac = {'N' : 0, 'Y' : 1}

The values are referenced with a key in square brackets:

ac['Y']

⇒
1

We can use dictionaries and the method .map() to convert categorical variables to numeric values, en masse:

housingAC = housing['Central Air'].map(ac)
housingAC

⇒
0       1
1       1
2       1
3       1
4       1
       ..
2925    1
2926    1
2927    1
2928    1
2929    1
Name: Central Air, Length: 2930, dtype: int64

We can then replace the original column with this new series:

housing['Central Air'] = housingAC
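To double-check the conversion, the method .value_counts() tallies how many times each distinct value occurs in a series:

housing['Central Air'].value_counts()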

Missing Values

Very commonly, data sets have missing, “blank”, or “null” values, where no information is provided; we can quickly check for these with the method .isnull(), which returns either True or False for each value.

The method .sum() will convert each True and False to 1 and 0, respectively, and then add them up, telling us how many null values each column has:

housing.isnull().sum()

⇒
Neighborhood     0
Overall Cond     0
Year Built       0
Total Bsmt SF    1
Central Air      0
Gr Liv Area      0
SalePrice        0
dtype: int64

We can see that Total Bsmt SF has one such null, which we can locate with a filter:

housing[housing['Total Bsmt SF'].isnull()]

⇒
     Neighborhood  Overall Cond  Year Built  Total Bsmt SF Central Air  \
1341      BrkSide             7        1946            NaN           Y   

      Gr Liv Area  SalePrice  
1341          896      79000  

The original input field was blank, which pd.read_csv() replaces with the special floating point constant numpy.nan, “Not a Number”, which Pandas represents as the text NaN.

In non-numeric columns, Pandas similarly marks blank fields as missing (displayed as NaN or None, depending on the column’s data type).

Pandas will ignore such values for most calculations, e.g. we saw above that:

housing['Total Bsmt SF'].mean()

⇒
1051.6145442130419

However, many other tools don’t handle nulls very well, so we will remove this particular row with the method .dropna():

housing = housing.dropna()
len(housing)

⇒
2929
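An alternative to dropping rows is to fill the missing values in, e.g. with the mean of the column; whether that is appropriate depends on the data. A sketch, which could be used in place of the .dropna() above:

housing['Total Bsmt SF'] = housing['Total Bsmt SF'].fillna(housing['Total Bsmt SF'].mean())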

Analyzing Tabular Data

The Scikit-Learn library provides the machine-learning tools that can analyze your data and build models for understanding them.


Machine Learning Setup

Going forward, we’ll make use of the package mlxtend (machine learning extensions), a Python library of useful tools for common data science tasks.

Switch to your shell program (Terminal on Macs, PowerShell on Windows) and run the following command:

conda install mlxtend

(If Conda cannot find the package in its default channels, try conda install -c conda-forge mlxtend.)


Linear Regression

Let’s do some basic modeling of this data set.

The simplest modeling of an observation is its mean value, e.g.:

SalePrice ~ Mean(SalePrice)

The symbol ~ is used to indicate this would not be an exact equality, because the data will scatter around the value on the right side.

The characteristics of the price of housing are:

housing.SalePrice.describe()
⇒
count      2930.000000
mean     180796.060068
std       79886.692357
min       12789.000000
25%      129500.000000
50%      160000.000000
75%      213500.000000
max      755000.000000
Name: SalePrice, dtype: float64

However, there is wide variation around the mean value, on the order of the standard deviation, which is 44% of the mean.

The idea behind regression analysis is that an observation will generally depend on one or more explanatory characteristics, such as the square footage of housing, and can be better described by them.

To visualize this, plot the two sets of data against each other (matplotlib doesn’t understand data frames, so we need to extract their values into Numpy arrays):

SalePrice = housing['SalePrice']
GrLivArea = housing['Gr Liv Area']
import matplotlib.pyplot as plt
plt.scatter(GrLivArea.values, SalePrice.values, label='Sale Price')
plt.plot([0, GrLivArea.max()], [SalePrice.mean()]*2, label='Mean')
plt.plot([0, GrLivArea.max()], [SalePrice.mean() + SalePrice.std()]*2, label='Mean + SD')
plt.plot([0, GrLivArea.max()], [SalePrice.mean() - SalePrice.std()]*2, label='Mean - SD')
plt.xlabel('Gr Liv Area')
plt.ylabel('SalePrice')
plt.legend()
plt.show()
plt.close()

We can see that there’s a lot of scatter around the mean value, and even outside of the bounds of plus-or-minus one standard deviation.

But there does appear to be an approximate linear relationship, which we can write as:

SalePrice ~ Slope * GrLivArea + Intercept

We can fit a line to these data points by minimizing the total distance between the points and the line, which is known as linear regression.

Commonly the square of the vertical distance is used (as opposed to the absolute value), which is known as ordinary least squares (OLS) regression.
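As a preview, Numpy can perform just such an ordinary least-squares fit directly on the two unscaled variables; np.polyfit returns the slope and intercept of the best-fit line:

slope, intercept = np.polyfit(housing['Gr Liv Area'], housing['SalePrice'], 1)
print(slope, intercept)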

To calculate this relationship, we’ll consider only the numeric columns, dropping the first column, Neighborhood:

housingNumeric = housing.iloc[:,1:]

It generally helps to visualize how this dependent variable is related to the possible independent variables by graphing them.

Since matplotlib doesn’t recognize the data frame type, we’ll extract these values into a Numpy array with the data frame property .values.

import matplotlib.pyplot as plt
from mlxtend.plotting import scatterplotmatrix
scatterplotmatrix(housingNumeric.values, figsize=(12, 10), names=housingNumeric.columns)
plt.tight_layout()
plt.show()
plt.close()

From these graphs, we can note linear relationships between SalePrice and most of these variables (the Count graphs are separate histograms of the variables).

Some of the other variables also appear correlated, e.g. Gr Liv Area and Total Bsmt SF, so only one of them may be needed to understand the SalePrice. We can test this hypothesis with the following heatmap of correlations between each pair of variables:

from mlxtend.plotting import heatmap
cm = np.corrcoef(housingNumeric.values.T)
hm = heatmap(cm, row_names=housingNumeric.columns, column_names=housingNumeric.columns)
plt.tight_layout()
plt.show()
plt.close()

The correlation numbers are 1 for perfectly correlated, 0 for no correlation, and -1 for perfectly oppositely correlated. The colors are chosen automatically and get progressively darker as the correlation decreases.

SalePrice is positively correlated with Gr Liv Area, Total Bsmt SF, Year Built, and Central Air.

It turns out that Gr Liv Area and Total Bsmt SF are not strongly correlated after all.

There is a slight negative correlation between Overall Cond and Year Built (perhaps due to shoddier construction in more recent years?).

Let’s first consider the simple relationship between SalePrice and Gr Liv Area.

We’ll begin by standardizing the two observations so that they have a mean of zero and a standard deviation of 1, namely:

X_std = (X - µ)/σ

The result is a dimensionless quantity that will allow better comparison of the different observations by placing them in the same range of values.
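In code, standardization is a one-line broadcasting expression. Note that the pandas method .std() divides by n - 1 (the sample standard deviation), so the result can differ very slightly from Scikit-Learn’s scaler below, which divides by n:

z = (SalePrice - SalePrice.mean()) / SalePrice.std()
print(z.mean(), z.std())   # approximately 0 and 1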

The Scikit-Learn package contains a standard scaler for this purpose. It can handle Pandas dataframes as input, but will generate numpy arrays on output, so let’s first distinguish the intended response variable, SalePrice, from the rest of the data:

X = housingNumeric.iloc[:,:5]
y = housing['SalePrice']

We can then import the scaling function and apply it:

from sklearn.preprocessing import scale
X_scl = scale(X)
y_scl = scale(y)

Checking the result:

y_scl.mean()
⇒ -1.1280381431493567e-16

y_scl.std()
⇒ 0.9999999999999999

The scaling is applied column-wise to the data frame:

X_scl[:,0].mean()
⇒ -3.857162683026833e-16

X_scl[:,0].std()
⇒ 1.0

Scikit-Learn’s LinearRegression function expects the response y as a one-dimensional array, which scale() returns here (for a column array, the method .flatten() would convert it to one). Since we are starting with Gr Liv Area alone, we select just its column (position 4) from the scaled array when fitting:

from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_scl[:, [4]], y_scl)

lr.coef_
⇒
array([0.70662822])

lr.intercept_
⇒
-1.4794494024145013e-16
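For a simple regression on standardized variables, the slope is exactly the Pearson correlation coefficient between the two variables, so we can double-check the result; this should reproduce the coefficient of about 0.7066 found above:

np.corrcoef(housing['Gr Liv Area'], housing['SalePrice'])[0, 1]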

Previous:
Simulating a Magnet

Scientific Programming with Python:
Analyzing Demographic Data

 
 
