In the natural sciences, tabular data results from all kinds of measurements in physics, chemistry, biology, geology, and environmental science. In the quantitative social sciences, from economics to sociology and psychology, survey and demographic data provide the background necessary to understand relationships between people, institutions, and geography. The Pandas package makes it much easier to work with such data in Python, and the statsmodels package provides statistical tools to analyze your data and basic modeling. The Scikit-Learn library provides additional tools that can build machine-learning models for predictive purposes.
Working with Tabular Data

The Pandas module provides new data types and specialized features for expressive handling of tabular data, even multidimensional data.

Ames Housing

As an example dataset, let’s consider a collection of data about housing in Ames, Iowa between 2006 and 2010, which you can view at this site: http://jse.amstat.org/v19n3/decock/AmesHousing.txt. Since it’s a text file, you’ll initially see the entire dataset, beginning with a header line of column names and then one line per house:

Order  PID  MS SubClass  MS Zoning  Lot Frontage  Lot Area  Street  Alley  Lot Shape  …  Mo Sold  Yr Sold  Sale Type  Sale Condition  SalePrice
1  0526301100  020  RL  141  31770  Pave  NA  IR1  …  5  2010  WD  Normal  215000
…

There are 80 characteristics, each separated by the tab character, represented in Python by '\t'. There are also 2930 listings, so it’s a good-sized dataset.
You can learn more about it in this project overview: http://jse.amstat.org/v19n3/decock.pdf and in this data description: http://jse.amstat.org/v19n3/decock/DataDocumentation.txt. Acknowledgment: This example is borrowed from the book Machine Learning with PyTorch and Scikit-Learn.

Pandas

We can pull this information into Python by first loading the Numpy and Pandas packages (the latter depends on the former):

import numpy as np
import pandas as pd

In Spyder, create a new blank document and paste these lines in; then save the document as AmesIowaHousing.py in your Scientific Python folder. Set this document’s location as the console working directory (using the menu ≡ Options in its upper right corner). Pandas is a full-fledged data manipulation extension for Python, and is very popular with data scientists. Here’s how it describes itself:

print(pd.__doc__) ⇒
pandas - a powerful data analysis and manipulation library for Python
=====================================================================
**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.
Main Features
-------------
Here are just a few of the things that pandas does well:
- Easy handling of missing data in floating point as well as non-floating
point data.
- Size mutability: columns can be inserted and deleted from DataFrame and
higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned
to a set of labels, or the user can simply ignore the labels and let
`Series`, `DataFrame`, etc. automatically align the data for you in
computations.
- Powerful, flexible group by functionality to perform split-apply-combine
operations on data sets, for both aggregating and transforming data.
- Make it easy to convert ragged, differently-indexed data in other Python
and NumPy data structures into DataFrame objects.
- Intelligent label-based slicing, fancy indexing, and subsetting of large
data sets.
- Intuitive merging and joining data sets.
- Flexible reshaping and pivoting of data sets.
- Hierarchical labeling of axes (possible to have multiple labels per tick).
- Robust IO tools for loading data from flat files (CSV and delimited),
Excel files, databases, and saving/loading data from the ultrafast HDF5
format.
- Time series-specific functionality: date range generation and frequency
conversion, moving window statistics, date shifting and lagging.
The full Pandas documentation is here: https://pandas.pydata.org/pandas-docs/stable/ . This Pandas Cheat Sheet is also really useful! We saw how Numpy makes it easier to work with tabular and other multidimensional data. Pandas makes it easy to filter such data based on its characteristics, and also makes it easier to work with categorical tabular or multidimensional data, including data with missing values (represented in Python by the value NaN). Pandas also provides a number of tools for importing data in standard file formats, such as CSV and tab-separated values like the Ames Housing data. Let’s import this set of housing data above into Pandas, directly from the Web, but limiting ourselves to a specific set of columns. Copy these lines into your Python file, and run the file:

columns = ['Neighborhood', 'Overall Cond', 'Gr Liv Area', 'Year Built',
'Central Air', 'Total Bsmt SF', 'SalePrice']
housing = pd.read_csv('http://jse.amstat.org/v19n3/decock/AmesHousing.txt',
sep = '\t', usecols = columns)
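If you’d like to experiment with these import options without downloading anything, read_csv accepts any file-like object. Here is a minimal sketch using a tiny synthetic tab-separated sample (invented values in the shape of the Ames file, not the real data):

```python
import io
import pandas as pd

# A tiny tab-separated sample in the same shape as the Ames file
# (synthetic values, for illustration only).
sample = (
    "Neighborhood\tYear Built\tSalePrice\n"
    "NAmes\t1960\t215000\n"
    "Gilbert\t1997\t189900\n"
)

# sep='\t' declares tab separation; usecols keeps only the named columns.
df = pd.read_csv(io.StringIO(sample), sep='\t',
                 usecols=['Neighborhood', 'SalePrice'])
print(df.shape)           # (2, 2)
print(list(df.columns))   # ['Neighborhood', 'SalePrice']
```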
By default, the method read_csv assumes that values are separated by commas. The keyword argument sep='\t' tells it that this file is tab-separated instead. The keyword argument usecols limits the import to the listed columns. Pandas imports such files into a new data type that it defines, the data frame:

type(housing) ⇒ pandas.core.frame.DataFrame

To find out more about your data, simply type its name, and Python will print its first few and last few rows and columns, and provide its overall size:

     Neighborhood Overall Cond Year Built ... Central Air Gr Liv Area SalePrice
0 NAmes 5 1960 ... Y 1656 215000
1 NAmes 6 1961 ... Y 896 105000
2 NAmes 6 1958 ... Y 1329 172000
3 NAmes 5 1968 ... Y 2110 244000
4 Gilbert 5 1997 ... Y 1629 189900
... ... ... ... ... ... ...
2925 Mitchel 6 1984 ... Y 1003 142500
2926 Mitchel 5 1983 ... Y 902 131000
2927 Mitchel 5 1992 ... Y 970 132000
2928 Mitchel 5 1974 ... Y 1389 170000
2929 Mitchel 5 1993 ... Y 2000 188000
[2930 rows x 7 columns]
The display is abbreviated by default. To see all of the columns, set this option:

pd.set_option("display.max_columns", 7)
housing
⇒
Neighborhood Overall Cond Year Built Total Bsmt SF Central Air \
0 NAmes 5 1960 1080.0 Y
1 NAmes 6 1961 882.0 Y
2 NAmes 6 1958 1329.0 Y
3 NAmes 5 1968 2110.0 Y
4 Gilbert 5 1997 928.0 Y
... ... ... ... ...
2925 Mitchel 6 1984 1003.0 Y
2926 Mitchel 5 1983 864.0 Y
2927 Mitchel 5 1992 912.0 Y
2928 Mitchel 5 1974 1389.0 Y
2929 Mitchel 5 1993 996.0 Y
Gr Liv Area SalePrice
0 1656 215000
1 896 105000
2 1329 172000
3 2110 244000
4 1629 189900
... ...
2925 1003 142500
2926 902 131000
2927 970 132000
2928 1389 170000
2929 2000 188000
[2930 rows x 7 columns]
Note how the first row of header information is used to label each column. There are 7 columns of data, five of which are numerical and two of which are qualitative and non-ordinal. In statistical language, the columns are known as variables, though data scientists instead call them features, probably to distinguish that concept from the computer use of the former term. However, this is confusing to geographers, who use the latter term for the rows in a dataset! We can also, more distinctly, call the rows observations. There are 2930 rows of data (i.e. the houses in the dataset). In statistical language, these are known as samples, one for each of the observations. Each row is labeled, by default, with a numerical index, starting with 0. Note that the labels and row numbers “frame” the data, hence the name DataFrame. The descriptive statistics of the numerical columns of a dataframe are provided by the method describe():

housing.describe() ⇒
Overall Cond Year Built Total Bsmt SF Gr Liv Area SalePrice
count 2930.000000 2930.000000 2929.000000 2930.000000 2930.000000
mean 5.563140 1971.356314 1051.614544 1499.690444 180796.060068
std 1.111537 30.245361 440.615067 505.508887 79886.692357
min 1.000000 1872.000000 0.000000 334.000000 12789.000000
25% 5.000000 1954.000000 793.000000 1126.000000 129500.000000
50% 5.000000 1973.000000 990.000000 1442.000000 160000.000000
75% 6.000000 2001.000000 1302.000000 1742.750000 213500.000000
max 9.000000 2010.000000 6110.000000 5642.000000 755000.000000
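The numbers describe() reports are ordinary summary statistics: the count of non-missing values, the mean, the sample standard deviation, the minimum and maximum, and the quartiles. A toy check (synthetic values, not the Ames data) against their definitions:

```python
import pandas as pd

# A toy series to show what describe() computes.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
stats = s.describe()

print(stats['count'])   # 5.0 non-missing values
print(stats['mean'])    # 3.0
print(stats['50%'])     # 3.0, the median
# The std row is the *sample* standard deviation (ddof=1):
print(stats['std'] == s.std(ddof=1))   # True
```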
The value count for Total Bsmt SF is 2929 rather than 2930 because one of its values is missing; we’ll deal with that below. The columns are themselves properties of the dataframe, and can be referenced directly by appending a dot and the column name:

housing.SalePrice ⇒
0 215000
1 105000
2 172000
3 244000
4 189900
...
2925 142500
2926 131000
2927 132000
2928 170000
2929 188000
Name: SalePrice, Length: 2930, dtype: int64
Because some of the column names have embedded spaces, we need to use the alternative index syntax to reference them, with the label as a character string inside of square brackets: housing['Gr Liv Area'] ⇒
0 1656
1 896
2 1329
3 2110
4 1629
...
2925 1003
2926 902
2927 970
2928 1389
2929 2000
Name: Gr Liv Area, Length: 2930, dtype: int64
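The two access styles return the same thing; dot access is merely limited to names that are valid Python identifiers. A toy sketch (invented values):

```python
import pandas as pd

# Toy frame: one column name with a space, one without.
df = pd.DataFrame({'SalePrice': [215000, 105000],
                   'Gr Liv Area': [1656, 896]})

# Attribute access works only for names that are valid identifiers:
a = df.SalePrice
# Bracket access works for any name, including those with spaces:
b = df['SalePrice']
c = df['Gr Liv Area']

print(a.equals(b))          # True: the same column either way
print(type(c).__name__)     # Series
```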
The individual columns are a one-dimensional datatype called a Series, which you can think of as a labeled list:

type(housing.SalePrice) ⇒ pandas.core.series.Series

Both series and dataframes are capable of holding any type of Python data (integers, strings, floating point numbers, Python objects, etc.). Because the rows are numbered, we can reference them with the property iloc:

housing.iloc[0] ⇒
Neighborhood      NAmes
Overall Cond          5
Year Built         1960
Total Bsmt SF    1080.0
Central Air           Y
Gr Liv Area        1656
SalePrice        215000
Name: 0, dtype: object

But this also means we can use slice terminology, e.g. to get the first two rows:

housing.iloc[0:2] ⇒
  Neighborhood  Overall Cond  Year Built  Total Bsmt SF Central Air  \
0        NAmes             5        1960         1080.0           Y
1        NAmes             6        1961          882.0           Y

   Gr Liv Area  SalePrice
0         1656     215000
1          896     105000

And to get one particular value, the columns also have a numerical index:

housing.iloc[1,3] ⇒ 882.0

But the corresponding series can also be indexed by row number:

housing['Total Bsmt SF'][1] ⇒ 882.0

To extract an entire series with iloc, slice over all of the rows:

housing.iloc[:,3] ⇒
0       1080.0
1        882.0
2       1329.0
3       2110.0
4        928.0
         ...
2925    1003.0
2926     864.0
2927     912.0
2928    1389.0
2929     996.0
Name: Total Bsmt SF, Length: 2930, dtype: float64

Manipulating Data

Common data tasks include sorting values:

housing.sort_values(by='SalePrice') ⇒
Neighborhood Overall Cond Year Built Total Bsmt SF Central Air \
181 OldTown 2 1923 678.0 N
1553 IDOTRR 5 1952 0.0 N
726 IDOTRR 5 1920 720.0 N
2843 Edwards 3 1922 498.0 N
2880 IDOTRR 3 1949 480.0 N
... ... ... ... ...
44 NridgHt 5 2009 2330.0 Y
1063 NridgHt 5 2003 2535.0 Y
2445 NoRidge 5 1995 1930.0 Y
1760 NoRidge 5 1996 2396.0 Y
1767 NoRidge 6 1994 2444.0 Y
Gr Liv Area SalePrice
181 832 12789
1553 733 13100
726 720 34900
2843 498 35000
2880 480 35311
... ...
44 2364 611657
1063 2470 615000
2445 3627 625000
1760 4476 745000
1767 4316 755000
[2930 rows x 7 columns]
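The sort_values method also accepts ascending=False for descending order, and a list of columns for tie-breaking. A toy sketch (invented values):

```python
import pandas as pd

df = pd.DataFrame({
    'Neighborhood': ['NAmes', 'Gilbert', 'NAmes'],
    'SalePrice': [215000, 189900, 105000],
})

# ascending=False reverses the order:
top = df.sort_values(by='SalePrice', ascending=False)
print(top.SalePrice.tolist())     # [215000, 189900, 105000]

# by= accepts a list: sort by neighborhood, then by price within each:
multi = df.sort_values(by=['Neighborhood', 'SalePrice'])
print(multi.SalePrice.tolist())   # [189900, 105000, 215000]
```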
Individual series are built on top of numpy arrays, so they can be used in broadcasting expressions, for example:

cheap = housing.SalePrice < 40000
cheap ⇒
0       False
1       False
2       False
3       False
4       False
        ...
2925    False
2926    False
2927    False
2928    False
2929    False
Name: SalePrice, Length: 2930, dtype: bool

We can then use such logical series (or any logical list of the same length) to filter the data frame by including it inside of square brackets:

housing[cheap]
⇒
Neighborhood Overall Cond Year Built Total Bsmt SF Central Air \
181 OldTown 2 1923 678.0 N
709 OldTown 6 1910 600.0 N
726 IDOTRR 5 1920 720.0 N
1553 IDOTRR 5 1952 0.0 N
1901 BrkSide 3 1946 0.0 N
2843 Edwards 3 1922 498.0 N
2880 IDOTRR 3 1949 480.0 N
Gr Liv Area SalePrice
181 832 12789
709 968 37900
726 720 34900
1553 733 13100
1901 334 39300
2843 498 35000
2880 480 35311
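Boolean masks can also be combined elementwise with & (and), | (or), and ~ (not); each comparison should be parenthesized because of operator precedence. A toy sketch (invented values):

```python
import pandas as pd

df = pd.DataFrame({
    'SalePrice': [12789, 37900, 215000, 34900],
    'Central Air': ['N', 'N', 'Y', 'N'],
})

cheap = df.SalePrice < 40000
no_ac = df['Central Air'] == 'N'

# Combine the masks elementwise; ~ negates a mask:
both = df[cheap & no_ac]
print(len(both))            # 3 cheap houses without central air
print(int((~cheap).sum()))  # 1 house at or above 40000
```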
Another common practice is to summarize data by some characteristic, such as the column Neighborhood. The method groupby collects rows by their value in that column, and an aggregation method such as mean then summarizes each group:

SalePriceByNbhd = housing.SalePrice.groupby(by=housing.Neighborhood).mean()

The result is a Series, labeled and sorted by the group name. To sort this by SalePrice:

SalePriceByNbhd.sort_values() ⇒
Neighborhood
MeadowV     95756.486486
IDOTRR     103752.903226
BrDale     105608.333333
OldTown    123991.891213
BrkSide    124756.250000
Edwards    130843.381443
SWISU      135071.937500
Sawyer     136751.152318
Landmrk    137000.000000
NPkVill    140710.869565
Blueste    143590.000000
NAmes      145097.349887
Mitchel    162226.631579
SawyerW    184070.184000
NWAmes     188406.908397
Gilbert    190646.575758
Greens     193531.250000
Blmngtn    196661.678571
CollgCr    201803.434457
Crawfor    207550.834951
ClearCr    208662.090909
Somerst    229707.324176
Timber     246599.541667
Veenker    248314.583333
GrnHill    280000.000000
NridgHt    322018.265060
StoneBr    324229.196078
NoRidge    330319.126761
Name: SalePrice, dtype: float64

Data Conversion

For some calculations we may want to convert categorical variables that have an ordinal character into numeric values. For example, the column Central Air holds the values Y and N, which we can convert to 1 and 0. One fundamental Python data type that we haven’t discussed yet is the dictionary, which is a set of key-value pairs:

ac = {'N' : 0, 'Y' : 1}
The values are referenced with a key in square brackets:

ac['Y'] ⇒ 1

We can use dictionaries and the series method map to convert each value in a column:

housingAC = housing['Central Air'].map(ac)
housingAC
⇒
0 1
1 1
2 1
3 1
4 1
..
2925 1
2926 1
2927 1
2928 1
2929 1
Name: Central Air, Length: 2930, dtype: int64
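On a toy series (invented values, not from the dataset), map also shows what happens to values that are missing from the dictionary: they become NaN, which forces the result to floating point.

```python
import pandas as pd

ac = {'N': 0, 'Y': 1}
# A toy series; '?' is deliberately absent from the dictionary.
s = pd.Series(['Y', 'N', 'Y', '?'])

mapped = s.map(ac)
# Unmapped values become NaN, so the dtype is float64 rather than int64.
print(mapped.isnull().sum())   # 1
print(mapped.iloc[0])          # 1.0
```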
We can then replace the original column with this new series:

housing['Central Air'] = housingAC

Missing Values

Very commonly, data sets can have missing, “blank”, or “null” values, where there is no information provided; we can quickly check for that with the method isnull. It returns a dataframe of booleans, and summing each column counts its missing entries:

housing.isnull().sum() ⇒
Neighborhood     0
Overall Cond     0
Year Built       0
Total Bsmt SF    1
Central Air      0
Gr Liv Area      0
SalePrice        0
dtype: int64

We can see that Total Bsmt SF has one missing value. To find the corresponding row, use the boolean column as a filter:

housing[housing['Total Bsmt SF'].isnull()] ⇒
Neighborhood Overall Cond Year Built Total Bsmt SF Central Air \
1341 BrkSide 7 1946 NaN Y
Gr Liv Area SalePrice
1341 896 79000
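An alternative to removing a row with a missing value is to fill it in. A minimal sketch on a toy series, using the median as one common (if crude) imputation choice:

```python
import numpy as np
import pandas as pd

# A toy series with one missing entry.
s = pd.Series([678.0, np.nan, 720.0])

# fillna replaces missing entries; here with the median of the known values:
filled = s.fillna(s.median())
print(filled.isnull().sum())   # 0
print(filled.iloc[1])          # 699.0, the median of 678 and 720
```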
The original input field was blank, which Pandas represents with the floating-point value NaN (“not a number”). For non-floating-point columns, Pandas will likewise substitute NaN for blank fields, converting the column to floating point or object type as necessary. Pandas will ignore such values for most calculations, e.g. we saw above that:

housing['Total Bsmt SF'].mean() ⇒ 1051.6145442130419

However, many other tools don’t handle nulls very well, so we will remove this particular row with the method dropna:

housing = housing.dropna()
len(housing) ⇒ 2929

Analyzing Tabular Data

The Scikit-Learn library provides machine-learning tools that can analyze your data and build models for understanding it.

Machine Learning Setup

Going forward, we’ll make use of the package mlxtend (machine learning extensions), a Python library of useful tools for common data science tasks. Switch to your shell program (Terminal on Macs, PowerShell on Windows) and run the following command:

conda install mlxtend

Linear Regression

Let’s do some basic modeling of this data set. The simplest model of an observation is its mean value, e.g.:

SalePrice ~ Mean(SalePrice)

The symbol ~ can be read as “is modeled by”. The characteristics of the price of housing are:

housing.SalePrice.describe() ⇒
count      2930.000000
mean     180796.060068
std       79886.692357
min       12789.000000
25%      129500.000000
50%      160000.000000
75%      213500.000000
max      755000.000000
Name: SalePrice, dtype: float64

However, there is wide variance around the mean value, on the order of the standard deviation, which is 44% of the mean. The idea behind regression analysis is that an observation will generally depend on one or more explanatory characteristics, such as the square footage of housing, and can be better described by them. To visualize this, plot the two sets of data against each other (SalePrice vs. Gr Liv Area):
SalePrice = housing['SalePrice']
GrLivArea = housing['Gr Liv Area']
import matplotlib.pyplot as plt
plt.scatter(GrLivArea.values, SalePrice.values, label='Sale Price')
plt.plot([0, GrLivArea.max()], [SalePrice.mean()]*2, label='Mean')
plt.plot([0, GrLivArea.max()], [SalePrice.mean() + SalePrice.std()]*2, label='Mean + SD')
plt.plot([0, GrLivArea.max()], [SalePrice.mean() - SalePrice.std()]*2, label='Mean - SD')
plt.xlabel('Gr Liv Area')
plt.ylabel('SalePrice')
plt.legend()
plt.show()
plt.close()
We can see that there’s a lot of scatter around the mean value, and even outside of the bounds of plus-or-minus one standard deviation. But there does appear to be an approximate linear relationship, which we can write as:

SalePrice ~ Slope * GrLivArea + Intercept

We can fit a line to these data points by minimizing the total distance between them, which is known as linear regression. Commonly the square of the distance is used (as opposed to the absolute value), which is known as ordinary least squares regression (OLS). To calculate this relationship, we’ll consider only the numeric columns, dropping the first column, Neighborhood:

housingNumeric = housing.iloc[:,1:]

It generally helps to visualize how this dependent variable is related to the possible independent variables by graphing them. Since matplotlib doesn’t recognize the data frame type, we’ll extract these values into a Numpy array with the data frame property values:

import matplotlib.pyplot as plt
from mlxtend.plotting import scatterplotmatrix
scatterplotmatrix(housingNumeric.values, figsize=(12, 10),
                  names=housingNumeric.columns)
plt.tight_layout()
plt.show()
plt.close()
From these graphs, we can note linear relationships between SalePrice and both Gr Liv Area and Total Bsmt SF. Some of the other variables also appear correlated, e.g. Gr Liv Area and Total Bsmt SF. We can quantify this with a correlation matrix, displayed as a heat map:

from mlxtend.plotting import heatmap
cm = np.corrcoef(housingNumeric.values.T)
hm = heatmap(cm, row_names=housingNumeric.columns,
             column_names=housingNumeric.columns)
plt.tight_layout()
plt.show()
plt.close()

The correlation numbers are 1 for perfectly correlated, 0 for no correlation, and -1 for perfectly oppositely correlated. The colors are chosen automatically and get progressively darker as the correlation decreases.
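The meaning of those correlation numbers can be checked with np.corrcoef directly; a toy sketch with a perfectly correlated and a perfectly anti-correlated pair of variables:

```python
import numpy as np

# Toy variables, one observation per column of the stacked array.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1   # perfectly (positively) correlated with x
z = -x          # perfectly anti-correlated with x

# corrcoef treats each row as one variable and returns the matrix
# of pairwise correlation coefficients.
cm = np.corrcoef(np.vstack([x, y, z]))
print(round(cm[0, 1], 6))   # 1.0
print(round(cm[0, 2], 6))   # -1.0
```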
It turns out that SalePrice is most strongly correlated with Gr Liv Area. There is also a slight negative correlation between Overall Cond and SalePrice. Let’s first consider the simple relationship between SalePrice and Gr Liv Area. We’ll begin by standardizing the two observations so that they have a mean of zero and a standard deviation of 1, namely:

X_std = (X - µ)/σ

The result is a dimensionless quantity that will allow better comparison of the different observations by placing them in the same range of values. The Scikit-Learn package contains a standard scaler for this purpose. It can handle Pandas dataframes as input, but will generate numpy arrays on output, so let’s first distinguish the explanatory variable, X, from the intended response variable, y:

X = housingNumeric.iloc[:,4:5]   # the Gr Liv Area column
y = housing['SalePrice']

We can then import the scaling function and apply it:

from sklearn.preprocessing import scale
X_scl = scale(X)
y_scl = scale(y)

Checking the result:

y_scl.mean() ⇒ -1.1280381431493567e-16
y_scl.std() ⇒ 0.9999999999999999

The scaling is applied column-wise to the data frame:

X_scl[:,0].mean() ⇒ -3.857162683026833e-16
X_scl[:,0].std() ⇒ 1.0

The method fit of Scikit-Learn’s LinearRegression class computes the OLS coefficients:

from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_scl, y_scl)
lr.coef_ ⇒ array([0.70662822])
lr.intercept_ ⇒ -1.4794494024145013e-16
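With a single standardized predictor, the OLS slope reduces to the Pearson correlation coefficient, which is why the fitted coefficient matches the SalePrice / Gr Liv Area entry of the correlation matrix and the intercept is zero to within floating-point error. A quick check on synthetic data (not the Ames set):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)

# Standardize both (population std, as sklearn's scale() does):
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()

# OLS slope for one standardized predictor = sum(xs*ys)/sum(xs*xs),
# which reduces to the Pearson correlation coefficient:
slope = (xs * ys).sum() / (xs * xs).sum()
r = np.corrcoef(x, y)[0, 1]
print(abs(slope - r) < 1e-9)   # True
```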