 # Sparse Matrix

## What is a sparse matrix?

A sparse matrix is a special case of a matrix in which the number of zero elements is much higher than the number of non-zero elements. As a rule of thumb, if 2/3 of the total elements in a matrix are zeros, it can be called a sparse matrix. Using a sparse matrix representation — where only the non-zero values are stored — the space used for representing data and the time for scanning the matrix are reduced significantly.

Many applications in data science and machine learning involve sparse matrices, such as:

• Natural Language Processing: The occurrence of words in documents can be represented in a sparse matrix. The words in a document are only a small subset of words in a language. If we have a row for every document and a column for every word, then storing the number of word occurrences in a document has a high percentage of zeros in every column.
• Recommendation Systems: A sparse matrix can be employed to represent whether any particular user has watched any movie. See our Locality Sensitive Hashing (LSH) article for an example.
• Market Basket Analysis: Since the number of purchased items is tiny compared to the number of non-purchased items, a sparse matrix is used to represent all products and customers.

## Numerical example 1

Let's take the example of a movie recommendation system. There are millions of users and thousands of movies, so it's not possible for users to watch and rate all movies. This data can be represented as a matrix where the rows are the users, and the columns are the movies.

Most of the matrix elements will be empty, where the missing values will be replaced with zeros. Since a small percentage of the matrix has non-zero values, this matrix can be considered a sparse matrix. A small portion of the data is given below:

Movie1Movie2Movie3Movie4Movie5Movie6Movie7
User10003004
User20500000
User30050040
User44000001
User50200300

The sparsity of this matrix can be calculated by obtaining the ratio of zero elements to total elements. For this example, sparsity is calculated as:

\begin{align} sparsity &= \frac {n_{zeros}}{n_{total}} \\[.5em] &= \frac{26}{35} \\[.5em] &= 0.742 \end{align}

It can be seen that the number of zeros in a sparse matrix is very high. Representing all zero values in a matrix like this would result in high memory usage, so in practice, only non-zero values of the sparse matrix are stored.

## Numerical example 2

Another example would be to use a matrix to represent the occurrence of words in documents. The term-document matrix dimension will be $n \times m$, where $n$ is the number of documents and $m$ is the number of words in the language model. As a result, most of the matrix elements will be zero since only non-zero values are important for data analysis. In addition to a large amount of space used, there will be a computational time problem because all elements will be scanned to access non-zero elements. This process yields a computational complexity problem.

To overcome these problems, we can use different data structures to represent a sparse matrix. One common representation format for a sparse matrix is a Coordinate list (COO), which uses three-element tuples to store non-zero values' coordinates in a matrix. For example, the following table can be constructed to represent a sparse term-document matrix:

RowColumnValue
033
064
115
225
254
304
361
412
443

In this table, indices of rows and columns of non-zero values are stored in a sparse representation. Let $k$ be the number of non-zero elements in a matrix of size $n \times m$, then the proportion of the space saved by sparse matrix representation can simply be calculated as follows:

$$p = 1- \frac{3k}{nm}$$

The space gained by a sparse matrix representation is directly proportional to the sparsity value.

There are many other ways to represent a sparse matrix, such as Dictionary of keys (DOK) and List of lists (LIL). In the following section, different representation formats will be explained with Python.

## Sparse Matrix in Python

The Scipy library provides the scipy.sparse package to create and manipulate sparse matrix (https://docs.scipy.org/doc/scipy/reference/sparse.html. Different representation formats and some useful functions for sparse matrices are defined in this package.

Here we will explore some basic functions.

### A simple, sparse matrix

A simple, sparse matrix will be constructed to show the representation formats of a sparse matrix in Python.

import numpy as np
from scipy import sparse
X = np.array([[0,0,0,3,0,0,4],
[0,5,0,0,0,0,0],
[0,0,5,0,0,4,0],
[4,0,0,0,0,0,1],
[0,2,0,0,3,0,0]])
print(X)
Out:
[[0 0 0 3 0 0 4]
[0 5 0 0 0 0 0]
[0 0 5 0 0 4 0]
[4 0 0 0 0 0 1]
[0 2 0 0 3 0 0]]

There's many zeros in this matrix, so let's calculate the sparsity of the matrix:

sparsity = 1.0 - (np.count_nonzero(X) / X.size)
print('The sparsity of X is ', sparsity )
Out:
The sparsity of X is  0.7428571428571429

We can convert this dense matrix into a sparse matrix by using the sparse.csr_matrix()function. The row/column indices of non-zero values are stored in a Compressed Sparse Row (CSR) matrix:

# Convert X to a sparse matrix

S1 = sparse.csr_matrix(X)

print(f"""
Type of sparse matrix representation: {type(S1)}

Sparse Matrix:\n{S1}

Sparse Data: {S1.data}

Indices of columns: {S1.indices}

Pointers for data: {S1.indptr}
""")
Out:
Type of sparse matrix representation: <class 'scipy.sparse.csr.csr_matrix'>

Sparse Matrix:
(0, 3)	3
(0, 6)	4
(1, 1)	5
(2, 2)	5
(2, 5)	4
(3, 0)	4
(3, 6)	1
(4, 1)	2
(4, 4)	3

Sparse Data: [3 4 5 5 4 4 1 2 3]

Indices of columns: [3 6 1 2 5 0 6 1 4]

Pointers for data: [0 2 3 5 7 9]

Another efficient structure for constructing sparse matrices is the Dictionary Of Keys (DOK), where a python dictionary is used to represent non-zero values for a sparse matrix.

In this representation, keys() is used for indices, and values() is used for values of non-zero elements:

S2 = sparse.dok_matrix(X)

print(f"""
Type of sparse matrix representation: {type(S2)}

Sparse Matrix:\n{S2}

Keys in dictionary: {S2.keys()}

Values in dictionary: {S2.values()}
""")
Out:
Type of sparse matrix representation: <class 'scipy.sparse.dok.dok_matrix'>

Sparse Matrix:
(0, 3)	3
(0, 6)	4
(1, 1)	5
(2, 2)	5
(2, 5)	4
(3, 0)	4
(3, 6)	1
(4, 1)	2
(4, 4)	3

Keys in dictionary: dict_keys([(0, 3), (0, 6), (1, 1), (2, 2), (2, 5), (3, 0), (3, 6), (4, 1), (4, 4)])

Values in dictionary: dict_values([3, 4, 5, 5, 4, 4, 1, 2, 3])

The last representation format shown here is a row-based list of lists sparse (LIL) matrix. The first list stores column indices for each row, and the second list is used to store the element's row values.

S3 = sparse.lil_matrix(X)

print(f"""
Type of sparse matrix representation: {type(S3)}

Sparse Matrix:\n{S3}

Lists for rows: {S3.rows}

Lists for columns: {S3.data}
""")
Out:
Type of sparse matrix representation: <class 'scipy.sparse.lil.lil_matrix'>

Sparse Matrix:
(0, 3)	3
(0, 6)	4
(1, 1)	5
(2, 2)	5
(2, 5)	4
(3, 0)	4
(3, 6)	1
(4, 1)	2
(4, 4)	3

Lists for rows: [list([3, 6]) list() list([2, 5]) list([0, 6]) list([1, 4])]

Lists for columns: [list([3, 4]) list() list([5, 4]) list([4, 1]) list([2, 3])]

In scipy.sparse package, there is also a todense()function for converting a sparse matrix to a dense matrix:

# Convert the sparse matrix to a dense matrix
X = S1.todense()

print(X)
Out:
[[0 0 0 3 0 0 4]
[0 5 0 0 0 0 0]
[0 0 5 0 0 4 0]
[4 0 0 0 0 0 1]
[0 2 0 0 3 0 0]]

This can be useful when exploring data, but if your dataset is large, the dense matrix version won't fit in memory and may cause an error.

### Sparse Matrix from a real dataset

We'll use the newsgroups dataset, available directly from sklearn, to show an example of when a sparse matrix would be used. This dataset contains thousands of news posts on 20 different topics.

A document can be represented as a term vector, where each term is a feature and the value is the number of times the corresponding term occurs in the document. In this way, a matrix can be used to represent word occurrences in all documents, where the documents are the rows and the terms are the columns. Since the number of non-zero values will be small, this matrix can be represented as a sparse matrix.

Let's import and load the dataset from sklearn:

from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train', categories= ['sci.electronics', 'sci.space'])

Previewing an example entry in this dataset:

newsgroups_train.data
Out:
"From: cab@col.hp.com (Chris Best)\nSubject: Re: Food Dehydrators\nOrganization: your service\nLines: 10\nDistribution: usa\nNNTP-Posting-Host: hpctdkz.col.hp.com\n\n>   Does anybody out there have one of those food dehydrators I've been seeing\n> all over late-night TV recently? I was wondering if they use forced air, heat,\n> or both. If there's heat involved, anybody know what temperature they run at?\n> My wife would like one and I'm not inclined to pay >\$100.00 for a box, a fan\n> and a heater. Seems to me you should be able to throw a dehydrator together\n> for just a few bucks. Heck, the technology is only what? 1,000 years old?\n\n----------\n\nYeah, but 1000 years ago, you couldn't buy it from a guy with sprayed-on hair!\n"

Now we'll use CountVectorizer to vectorize the text into a term-document matrix:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
cv.fit(newsgroups_train.data)

# Create a term-document matrix
word_matrix = cv.transform(newsgroups_train.data)
print(f'Type of the matrix is: {type(word_matrix)}\n')

print(f'Matrix:\n{word_matrix}')
Out:
Type of the matrix is: <class 'scipy.sparse.csr.csr_matrix'>

Matrix:
(0, 0)	1
(0, 1)	1
(0, 187)	1
(0, 188)	1
(0, 189)	1
(0, 3013)	1
(0, 3387)	1
(0, 3424)	1
(0, 3502)	1
(0, 3688)	2
(0, 3794)	2
(0, 4144)	1
(0, 4514)	1
(0, 4557)	1
(0, 4650)	1
(0, 4936)	1
(0, 4962)	1
(0, 5127)	1
(0, 5225)	1
(0, 5234)	1
(0, 5360)	1
(0, 5912)	1
(0, 6180)	2
(0, 6237)	2
(0, 6822)	1
:	:
(1183, 20345)	1
(1183, 20370)	1
(1183, 20381)	1
(1183, 20386)	1
(1183, 20398)	2
(1183, 20416)	1
(1183, 20509)	1
(1183, 20548)	11
(1183, 20975)	1
(1183, 21002)	2
(1183, 21077)	2
(1183, 21224)	1
(1183, 21305)	1
(1183, 21344)	3
(1183, 21345)	2
(1183, 21358)	1
(1183, 21541)	1
(1183, 21892)	1
(1183, 22075)	1
(1183, 22096)	1
(1183, 22139)	1
(1183, 22233)	2
(1183, 22318)	3
(1183, 22347)	1
(1183, 22478)	5
print(f'Size of the matrix is: {word_matrix.shape}')
Out:
Size of the matrix is: (1184, 22577)
print(f'Number of the Non-zero values: {word_matrix.nnz}')
Out:
Number of the Non-zero values: 174709

Next, we'll can calculate the sparsity:

sparsity = word_matrix.nnz / (word_matrix.shape * word_matrix.shape)

print('Sparsity value: ', sparsity)
Out:
Sparsity value:  0.006535778758339329

Convert the sparse matrix to a dense matrix:

D = word_matrix.todense()

print(D)
Out:
[[1 1 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 2 0 ... 0 0 0]
[0 1 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]

With the dense version we can calculate the proportion of space saved by sparse matrix represantation:

r = 1 - (3 * np.count_nonzero(D)) / D.size

print('The proportion of saved space is ', r)
Out:
The proportion of saved space is  0.980392663724982 