R

dB

R is a language and environment for statistical computing and graphics.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible.
One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:
 An effective data handling and storage facility.
 A suite of operators for calculations on arrays, in particular matrices.
 A large, coherent, integrated collection of intermediate tools for data analysis.
 Graphical facilities for data analysis and display either on-screen or on hardcopy.
 A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

To see how it works, check out the Try R course from Code School: http://tryr.codeschool.com/

Shelter Animal Outcomes

Help improve outcomes for shelter animals

Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. 2.7 million dogs and cats are euthanized in the US every year.

Using a dataset of intake information, including breed, colour, sex, and age, from the Austin Animal Centre (http://www.austintexas.gov/department/animal-services), the goal is to predict the outcome for each animal.

This dataset can help to understand trends in animal outcomes. These insights could help shelters focus their energy on specific animals who need a little extra help finding a new home.

The randomForest classification algorithm is used to predict shelter outcomes. Before modelling, it is useful to see how the outcomes are distributed for the 4800 cats and 6656 dogs in the training set.
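The plotting code in this post expects a summary data frame, outcomes, with one row per combination of AnimalType and OutcomeType and a num_animals count; the post does not show how it is built. A minimal sketch of one way to produce such a table in base R, using a small made-up data frame (toy is hypothetical and is not the Kaggle data):

```r
# Hypothetical toy records standing in for the real training data
toy <- data.frame(
  AnimalType  = c("Cat", "Cat", "Dog", "Dog", "Dog", "Cat"),
  OutcomeType = c("Adoption", "Transfer", "Adoption",
                  "Return_to_owner", "Adoption", "Transfer")
)

# Cross-tabulate, then reshape to long form with a num_animals count column
outcomes <- as.data.frame(
  table(AnimalType = toy$AnimalType, OutcomeType = toy$OutcomeType),
  responseName = "num_animals"
)

print(outcomes)
```

Because the plots use position = 'fill', only the relative counts within each bar matter, so raw counts are enough and no proportions need to be computed beforehand.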

# Plot (assumes `outcomes` holds counts per AnimalType and OutcomeType)
library(ggplot2)
library(ggthemes)  # provides theme_few()

ggplot(outcomes, aes(x = AnimalType, y = num_animals, fill = OutcomeType)) +
  geom_bar(stat = 'identity', position = 'fill', colour = 'black') +
  coord_flip() +
  labs(y = 'Proportion of Animals',
       x = 'Animal',
       title = 'Outcomes: Cats & Dogs') +
  theme_few()

[Figure: Outcomes: Cats & Dogs]

Both cats and dogs are commonly adopted or transferred (cats more so), but dogs are much more likely to be returned to their owners than cats. It also appears that cats are more likely to have died compared to dogs. Fortunately, it appears very few animals die or get euthanized overall.

# Plot
ggplot(daytimes, aes(x = TimeofDay, y = num_animals, fill = OutcomeType)) +
  geom_bar(stat = 'identity', position = 'fill', colour = 'black') +
  facet_wrap(~AnimalType) +
  coord_flip() +
  labs(y = 'Proportion of Animals',
       x = 'Time of Day',
       title = 'Outcomes by Time of Day: Cats & Dogs') +
  theme_few()

[Figure: Outcomes by Time of Day: Cats & Dogs]

This plot suggests that dogs are most often euthanized in the morning.
The Breed variable has far too many levels: 1678, to be exact. This is dealt with in part by contrasting mixes with non-mixes. In addition, strsplit and gsub are used to keep just the first breed when multiple breeds are separated by “/” and to remove “Mix” from the breed name.
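The breed clean-up described above can be sketched roughly as follows. The sample values and the names is_mix and simple_breed are illustrative assumptions, and flagging a mix by the word "Mix" alone is one plausible reading of the text, not necessarily the exact rule used for the IsMix variable:

```r
# Hypothetical sample of Breed values (not taken from the dataset)
breed <- c("Shetland Sheepdog Mix", "Domestic Shorthair Mix",
           "Pit Bull/Labrador Retriever", "Siamese")

# Contrast mixes with non-mixes: here a mix is any breed containing "Mix"
is_mix <- grepl("Mix", breed, fixed = TRUE)

# Keep only the first breed when several are separated by "/",
# then strip the trailing " Mix"
simple_breed <- vapply(
  strsplit(breed, split = "/", fixed = TRUE),
  function(parts) gsub(" Mix", "", parts[1], fixed = TRUE),
  character(1)
)

print(data.frame(breed, is_mix, simple_breed))
```

Collapsing breeds this way trades detail for a much smaller number of factor levels, which matters because randomForest limits how many levels a categorical predictor may have.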

# Plot
ggplot(intact, aes(x = Intact, y = num_animals, fill = OutcomeType)) +
geom_bar(stat = ‘identity’, position = ‘fill’, colour = ‘black’) +
facet_wrap(~AnimalType) +
coord_flip() +
labs(y = ‘Proportion of Animals’,
x = ‘Animal’,
title = ‘Outcomes by Intactness: Cats & Dogs’) +
theme_few()

[Figure: Outcomes by Intactness: Cats & Dogs]

Animals are much more likely to be adopted if they’ve been neutered. Smaller proportions of neutered animals end up euthanized or dying.
Because there were 24 missing values of AgeinDays, the next step used the rpart function to fit a decision tree predicting an animal’s age in days from the new variables that were created or fixed.
Temporal factors are unlikely to have anything to do with animals’ ages at their respective outcomes, so not every variable is thrown into this model.
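A minimal sketch of this kind of tree-based imputation, assuming a small synthetic data frame (toy) and an illustrative choice of predictors; the post does not list which variables went into the actual rpart model:

```r
library(rpart)

set.seed(1)

# Hypothetical toy data; column names mirror the post, values are synthetic
n <- 200
toy <- data.frame(
  AnimalType = factor(sample(c("Cat", "Dog"), n, replace = TRUE)),
  Intact     = factor(sample(c("0", "1", "Unknown"), n, replace = TRUE)),
  HasName    = factor(sample(c(0, 1), n, replace = TRUE))
)
toy$AgeinDays <- round(ifelse(toy$Intact == "1", 120, 900) + rnorm(n, sd = 30))

# Knock out a few ages to imitate the missing values
toy$AgeinDays[sample(n, 10)] <- NA
missing <- is.na(toy$AgeinDays)

# Fit a regression tree on the rows where age is known
age_fit <- rpart(AgeinDays ~ AnimalType + Intact + HasName,
                 data = toy[!missing, ],
                 method = "anova")

# Fill in the missing ages with the tree's predictions
toy$AgeinDays[missing] <- predict(age_fit, toy[missing, ])

print(sum(is.na(toy$AgeinDays)))
```

Temporal variables such as Hour or Weekday are deliberately left out of the formula, in line with the reasoning above.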

# Plot in ggplot2
ggplot(full[1:26729, ], aes(x = Lifestage, fill = OutcomeType)) +
  geom_bar(position = 'fill', colour = 'black') +
  labs(y = 'Proportion', title = 'Animal Outcome: Babies versus Adults') +
  theme_few()

[Figure: Animal Outcome: Babies versus Adults]

Unsurprisingly, baby animals are more likely to be adopted than adult animals. They are also more likely to be transferred and to have died.

# Split up train and test data
train <- full[1:26729, ]
test  <- full[26730:nrow(full), ]

# Set a random seed
set.seed(731)

# Build the model
library(randomForest)

rf_mod <- randomForest(OutcomeType ~ AnimalType + AgeinDays + Intact + HasName + Hour +
                         Weekday + TimeofDay + SimpleColor + IsMix + Sex + Month,
                       data = train,
                       ntree = 600,
                       importance = TRUE)

# Show model error
plot(rf_mod, ylim=c(0,1))
legend('topright', colnames(rf_mod$err.rate), col = 1:6, fill = 1:6)

[Figure: rf_mod error rates by number of trees]

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance, aes(x = reorder(Variables, Importance),
                           y = Importance)) +
  geom_bar(stat = 'identity', colour = 'black') +
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
            hjust = 0, vjust = 0.55, size = 4, colour = 'lavender',
            fontface = 'bold') +
  labs(x = 'Variables', title = 'Relative Variable Importance') +
  coord_flip() +
  theme_few()

[Figure: Relative Variable Importance]

The most important variable for predicting the outcomes of shelter animals is AgeinDays, not Intact. Hour of the day and weekday also rank fairly high. SimpleColor is ranked above Sex and IsMix.
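The importance plot relies on a data frame, rankImportance, with the columns Variables, Importance and Rank, which the post does not show being built. One possible way to derive it from a fitted forest is sketched below. It uses the built-in iris data as a stand-in for the shelter data, and the choice of MeanDecreaseGini as the importance measure is an assumption:

```r
library(randomForest)

set.seed(731)

# Stand-in model on iris; in the post this would be rf_mod
rf_demo <- randomForest(Species ~ ., data = iris,
                        ntree = 100, importance = TRUE)

# importance() returns a matrix with one row per predictor
imp <- importance(rf_demo)

varImportance <- data.frame(
  Variables  = row.names(imp),
  Importance = round(imp[, "MeanDecreaseGini"], 2)
)

# Sort from most to least important and attach a rank label such as "#1"
rankImportance <- varImportance[order(-varImportance$Importance), ]
rankImportance$Rank <- paste0("#", seq_len(nrow(rankImportance)))

print(rankImportance)
```

Replacing rf_demo with rf_mod should yield a table in the shape the ggplot call expects, although the ranking itself depends on which importance measure is chosen.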

References

Kaggle Team (2016) ‘Shelter Animal Outcomes’ [Online]. Kaggle. Available from: https://www.kaggle.com/c/shelter-animal-outcomes [Accessed 20th July 2016].

Hornik, K. (2016) ‘Frequently Asked Questions on R’ [Online]. R FAQ. Available from: https://cran.r-project.org/doc/FAQ/R-FAQ.html [Accessed 21st July 2016].

O’Reilly (2016) ‘Cook R Graphics’ [Online]. O’Reilly. Available from: http://www.cookbook-r.com/Graphs/ [Accessed 26th July 2016].

O’Reilly (2016) ‘Try R’ [Online]. O’Reilly. Available from: http://tryr.codeschool.com/ [Accessed 21st July 2016].

The R Foundation (2016) ‘What is R?’ [Online]. The R Foundation. Available from: https://www.r-project.org/about.html [Accessed 21st July 2016].
