*Try R*

R is a language and environment for statistical computing and graphics.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:

An effective data handling and storage facility.

A suite of operators for calculations on arrays, in particular matrices.

A large, coherent, integrated collection of intermediate tools for data analysis.

Graphical facilities for data analysis and display either on-screen or on hardcopy.

A well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

*To see how it works, check out the Try R course from Code School: http://tryr.codeschool.com/*

*Shelter Animal Outcomes*

*Help improve outcomes for shelter animals*

Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. 2.7 million dogs and cats are euthanized in the US every year.

Based on a dataset of intake information, including breed, colour, sex, and age, from the Austin Animal Center (http://www.austintexas.gov/department/animal-services), the goal is to predict the outcome for each animal.

This dataset can help to understand trends in animal outcomes. These insights could help shelters focus their energy on specific animals who need a little extra help finding a new home.

The `randomForest` classification algorithm is used to predict shelter outcomes. It is important to first see how the outcomes are distributed for the *4800 cats* and *6656 dogs* in the training set.
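The plotting code in this post expects a summary data frame called `outcomes` with one row per animal type and outcome, and a `num_animals` count column. How that data frame was built is not shown here, so the following is only a minimal sketch in base R, using a small invented `train_demo` data frame in place of the real training set:

```r
# Toy stand-in for the training data (values invented for illustration only)
train_demo <- data.frame(
  AnimalType  = c("Cat", "Cat", "Cat", "Dog", "Dog", "Dog", "Dog"),
  OutcomeType = c("Adoption", "Transfer", "Adoption",
                  "Adoption", "Return_to_owner", "Transfer", "Adoption")
)

# Cross-tabulate animal type against outcome, then reshape the table into
# a long data frame with a num_animals column, as the plot code expects
outcomes <- as.data.frame(
  table(AnimalType = train_demo$AnimalType,
        OutcomeType = train_demo$OutcomeType),
  responseName = "num_animals"
)

print(outcomes)
```

Combinations that never occur (for example, a cat returned to its owner in this toy data) appear with a count of zero, which is harmless when the bars are drawn as proportions.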

    # Plot
    library(ggplot2)
    library(ggthemes)  # provides theme_few()

    ggplot(outcomes, aes(x = AnimalType, y = num_animals, fill = OutcomeType)) +
      geom_bar(stat = 'identity', position = 'fill', colour = 'black') +
      coord_flip() +
      labs(y = 'Proportion of Animals',
           x = 'Animal',
           title = 'Outcomes: Cats & Dogs') +
      theme_few()

Both cats and dogs are commonly adopted or transferred (cats more so), but dogs are much more likely to be returned to their owners than cats. It also appears that cats are more likely to have died compared to dogs. Fortunately, it appears very few animals die or get euthanized overall.

    # Plot
    ggplot(daytimes, aes(x = TimeofDay, y = num_animals, fill = OutcomeType)) +
      geom_bar(stat = 'identity', position = 'fill', colour = 'black') +
      facet_wrap(~AnimalType) +
      coord_flip() +
      labs(y = 'Proportion of Animals',
           x = 'Time of Day',
           title = 'Outcomes by Time of Day: Cats & Dogs') +
      theme_few()

The main takeaway from this plot is that dogs are most often euthanized in the morning.

The `Breed` variable has far too many levels: 1678, to be exact. This is dealt with in part by contrasting mixes with non-mixes. `strsplit` and `gsub` are also used to keep just the first breed when multiple breeds are separated by “/”, and to remove “Mix” from the breed name.
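The breed clean-up described above can be sketched in base R as follows. The breed strings are invented examples, and the names `is_mix` and `simple_breed` are illustrative rather than the exact ones used in the original analysis:

```r
# Example breed strings (invented for illustration)
breeds <- c("Labrador Retriever Mix", "Domestic Shorthair Mix",
            "Pit Bull/Boxer", "Siamese")

# Flag mixes: 1 if the breed string contains "Mix", 0 otherwise
is_mix <- as.integer(grepl("Mix", breeds))

# Keep only the first breed before any "/" and drop the " Mix" suffix
simple_breed <- vapply(
  breeds,
  function(x) gsub(" Mix", "", strsplit(x, split = "/")[[1]][1]),
  FUN.VALUE = character(1),
  USE.NAMES = FALSE
)

print(is_mix)
# 1 1 0 0
print(simple_breed)
# "Labrador Retriever" "Domestic Shorthair" "Pit Bull" "Siamese"
```

Note that this sketch only flags strings containing "Mix"; whether a "/"-separated pair such as "Pit Bull/Boxer" should also count as a mix is a modelling choice the text does not settle.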

    # Plot
    ggplot(intact, aes(x = Intact, y = num_animals, fill = OutcomeType)) +
      geom_bar(stat = 'identity', position = 'fill', colour = 'black') +
      facet_wrap(~AnimalType) +
      coord_flip() +
      labs(y = 'Proportion of Animals',
           x = 'Intact',
           title = 'Outcomes by Intactness: Cats & Dogs') +
      theme_few()

Animals are much more likely to be adopted if they’ve been neutered. Smaller proportions of neutered animals end up euthanized or dying.

Because there were 24 missing values of `AgeinDays`, the next step used the `rpart` function to fit a decision tree that predicts an animal's age in days from the new variables that were created or fixed.

Temporal factors should not have anything to do with animals’ ages at their respective outcomes, so not every variable is thrown into the tree.
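A minimal sketch of this kind of tree-based imputation is shown below. It runs on a small simulated data frame rather than the shelter data, and the choice of predictors (`AnimalType` and `HasName`, both borrowed from the model formula later in this post) is an assumption made for illustration, not the exact set used in the analysis:

```r
library(rpart)

set.seed(1)

# Simulated stand-in data: dogs are given somewhat higher ages than cats
n <- 200
toy <- data.frame(
  AnimalType = factor(sample(c("Cat", "Dog"), n, replace = TRUE)),
  HasName    = sample(0:1, n, replace = TRUE)
)
toy$AgeinDays <- round(runif(n, 10, 1500) + ifelse(toy$AnimalType == "Dog", 500, 0))

# Knock out a few ages to mimic the missing values
toy$AgeinDays[sample(n, 10)] <- NA
missing <- is.na(toy$AgeinDays)

# Fit a regression tree ("anova" method) on the rows where age is known
age_fit <- rpart(AgeinDays ~ AnimalType + HasName,
                 data = toy[!missing, ],
                 method = "anova")

# Fill the gaps with the tree's predictions
toy$AgeinDays[missing] <- predict(age_fit, newdata = toy[missing, ])

sum(is.na(toy$AgeinDays))
```

Each imputed value is the mean age of the leaf that the animal falls into, so the quality of the fill depends on how well the chosen predictors separate ages.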

    # Plot in ggplot2
    ggplot(full[1:26729, ], aes(x = Lifestage, fill = OutcomeType)) +
      geom_bar(position = 'fill', colour = 'black') +
      labs(y = 'Proportion', title = 'Animal Outcome: Babies versus Adults') +
      theme_few()

Unsurprisingly, baby animals are more likely to be adopted than adult animals. They are also more likely to be transferred and to have died.

*A `randomForest` model predicting `OutcomeType`*

    library(randomForest)

    # Split up train and test data
    train <- full[1:26729, ]
    test <- full[26730:nrow(full), ]

    # Set a random seed
    set.seed(731)

    # Build the model
    rf_mod <- randomForest(OutcomeType ~ AnimalType + AgeinDays + Intact + HasName + Hour + Weekday + TimeofDay + SimpleColor + IsMix + Sex + Month,
                           data = train,
                           ntree = 600,
                           importance = TRUE)

    # Show model error
    plot(rf_mod, ylim = c(0, 1))
    legend('topright', colnames(rf_mod$err.rate), col = 1:6, fill = 1:6)
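The variable importance plot relies on a `rankImportance` data frame with `Variables`, `Importance` and `Rank` columns, whose construction is not shown in this post. One plausible way to build it from a fitted forest is sketched here on the built-in `iris` data. The use of `MeanDecreaseGini` as the importance measure and the `rf_demo` name are assumptions made for illustration:

```r
library(randomForest)

set.seed(731)

# Small demonstration forest on built-in data (not the shelter model)
rf_demo <- randomForest(Species ~ ., data = iris,
                        ntree = 100, importance = TRUE)

# importance() returns a matrix with one row per predictor
imp <- importance(rf_demo)

varImportance <- data.frame(
  Variables  = rownames(imp),
  Importance = round(imp[, "MeanDecreaseGini"], 2)
)

# Sort from most to least important and attach a rank label such as "#1"
rankImportance <- varImportance[order(-varImportance$Importance), ]
rankImportance$Rank <- paste0("#", seq_len(nrow(rankImportance)))

print(rankImportance)
```

Swapping `rf_demo` for `rf_mod` would give the ranking for the shelter model, under the same assumption about the importance measure.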

    # Use ggplot2 to visualize the relative importance of variables
    ggplot(rankImportance, aes(x = reorder(Variables, Importance),
                               y = Importance)) +
      geom_bar(stat = 'identity', colour = 'black') +
      geom_text(aes(x = Variables, y = 0.5, label = Rank),
                hjust = 0, vjust = 0.55, size = 4, colour = 'lavender',
                fontface = 'bold') +
      labs(x = 'Variables', title = 'Relative Variable Importance') +
      coord_flip() +
      theme_few()

The most important variable for predicting the outcomes of shelter animals is AgeinDays – and not Intact. Hour of the day and weekday are not doing too badly either. SimpleColor is ranked above Sex and IsMix.

### References

Kaggle Team (2016) ‘Shelter Animal Outcomes’ [Online]. Kaggle. Available from: https://www.kaggle.com/c/shelter-animal-outcomes [Accessed 20th July 2016].

Hornik, K. (2016) ‘Frequently Asked Questions on R’ [Online]. R FAQ. Available from: https://cran.r-project.org/doc/FAQ/R-FAQ.html [Accessed 21st July 2016].

O’Reilly (2016) ‘Cook R Graphics’ [Online]. O’Reilly. Available from: http://www.cookbook-r.com/Graphs/ [Accessed 26th July 2016].

O’Reilly (2016) ‘Try R’ [Online]. O’Reilly. Available from: http://tryr.codeschool.com/ [Accessed 21st July 2016].

The R Foundation (2016) ‘What is R?’ [Online]. The R Foundation. Available from: https://www.r-project.org/about.html [Accessed 21st July 2016].