How to split Train and Test data in R
Today we’ll see how to split data into training and test data sets in R. When building a machine learning model, we train the model on one part of the available data and test its accuracy on the remaining part, which the model has not seen.
There are two ways to split the data and both are very easy to follow:
1. Using the sample() function
#read the data
data <- read.csv("data.csv")
#draw a random 70% of the row numbers (1 to nrow(data)) for the training set
data1 = sort(sample(nrow(data), nrow(data)*.7))
#create the training data set by selecting those rows
train <- data[data1,]
#create the test data set from the remaining rows
test <- data[-data1,]
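Here the sample() function works as sample(x, size, replace): when x is a single number, it draws size values at random from 1 to x, with no repeats unless replace = TRUE. For example: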
> sample(10,7)
[1] 8 4 9 2 7 10 5
We then use the output of sample() as row numbers: those rows go into the training set, and the remaining rows go into the test set.
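The split above gives different rows on every run. As a minimal sketch, assuming the built-in mtcars data set as a stand-in for data.csv and an arbitrary seed of 42, the same steps can be made reproducible and checked like this (the names cars_data, n_train and train_idx are illustrative only):

```r
# Sketch: reproducible 70/30 split with base R's sample().
# mtcars ships with R and stands in for data.csv; the seed is arbitrary.
set.seed(42)
cars_data <- mtcars

# Make the rounding explicit: 70% of 32 rows is 22.4, so take 22 rows
n_train <- floor(0.7 * nrow(cars_data))
train_idx <- sort(sample(nrow(cars_data), n_train))

train <- cars_data[train_idx, ]   # rows picked by sample()
test <- cars_data[-train_idx, ]   # all remaining rows

# Sanity checks: every row is used exactly once
stopifnot(nrow(train) + nrow(test) == nrow(cars_data))
stopifnot(length(intersect(rownames(train), rownames(test))) == 0)
```

Calling set.seed() before sample() means the same rows are selected each time the script runs, which helps when comparing models on the same split.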
2. Using the caTools package
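#load the package
library(caTools)
#read the data
data <- read.csv("data.csv")
#sample.split() expects a vector with one entry per row (ideally the outcome column), not the whole data frame
#data[[1]] is only a placeholder here; replace it with your outcome column
#SplitRatio = 0.7 marks 70% of the rows as TRUE, giving a 70%:30% split
data1 = sample.split(data[[1]], SplitRatio = 0.7)
#subset into training data
train = subset(data, data1 == TRUE)
#subset into test data
test = subset(data, data1 == FALSE)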
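sample.split() takes a vector of labels and tries to keep the proportion of each label the same in both parts. Here is a minimal sketch of that behaviour, assuming caTools is installed and using the built-in iris data set in place of data.csv (the seed and the name in_train are illustrative only):

```r
# Sketch: label-aware 70/30 split with caTools::sample.split().
# Assumes the caTools package is installed; iris stands in for data.csv.
library(caTools)

set.seed(42)  # arbitrary seed, only for reproducibility

# Pass the label column (a vector), not the whole data frame.
# The result is a logical vector with one TRUE/FALSE per row.
in_train <- sample.split(iris$Species, SplitRatio = 0.7)

train <- subset(iris, in_train)    # rows marked TRUE
test <- subset(iris, !in_train)    # rows marked FALSE

# Compare how the species are distributed across the two parts
table(train$Species)
table(test$Species)
```

Because the split is made within each label, the training set should hold about 70% of every species rather than 70% of the rows overall.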
That covers splitting data into training and test data sets. Both methods are easy to follow.
Keep visiting Analytics Tuts for more tutorials.
Thanks for reading! Comment your suggestions and queries.
> sample = sample.split(data,SplitRatio = 0.75)
Error in unique.default(Y) : unique() applies only to vectors